Combinatorial Correlation Clustering

Correlation Clustering is a classic clustering objective arising in numerous machine learning and data mining applications. Given a graph $G=(V,E)$, the goal is to partition the vertex set into clusters so as to minimize the number of edges between clusters plus the number of edges missing within clusters. The problem is APX-hard, and the best known polynomial-time approximation factor is 1.73 by Cohen-Addad, Lee, Li, and Newman [FOCS'23]. They use an LP with $|V|^{1/\epsilon^{\Theta(1)}}$ variables for some small $\epsilon$. However, due to the practical relevance of correlation clustering, there has also been great interest in more efficient sequential and parallel algorithms. The classic combinatorial \emph{pivot} algorithm of Ailon, Charikar, and Newman [JACM'08] provides a 3-approximation in linear time. Like most other algorithms discussed here, it uses randomization. Recently, Behnezhad, Charikar, Ma, and Tan [FOCS'22] presented a $(3+\epsilon)$-approximation in a constant number of rounds in the Massively Parallel Computation (MPC) setting. Very recently, Cao, Huang, and Su [SODA'24] provided a 2.4-approximation in a polylogarithmic number of rounds in the MPC model and in $\tilde{O} (|E|^{1.5})$ time in the classic sequential setting. They asked whether it is possible to get a better-than-3 approximation in near-linear time. We resolve this problem with an efficient combinatorial algorithm providing a drastically better approximation factor. It achieves a $\sim 2-2/13<1.847$-approximation in sub-linear ($\tilde O(|V|)$) sequential time or in sub-linear ($\tilde O(|V|)$) space in the streaming setting. In the MPC model, we give an algorithm using only a constant number of rounds that achieves a $\sim 2-1/8<1.876$-approximation.


Introduction
Correlation clustering is a fundamental clustering objective that models a large number of machine learning and data mining applications. Given a set of data elements, represented as vertices $V$ in a graph $G$, and a set of pairs of similar elements, represented as edges $E$ in the graph, the goal is to find a partition of the vertex set that minimizes the number of missing edges within the parts of the partition plus the number of edges across the parts of the partition. An alternative formulation is that we want a graph consisting of disjoint cliques, minimizing the symmetric difference to the input graph. Below, $n = |V|$ and $m = |E|$.
The problem was originally introduced by Bansal, Blum, and Chawla in the early 2000s [BBC04] and has since found a large variety of applications, ranging from finding clustering ensembles [BGU13], duplicate detection [ARS09], community mining [CSX12], and disambiguation tasks [KCMNT08], to automated labelling [AHK+09, CKP08] and many more.
Thanks to its tremendous modelling success, Correlation Clustering has been widely studied in the algorithm design, machine learning, and data mining communities. From a complexity standpoint, the problem was shown to be APX-hard [CGW05], and Bansal, Blum, and Chawla [BBC04] gave the first O(1)-approximation algorithm. The constant was later improved to 4 by Charikar, Guruswami, and Wirth [CGW05].
In a landmark paper, Ailon, Charikar, and Newman [ACN08] gave two very important pivot-based algorithms. First, they presented a combinatorial pivot algorithm (combinatorial in the sense that it does not rely on solving a linear program) that iteratively picks a vertex uniformly at random among the unclustered vertices and clusters it with all its unclustered neighbors. They showed that this algorithm achieves a 3-approximation (and their analysis is tight). Next, they improved the approximation factor to 2.5 using a standard linear program (LP), which was shown to have an integrality gap of at least 2. They still used a random pivot, but instead of creating a cluster containing all the unclustered neighbors of the pivot, they randomly assigned each vertex to the cluster of the pivot based on the LP solution.
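As an illustration, the pivot rule can be sketched in a few lines of Python (a minimal sketch; the graph representation and function names are our own, not from the paper):

```python
import random

def pivot(n, edges, seed=None):
    """One run of the random-pivot 3-approximation of Ailon, Charikar, and Newman.

    n: number of vertices, labelled 0..n-1.
    edges: iterable of frozensets {u, v} representing similar pairs.
    Returns a list of clusters (sorted lists of vertices).
    """
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    unclustered = set(range(n))
    clusters = []
    while unclustered:
        p = rng.choice(sorted(unclustered))      # pivot: uniform among unclustered
        cluster = {p} | (adj[p] & unclustered)   # pivot plus its unclustered neighbors
        clusters.append(sorted(cluster))
        unclustered -= cluster
    return clusters
```

A single run already gives a 3-approximation in expectation; the output is always a partition of the vertex set.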
A better rounding of this LP of Ailon, Charikar, and Newman [ACN08] was later presented by Chawla, Makarychev, Schramm, and Yaroslavtsev [CMSY15], who achieved a 2.06-approximation still relying on a pivot-based approach to round, hence coming close to an optimal rounding given the LP's integrality gap of at least 2.
Since the integrality gap of this LP is at least 2, the value 2 has appeared as a strong barrier for approximating Correlation Clustering, and 3 has remained the best-known non-LP-based approximation to this day. Recently, different LP formulations have been used to get better-than-2 approximations. Cohen-Addad, Lee, and Newman [CLN22] showed that the Sherali-Adams hierarchy helps bypass the standard LP integrality gap of 2 by providing a 1.995-approximation, which was very recently improved to 1.73 by Cohen-Addad, Lee, Li, and Newman [CLLN23] using a new linear programming formulation, also relying on the Sherali-Adams hierarchy. However, these improvements have come at an even greater computational cost: while the pivot-rounding algorithm requires solving a linear program on $n^2$ variables, the above two algorithms require $n^{1/\varepsilon^{\Theta(1)}}$ variables and running time for some small $\varepsilon$.

Efficient algorithms
Motivated by the large number of applications in machine learning and data mining, researchers have put an intense effort into efficiently approximating Correlation Clustering: from obtaining linear-time algorithms [ACN08], to dynamic algorithms [BDH+19], distributed algorithms (Map-Reduce or Massively Parallel Computation (MPC) models) [BCMT22, CLM+21], streaming [BCMT23a, CM23a, CKL+24, AW22, CLM+21], and sublinear time [AW22]. In most of these models, the LP-based approaches have remained unsuccessful, and the state-of-the-art algorithm remains the combinatorial pivot algorithm. Arguably, the modularity and simplicity of the combinatorial pivot algorithm and its analysis have been key to obtaining these results.
Thus, the question of how fast, how well, and in which models one can approximate Correlation Clustering better than a factor 3 has naturally emerged. At one end of the spectrum, we have the above 1.73-approximation in time $n^{1/\varepsilon^{\Theta(1)}}$. At the other end of the spectrum, Cohen-Addad, Lattanzi, Mitrovic, Norouzi-Fard, Parotsidis, and Tarnawski [CLM+21] and Assadi and Wang [AW22] showed that Correlation Clustering can be approximated to an O(1)-factor in linear time (i.e., O(m) time) and in fact in sublinear time (i.e., $n \log^{O(1)} n$) using random sampling of neighbors. However, the constant proven in these works is larger than 500. In a very recent work, [CHS24] showed that one can approximately solve and round the standard LP in time $O(m^{1.5})$ and achieve a 2.4-approximation, coming close to the linear time bound. In the streaming model, [BCMT22] gave a $(3+\varepsilon)$-approximation using $O(1/\varepsilon)$ passes. Later, [BCMT23a] gave a 1-pass algorithm achieving a 5-approximation. Very recently, [CKL+24] and [CM23b] independently provided a 1-pass $(3+\varepsilon)$-approximation. [CHS24] thus asks: how fast, and in which models, can one approximate Correlation Clustering within a factor 3 and beyond?

Our Results
Our main contribution is a novel combinatorial algorithm that we show achieves a $(2-2/13)$-approximation. Our new algorithm alternates between performing a local search and a certain flip. The flip takes all cut edges in the current solution and aims to make them internal by increasing the cost of cutting them in the objective for the next local search. One can think of this as a very systematic method to escape a bad local minimum, leading to strong theoretical guarantees.
We highlight that although the number of local search iterations could be close to linear in the worst case, we show that, with a minor loss in the approximation factor, the whole local search can be implemented in near-linear total time.
In fact, using random sampling of neighbors, we can implement it in sublinear time, improving over the state-of-the-art sublinear-time >500-approximation [AW22, CLM+21]. On top of that, we can implement it in the streaming model. A variation of this algorithm can be implemented in the MPC model with a constant number of rounds, achieving a $(2-1/8)$-approximation.
In addition, for any $\delta = \Omega(1)$, there is an algorithm that, w.h.p., achieves a $(2-1/8+\varepsilon)$-approximation and can be implemented to terminate in $(\varepsilon\delta)^{-O(1)}$ rounds in the MPC model with $n^\delta$ memory per machine and total memory $O(\varepsilon^{-O(1)} m \log n)$.
Fixing $\varepsilon = 0.0008$, we get an approximation factor of $2 - 2/13 + \varepsilon < 1.847$ and $2^{\varepsilon^{-O(1)}} = O(1)$, though with a very large hidden constant. Our approach thus opens up the possibility of achieving a better-than-2 approximation in other models such as dynamic or CONGEST, and it is an interesting direction to see whether this can be achieved.

Techniques
The local search framework we consider maintains a clustering. At each iteration, the algorithm tries to swap in a cluster (i.e., a set of vertices of arbitrary size), creating it as a new cluster and removing its elements from the clusters they belonged to. A swap is improving if the resulting clustering has a smaller cost, and we keep performing improving swaps until no more are found.
Of course, as described, the algorithm would have a running time exponential in the input size, since it needs to consider all subsets of vertices as potential swaps; but let us first set aside the running time and focus on the approximation guarantee. We first show that the algorithm achieves a 2-approximation; this serves as a backbone for our further improvements. We then discuss how to implement the above algorithm in polynomial time while losing only a $(1+\varepsilon)$ factor in the approximation guarantee.
Achieving a 2-approximation using a simple (exponential-time) local search. The analysis of the above algorithm consists of the following steps. Let S be the solution output by the algorithm and Opt the optimum solution. For each cluster C in Opt, we consider the solution $S_C$ created from S by inserting C as a cluster (i.e., by creating cluster C and removing its elements from the clusters they belong to in S).
By local optimality, we know that the cost of S is no larger than the cost of $S_C$. The difference in cost between the two solutions only comes from pairs with at least one vertex in C. More concretely, denote by $E^+_S$ the set of edges cut by S and by $E^-_S$ the set of non-edges that are not cut by S, so that the cost of S equals $|E^+_S| + |E^-_S|$. Then the cost of $S_C$ minus the cost of S is nonnegative and can be written as a sum over pairs with at least one endpoint in C. This inequality holds for any cluster C of the optimum solution, and summing it over all these clusters shows that the cost of the local search solution is no more than twice the cost of the optimum solution: each edge cut by Opt is counted at most twice (once for each endpoint's cluster), and every other pair at most once. This hence provides a simple (exponential-time) 2-approximation algorithm for Correlation Clustering.
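To make the objects in this argument concrete, here is a small Python sketch of the cost function and of the swap $S \mapsto S_C$ (a toy sketch; representations and names are our own):

```python
from itertools import combinations

def cost(clusters, edges):
    """Disagreements: edges cut across clusters plus missing edges inside clusters."""
    edge_set = {frozenset(e) for e in edges}
    label = {v: i for i, C in enumerate(clusters) for v in C}
    cut = sum(1 for e in edge_set if label[min(e)] != label[max(e)])
    missing = sum(1 for C in clusters
                  for u, v in combinations(C, 2)
                  if frozenset((u, v)) not in edge_set)
    return cut + missing

def swap_in(clusters, C):
    """S_C: remove the vertices of C from their clusters, then add C as a new cluster."""
    Cset = set(C)
    rest = [[v for v in D if v not in Cset] for D in clusters]
    return [D for D in rest if D] + [sorted(Cset)]
```

For example, on a triangle, the all-singletons solution has cost 3, while swapping in the whole triangle as one cluster drops the cost to 0.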
Equipped with the above analysis, our work then moves in two complementary directions: (1) improving over the approximation bound of 2 with certain flips; and (2) showing how to implement the above algorithm while losing at most a $(1+\varepsilon)$ factor in the overall guarantee.

Flips between local searches
Let us first focus on (1). Looking at the above analysis, one can note where the factor of 2 arises when summing over all clusters: each edge $(u, v) \in E \setminus E^+_S$ with $u \in C$ and $v \in C' \neq C$ in the optimum solution appears twice in the sum, once when upper bounding the cost of the vertices in C and once when upper bounding the cost of the vertices in C'. On the other hand, the ratio coming from either the non-edges or the pairs $(u, v) \in E^+_S \cup E^-_S$ that are also paid by Opt is in fact 1. It follows that the solution S obtained by local search is only close to a 2-approximation if, say, a .99-fraction of the cost of the optimum solution is due to the edges (and not the non-edges) cut by the optimum solution that are not in $E^+_S$. Hence, for the ratio of 2 to be tight, the solution S, the optimum solution, and the underlying graph must be very peculiar and satisfy the following:
• a .99-fraction of the cost of the optimum solution is due to edges (and not non-edges);
• S and the optimum solution share at most a .01-fraction of their cost (i.e., the intersection of the edges and non-edges paid by Opt and by S accounts for at most a .01-fraction of their overall cost).
Under these assumptions, Opt and S share very few cut edges, so to get closer to Opt it makes sense to flip the cut edges of S, seeking a solution where they are no longer cut. We do this by increasing the cost of cutting these edges.
The above may seem a bit naive, but it turns out to yield an approximation factor substantially below 2. More precisely, after having found the first local search solution S, we double the weight of the cut edges of S; this is called the flip. Then we run a new local search on this modified input, yielding a new local optimum S'. Finally, we return whichever of S and S' minimizes the original cost. Perhaps surprisingly, this local-search-flip-local-search always provides a 15/8-approximation. We provide a description of this algorithm and its approximation ratio as a warm-up in Section 3.1.

Iterated-flipping Local Search
We then go one step beyond and analyze the bad instances for the solution output after we perform the flip (i.e., the reweighting of the edges cut by the initial solution S). Our first new ingredient is to iterate the flip process with different adaptive weights on the edges cut by the solutions that the local search algorithm outputs on the different weighted graphs (the motivation is again that the bad case in these scenarios is when the above two assumptions are nearly satisfied). In this case, we use the sequence of solutions found by applying the local search to further and further refine our weighting scheme. The second key ingredient is a procedure that pivots over three clusterings to output a new clustering that combines them. The pivot procedure iteratively creates a new clustering: it takes the largest set of vertices that are together in all three clusterings, creates a new cluster consisting of the vertices that are together with the chosen ones in at least two out of the three clusterings, and repeats on the remaining vertices. We then build a sequence of solutions that iteratively modify the weights of the edges of the graph and pivot on the last three solutions created. We show that the best solution among the ones created achieves our final bound of $2 - 2/13$.

Efficient implementations
In the previous paragraph, we have worked with the idealistic local search algorithm that can swap in and out clusters of arbitrary sizes.
A striking phenomenon here is that it is impossible to approximate the cost of the optimum solution up to any o(n) factor on graphs with n vertices in the sublinear memory regime; hence all previous algorithms in this model output a solution without estimating its cost. In light of this, it may seem hopeless to use a local-search-based approach, which requires estimating the cost of swapping clusters in and out to improve the current solution. Yet, we show that, thanks to the coarse preclustering, sampling techniques suffice to estimate the cost of potential swaps and implement our approach.
Another crucial feature of our local-search-based approach is that it can also be implemented in a distributed environment, allowing local search iterations to be performed in parallel.
We now discuss the main ideas that led to achieving a linear running time, and derive algorithms for the sublinear-time and massively parallel computation models. Let us focus on implementations of the simple local search presented at the beginning of this section. To implement our approach, we need the following key primitive: given a clustering S, is there a cluster C such that the solution $S_C$ has a better cost than S? Assume first that the optimum solution has cost at least ε times the total number of edges of the graph. In this case, we can tolerate an additive cost proportional to $\varepsilon^2$ times the sum of the degrees of the vertices in C, and our analysis is unchanged up to an additive term of 2ε times the cost of Opt.
In this context, we use the fact that the clusters of any optimum solution are dense (i.e., each cluster C of Opt contains at least |C|(|C| − 1)/2 − 1 edges, since otherwise the vertices could be placed into singletons and the cost would improve). This means that we can pick uniformly at random a set of, say, poly(1/ε) vertices from the cluster, and this provides an accurate estimate of the fraction of neighbors in the cluster for any other vertex v in the graph, up to an additive ε|C| term. Moreover, we show that there exists a near-optimal solution Opt* where each vertex in C has degree proportional to |C|. Our algorithm then samples a set of poly(1/ε) vertices from C and places them in a tentative cluster C'. It then iterates over a set of candidate vertices that could tentatively be included in C' (and that contains the vertices of C). We would then like to show that the total number of mistakes made by the algorithm is proportional to the sum of the degrees of the vertices in C. To achieve this, we require two more ingredients: (1) the candidate vertices have degree proportional to |C| and their number is proportional to |C|; and (2) these greedy steps are consistent with the uniform sample we have selected, namely, the uniform sample we use to make our decisions is also a uniform sample of C' (since otherwise the decisions made are not relevant w.r.t. the final cluster produced).
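The sampling step can be illustrated as follows (a toy sketch; the sample size $4/\varepsilon^2$ is an arbitrary stand-in for the paper's poly(1/ε), and the data layout is our own):

```python
import random

def estimate_inside_fraction(v, C, adj, eps, seed=0):
    """Estimate |N(v) ∩ C| / |C| for vertex v from a uniform sample of C.

    C: candidate cluster (a set of vertices); adj[u]: set of neighbors of u.
    A sample of size O(1/eps^2) gives the fraction up to additive eps w.h.p.
    """
    rng = random.Random(seed)
    k = min(len(C), max(1, int(4 / eps ** 2)))
    sample = rng.sample(list(C), k)          # uniform sample without repetition
    return sum(1 for u in sample if u in adj[v]) / k
```

Multiplying the returned fraction by |C| estimates |N(v) ∩ C| up to an additive ε|C| term, which is the guarantee used in the text.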
To achieve (1), and to ensure that we can tolerate an error of ε times the number of edges in the instance, we use the Preclustering algorithm recently introduced in [CLLN23], which merges vertices that provably can be clustered together in a near-optimal solution.
To achieve (2), we use an iterated approach where the cluster is constructed in batches of size $\varepsilon^q|C|$ for some constant q, and the sample is "updated" after each batch. Decisions made for the i-th batch are based on a sample obtained by sampling the set of vertices that have already joined C' and the set of vertices that we would like to join C'. The mistakes for the i-th batch are thus proportional to $\varepsilon^{2q}|C|$, and since the total number of batches is $O(1/\varepsilon^q)$ (thanks to (1)), the final number of mistakes is $\varepsilon^q|C|$.
Finally, we show that the algorithm can be implemented in linear time, and in fact in the sublinear-time and MPC models. The key observation is that when looking for a new cluster C to swap in, one can focus on a vertex v of C and look at the set of vertices with degree similar to v that are adjacent to v or to a neighbor of v of similar degree. The size of this set is proportional to the degree of v, and hence proportional to the cluster C. Thus, in time proportional to the degree of v, one can run the above procedure to build the cluster C.

Further Related Work
If the number of clusters is a hard constraint and a constant, then a polynomial-time approximation scheme exists [GG06, KS09] (the running time depends exponentially on the number of clusters). As hinted at in the introduction, there is a large body of work on Correlation Clustering in other computation models, including online algorithms. In the case where the edges of the graph are arbitrarily weighted, so that the cost induced by violating an edge is its weight, the problem is as hard to approximate as the Multicut problem (and in fact "approximation-equivalent" to it), and an O(log n)-approximation is known [DEFI06]. The work of Chawla, Krauthgamer, Kumar, Rabani, and Sivakumar thus implies that there is no polynomial-time constant-factor approximation algorithm for weighted Correlation Clustering assuming the Unique Games Conjecture [CKK+06].
If we flip the objective and aim at maximizing the number of non-violated edges, a PTAS was given by Bansal, Blum, and Chawla [BBC04], and a .77-approximation for the case where the edges of the graph are weighted was given by Charikar, Guruswami, and Wirth [CGW05] and Swamy [Swa04].

Preliminaries
Graph notation. We will work with simple undirected graphs. For a graph G, by V(G) and E(G) we denote its vertex and edge set, respectively. If the graph is clear from the context, we drop it from the notation. Correlation clustering. A clustering scheme in G is a partition of V(G). Every set in a clustering scheme is called a cluster. For a clustering scheme $\mathcal{C}$, let $\mathrm{Int}(\mathcal{C})$ be the set of pairs of vertices that are in the same cluster of $\mathcal{C}$ and $\mathrm{Ext}(\mathcal{C})$ be the set of edges across different clusters.
For a weight function w, $\mathrm{Cost}_w(\mathcal{C})$ denotes the total cost of $\mathcal{C}$: the total weight of the edges in $\mathrm{Ext}(\mathcal{C})$ plus the total weight of the non-edges in $\mathrm{Int}(\mathcal{C})$. We drop the subscript if the weight function w is clear from the context.
The input to Correlation Clustering is a simple undirected graph G, with additionally a weight function w in the weighted variant.We ask for a clustering scheme of the minimum possible cost.
We will also use the notation Int and Ext for individual clusters $C \subseteq V(G)$: $\mathrm{Int}(C)$ is the set of pairs of vertices of C, and $\mathrm{Ext}(C)$ is the set of edges with exactly one endpoint in C. Preclustering. Let G be an unweighted instance of Correlation Clustering and let Opt be a clustering scheme of minimum cost. Note that every $v \in V(G)$ is in a cluster of size at most 2d(v) + 1, as otherwise it would be cheaper to cut v out into a single-vertex cluster. However, v can be put in Opt in a cluster much smaller than d(v).
We can ensure that v is put in a cluster of size comparable with d(v), allowing a small slack in the quality of the solution. Let ε > 0 be an accuracy parameter. Note that if the cluster of v in Opt has size smaller than $\frac{\varepsilon}{1+\varepsilon} d(v)$, then one can cut v out into a single-vertex cluster and charge the cost to at least $\frac{1}{1+\varepsilon} d(v)$ already-cut edges incident to v. That is, there exists a (1 + ε)-approximate solution with the following property: every $v \in V(G)$ is either in a single-vertex cluster or in a cluster of size between $\frac{\varepsilon}{1+\varepsilon} d(v)$ and 2d(v) + 1; for fixed ε > 0, this is of order d(v). The preprocessing introduced in [CLLN23] takes the above approach further and also identifies some pairs of vertices that can safely be assumed to be together or separate in a (1 + ε)-approximate clustering scheme. The crux lies in the following guarantee: the number of vertex pairs of "unknown" status is comparable with the cost of the optimum solution.
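The charging step can be made explicit; in LaTeX form (our own arithmetic, spelling out the sentence above):

```latex
% Let s be the size of v's cluster in Opt, with s < \frac{\varepsilon}{1+\varepsilon} d(v).
\begin{align*}
\#\{\text{already-cut edges at } v\}
  &\ge d(v) - (s - 1)
   > d(v) - \tfrac{\varepsilon}{1+\varepsilon}\, d(v)
   = \tfrac{1}{1+\varepsilon}\, d(v),\\
\#\{\text{new disagreements from cutting out } v\}
  &\le s - 1
   < \tfrac{\varepsilon}{1+\varepsilon}\, d(v)
   = \varepsilon \cdot \tfrac{1}{1+\varepsilon}\, d(v).
\end{align*}
% Hence the extra cost charged to v is at most an \varepsilon-fraction of the cut
% edges v already pays for, so all such cuts together lose a (1+\varepsilon) factor.
```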
We now formally state the outcome of this preprocessing.
Definition 2. For an unweighted Correlation Clustering instance G, a preclustered instance is a pair $(\mathcal{K}, E_{adm})$, where $\mathcal{K}$ is a family of disjoint subsets of V(G) (not necessarily a partition), and $E_{adm} \subseteq \binom{V(G)}{2}$ is a set of pairs called admissible pairs such that for every $uv \in E_{adm}$, at least one of u and v is not in $\mathcal{K}$.
Each set $K \in \mathcal{K}$ is called an atom. We use $V_{\mathcal{K}} := \bigcup \mathcal{K}$ to denote the set of all vertices in atoms. A pair $uv \in \binom{V(G)}{2}$ with u and v in the same $K \in \mathcal{K}$ is called an atomic pair. An edge (u, v) between two vertices u, v in the same $K \in \mathcal{K}$ is called an atomic edge. A pair that is neither an atomic pair nor an admissible pair is called a non-admissible pair.
For a vertex u, let $d_{adm}(u)$ denote the number of $v \in V(G)$ such that uv is an admissible pair. Note that $\sum_{u \in V(G)} d_{adm}(u) = 2|E_{adm}|$.
Definition 3. Let G be an unweighted Correlation Clustering instance, let $(\mathcal{K}, E_{adm})$ be a preclustered instance for G, and let ε > 0 be an accuracy parameter. We say that $(\mathcal{K}, E_{adm})$ is an ε-good preclustered instance if the following holds: 3. for every atom $K \in \mathcal{K}$ and every vertex $v \in K$, v is adjacent to at least a (1 − O(ε)) fraction of the vertices of K and has at most O(ε|K|) neighbors outside K.
Therefore, in a preclustered instance, the set $\binom{V(G)}{2}$ is partitioned into atomic, admissible, and non-admissible pairs. By the definition of $E_{adm}$, a pair uv between two different atoms is non-admissible.
Definition 4 (Good clusters and good clustering schemes). Let G be an unweighted Correlation Clustering instance, let $(\mathcal{K}, E_{adm})$ be a preclustered instance for G, and let ε > 0 be an accuracy parameter. Assume that:
• for any u in C and any v, if uv is an atomic pair, then v is also in C;
• for any distinct u, v in C, uv is not a non-admissible pair.
Moreover, a clustering scheme $\mathcal{C}$ is called (ε, δ)-good with respect to $(\mathcal{K}, E_{adm})$ if all clusters $C \in \mathcal{C}$ are (ε, δ)-good. We will also say a cluster/clustering scheme is ε-good when δ is clear from the context. That is, a good cluster cannot break an atom or join a non-admissible pair. As a result, two atoms cannot be in the same cluster, as the pairs between them are non-admissible.
We consider the following algorithm, due to [CLLN23], to compute a preclustering. Preclustering(G, ε). Each non-singleton cluster is an atom; let $\mathcal{K}$ be the set of atoms produced.
(2) u and v are degree-similar, and (3) the number of common neighbors that are degree-similar to both u and v is at least $\varepsilon \cdot \min\{d(u), d(v)\}$.
We let $E_{adm}$ be the set of all admissible pairs in $\binom{V(G)}{2}$.
The following theorem is in fact a restatement from [CLLN23]. The only addition over Theorem 4 of [CLLN23] (arXiv version) is that we require the instance to be ε-good, namely that the three bullets of Definition 3 hold. The first two bullets are shown to be true in the proof of Theorem 4 of [CLLN23] (arXiv version), while the last bullet follows from the description of Algorithm 1 of [CLM+21].

Sublinear, streaming and MPC models
We further define the sublinear, streaming, and MPC models. Let G = (V, E) be a graph.
Definition 6. In the sublinear model, the algorithm can query the following information in O(1) time:
• Degree queries: what is the degree of vertex v?
• Neighbor queries: what is the i-th neighbor of $v \in V$, for i at most the degree of v?
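For concreteness, the two query types can be mocked over an adjacency-list representation (a toy sketch; the interface and class name are our own):

```python
class GraphOracle:
    """O(1)-time degree and neighbor queries over an adjacency-list graph."""

    def __init__(self, adj):
        # adj: dict mapping each vertex to an iterable of its neighbors
        self.adj = {v: sorted(ns) for v, ns in adj.items()}

    def degree(self, v):
        return len(self.adj[v])

    def neighbor(self, v, i):
        """The i-th neighbor of v, 1-indexed, for 1 <= i <= degree(v)."""
        return self.adj[v][i - 1]
```

A sublinear-time algorithm interacts with the graph only through these two methods, never reading the full edge list.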
Definition 7. In the streaming model, the algorithm knows the vertices of the graph in advance, and the edges arrive in a stream. The algorithm only has $O(n\,\mathrm{polylog}(n))$ space and can read the stream only once.
In the MPC model, the set of edges is distributed to a set of machines.Computation then proceeds in synchronous rounds.In each round, each machine first receives messages from the other machines (if any), then performs some computation based on this information and its own internal allocated memory, and terminates the round by sending messages to other machines that will be received by the machine at the start of the next round.Each message has size O(1) words.Each machine has limited local memory, which restricts the total number of messages it can receive or send in a round.
For the Correlation Clustering problem, the output requirement is that at the end of the final computation round, each machine stores in its local memory the cluster IDs of the vertices to which its initial edges are adjacent.
We consider the strictly sublinear MPC regime where each machine has local memory O(n δ ), where δ > 0 is a constant that can be made arbitrarily small, and Õ(m) total space.
The theorem below is immediate from Theorem 1 in [CLM+21]. The following sampling techniques will be used in our implementations in the sublinear and streaming models.

Lemma 9 ([AW22]). In O(n log n) time (in the sublinear model) or in O(n log n) space (in the streaming model), we can sample Θ(log n) neighbors (with repetition) of each vertex v.

Lemma 10 ([AW22]). In O(n log n) time (in the sublinear model) or, with high probability, in O(n log n) space (in the streaming model), we can sample each vertex v with probability Θ(log n/d(v)) and store all edges incident to sampled vertices.
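The sampling scheme of Lemma 10 can be sketched as follows (a toy sketch; the constant c is a placeholder for the Θ(·) constant, and the data layout is our own):

```python
import math
import random

def degree_inverse_sample(adj, c=4, seed=0):
    """Keep each vertex v independently with probability min(1, c*log(n)/d(v));
    return the sampled vertices together with all edges incident to them.

    adj: dict mapping each vertex to a list of its neighbors.
    """
    rng = random.Random(seed)
    n = len(adj)
    sampled = [v for v in adj
               if adj[v] and rng.random() < min(1.0, c * math.log(n) / len(adj[v]))]
    # Store every edge touching a sampled vertex, as an ordered pair.
    stored = {(min(u, v), max(u, v)) for v in sampled for u in adj[v]}
    return sampled, stored
```

In expectation, each vertex contributes O(log n) stored edges, giving the O(n log n) space bound of the lemma.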
We remark that in the streaming model, we can also obtain several independent realizations of the above sampling schemes in one pass, at a corresponding increase in the space complexity.
The next theorem follows from the decomposition theorem and the framework of Assadi and Wang [AW22, CLM+21, CLLN23].
Theorem 11 ([AW22, CLM+21, CLLN23]). In $O(n \log^2 n)$ time (in the sublinear model) or in O(n log n) space (in the streaming model), one can compute a partition $\mathcal{K}$ of the vertices and a data structure such that there exists an ε-good preclustering $(\mathcal{K}, E_{adm})$ satisfying the conditions of Theorem 5, and such that, with success probability at least $1 - 1/n^2$, for any pair of vertices the data structure can answer in O(log n) time whether the pair is in $E_{adm}$ or not, and for any vertex v the data structure can list all vertices admissible to v.
The proof of the above statement follows from the following observation. The set $\mathcal{K}$ is partitioned into the disjoint dense vertex sets computed by the framework of Assadi and Wang. We initialize one realization of Lemma 10, which samples each vertex v with probability $40\varepsilon^{-2} \log n/d(v)$. With high probability, each vertex v has only O(log n) degree-similar sampled neighbors. Then, when the data structure is queried with a pair u, v, it can implement the second step of the preclustering by (1) verifying that neither u nor v belongs to $\mathcal{K}$ in O(1) time, (2) verifying that u and v are degree-similar with two degree queries, and (3) looking at the degree-similar sampled neighbors of u and v in total time O(log n) and verifying that the intersection has size at least $\varepsilon \min\{d(u), d(v)\}/2$, as in step (3) of the Preclustering algorithm. An immediate application of the Chernoff bound implies that the probability of misreporting that a pair is not admissible is at most $1/n^4$; by a union bound, with probability $1 - 1/n^2$ the data structure correctly answers all queries. On the other hand, the number of edges reported is the same as in Theorem 5 applied with ε halved. Finally, when we want to list all vertices admissible to v, we look at every degree-similar sampled neighbor u of v, and for each neighbor of u, we query whether it is admissible to v.
Given that the data structure correctly answers all queries, we must find all candidates in this way, and the total number of queries we make is bounded as required.

Local Search with Flips
In this section, we present and analyse a few local search based approaches to Correlation Clustering.
First, we present a simple warm-up $(2 - \frac{1}{8})$-approximation. Here, we will not discuss the implementation of a local search step, only properties of a local optimum. Then, we present a more involved $(2 - \frac{2}{13} + \varepsilon)$-approximation, where the analysis also leaves placeholders for our implementation of a local search step. This implementation will be via sampling, and can only guarantee being in a "(1 + ε)-approximate" local optimum.
Let $\mathcal{C}$ be a clustering scheme in a graph G with weight function w, and let $C \subseteq V(G)$. By $\mathcal{C} + C$ we denote the clustering scheme obtained from $\mathcal{C}$ by first removing the vertices of C from all clusters and then creating a new cluster C. We will analyse the following algorithm.

Local-Search-Flip-Local-Search
• Let Ls1 be a local optimum of (G, w).
• Double the weight of the edges in $E^+ \cap \mathrm{Ext}(\mathrm{Ls1})$ to get a new weight function $w'$.
• Let Ls2 be a local optimum of (G, w ′ ).
• Output the best of Ls1 and Ls2 with respect to the original cost function.
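The three steps above can be sketched end to end in Python (a toy sketch: for tractability, the local search is restricted to single-vertex moves, a simplification of the idealized cluster swaps; weights, names, and representations are our own):

```python
from itertools import combinations

def cost(labels, edges, w):
    """Weighted disagreements: w(e) per cut edge e, plus 1 per missing intra-cluster pair."""
    n = len(labels)
    edge_set = set(edges)
    c = sum(w[e] for e in edges if labels[e[0]] != labels[e[1]])
    c += sum(1 for u, v in combinations(range(n), 2)
             if labels[u] == labels[v] and (u, v) not in edge_set)
    return c

def local_search(labels, edges, w, rounds=50):
    """Greedy improving moves of single vertices until no move helps."""
    labels = list(labels)
    for _ in range(rounds):
        improved = False
        for v in range(len(labels)):
            best, best_l = cost(labels, edges, w), labels[v]
            for l in set(labels) | {max(labels) + 1}:  # join any cluster, or go singleton
                old = labels[v]
                labels[v] = l
                if cost(labels, edges, w) < best:
                    best, best_l = cost(labels, edges, w), l
                labels[v] = old
            if best_l != labels[v]:
                labels[v], improved = best_l, True
        if not improved:
            break
    return labels

def ls_flip_ls(n, edges):
    """Local-Search-Flip-Local-Search on an unweighted instance."""
    w1 = {e: 1 for e in edges}
    ls1 = local_search(list(range(n)), edges, w1)
    # Flip: double the weight of every edge cut by ls1.
    w2 = {e: (2 if ls1[e[0]] != ls1[e[1]] else 1) for e in edges}
    ls2 = local_search(ls1, edges, w2)
    return min(ls1, ls2, key=lambda l: cost(l, edges, w1))  # best under original cost
```

Edges are tuples (u, v) with u < v, and a clustering is a list of labels indexed by vertex.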
The remainder of this section is devoted to the proof of the following theorem.
Theorem 13. The Local-Search-Flip-Local-Search algorithm applied to an instance with uniform weights (w ≡ 1) outputs a solution of cost within a $(2 - \frac{1}{8})$ ratio of the optimum solution.
We start the proof with a few observations about local optima.
Lemma 14. Let Ls be a local optimum and let $\mathcal{C}$ be a clustering scheme. Then, the following holds.
Proof. Since Ls is a local optimum, for every cluster C of $\mathcal{C}$ it holds that Cost(Ls + C) ≥ Cost(Ls).
Summing the above over every cluster C of $\mathcal{C}$ proves (1), as every pair of $\mathrm{Ext}(\mathcal{C})$ appears twice and every pair of $\mathrm{Int}(\mathcal{C})$ appears once in the summands. Inequality (2) follows from (1) by simple algebraic manipulation. Inequality (3) is just (2) with the nonnegative term $w(E^- \cap \mathrm{Int}(\mathrm{Ls}) \cap \mathrm{Ext}(\mathcal{C}))$ dropped. For (4), we use the immediate estimate (5); inequality (4) then follows from (2) by adding (5) to both sides and simplifying.
Let $\delta = \frac{1}{8}$. In the proof, it is instructive to think of δ as a small constant; only at the end will we observe that setting $\delta = \frac{1}{8}$ gives the best result. Let Opt be an optimum solution. We say that a clustering scheme is α-competitive if its cost is at most α times Cost(Opt). The structure of the proof of Theorem 13 is as follows: assuming that both Ls1 and Ls2 are not (2 − δ)-competitive, we construct a solution that is better than 8δ-competitive. This finishes the proof, since with $\delta = \frac{1}{8}$ we would obtain a solution strictly cheaper than the optimum one, a contradiction.
To carry out this plan, we analyse in detail the costs of various sets of (non)edges of G using the inequalities of Lemma 14.
Here we use Cost′ to denote the cost on the instance (G, w′). By the definition of w′, for any clustering scheme C, equation (6) holds. We remark that Opt still denotes the optimal solution of (G, w), and the "−" cost of any clustering scheme is the same in (G, w′).
Proof. By Equation (6), we have the first estimate below. By applying (4) to Ls2 and Opt in (G, w′), we obtain the second. Here, the last inequality follows from the fact that all weights of w are nonnegative and that, after the flip, w′ is w with some weights doubled.
Combining them, we get the desired inequality. The lemma follows.
We combine the estimates of Lemmata 15 and 16 into the following.
Lemma 17. If both Ls1 and Ls2 are not (2 − δ)-competitive, then the following holds.
Proof. By Lemma 16, the left-hand side is at most a quantity which, by Lemma 15, is at most 4δ · Cost(Opt).
We now use Lemma 17 to exhibit a clustering scheme of cost strictly less than 8δ · Cost(Opt).
To this end, we will need the following Pivot operation. Let Cx, Cy, Cz be three clustering schemes of G. Within each clustering scheme, number the clusters with positive integers. With each vertex of G we associate a 3-dimensional vector (x, y, z) ∈ Z_+^3 if it is in the x-th cluster of Cx, the y-th cluster of Cy, and the z-th cluster of Cz. We define the distance between two vertices u and v, denoted d(u, v), to be the Hamming distance of their coordinate vectors.
The operation Pivot(Cx, Cy, Cz) produces a new clustering scheme C as follows. Initially, all vertices of G are unassigned to clusters, that is, we start with C = ∅. While not all vertices of G are assigned to clusters, we define a new cluster and add it to C. The new cluster is found as follows: we take a coordinate vector (x, y, z) with the maximum number of unassigned vertices assigned to it (breaking ties arbitrarily), and create a new cluster consisting of all unassigned vertices within distance at most 1 from (x, y, z). This concludes the definition of the Pivot operation.
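The Pivot operation can be sketched directly from this definition; the representation below (each clustering scheme as a map vertex → cluster id) is illustrative:

```python
def pivot(cx, cy, cz, vertices):
    """Sketch of Pivot(Cx, Cy, Cz): repeatedly take the coordinate vector
    with the most unassigned vertices and form a cluster of everything
    within Hamming distance <= 1 of it."""
    coord = {v: (cx[v], cy[v], cz[v]) for v in vertices}
    unassigned = set(vertices)
    clusters = []
    while unassigned:
        # most-populated coordinate vector among unassigned vertices
        counts = {}
        for v in unassigned:
            counts[coord[v]] = counts.get(coord[v], 0) + 1
        p = max(counts, key=counts.get)
        ham = lambda a, b: sum(x != y for x, y in zip(a, b))
        new = {v for v in unassigned if ham(coord[v], p) <= 1}
        clusters.append(new)
        unassigned -= new
    return clusters
```

Each iteration removes at least one vertex, so the loop runs at most |V(G)| times.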
The following lemma encapsulates the crux of the analysis of the Pivot operation.
Lemma 18. Let G be an unweighted Correlation Clustering instance, let Cx, Cy, Cz be three clustering schemes, and let C := Pivot(Cx, Cy, Cz). Divide each of E⁺ and E⁻ into 4 sets by distance: E⁺_i denotes the set of pairs of E⁺ at distance i, and E⁻_i denotes the set of pairs of E⁻ at distance i. Call pairs in E⁺_0 ∪ E⁺_1 ∪ E⁻_3 normal, and call all other pairs special. Then, the number of normal pairs accounted for in Cost(C) is not larger than the number of special pairs.
Proof. The first claim follows immediately by definition: two vertices at distance 0 will always be put into the same cluster, while two vertices at distance 3 will never be put into the same cluster.
Hence, all pairs of E⁻ ∩ Int(C) are special, and the only edges of E⁺ ∩ Ext(C) that are normal are those from E⁺_1. Observe that, while creating a cluster with pivot (x, y, z), an edge uv ∈ E⁺_1 ends up in Ext(C) if d(u, (x, y, z)) = 1 and d(v, (x, y, z)) = 2 (or vice versa). We will charge such edges to pairs of E_2 := E⁻_2 ∪ E⁺_2 with at least one endpoint in the newly created cluster. Note that E_2 consists only of special pairs. Consider a step of the Pivot(Cx, Cy, Cz) operation creating a cluster C with pivot (x, y, z). To finish the proof of the lemma, it suffices to show that the number of edges uw ∈ E⁺_1 with u ∈ C and w outside C but still unassigned in the graph is not larger than the number of pairs u′w′ ∈ E_2 with |{u′, w′} ∩ C| ≥ 1 (i.e., the number of pairs of E_2 that are removed when the cluster C is removed). To this end, for a coordinate vector (x′, y′, z′), let n_(x′,y′,z′) be the number of vertices at coordinate (x′, y′, z′) that are unassigned to clusters just before the cluster C is created.
We say that an edge uw ∈ E + 1 is cut if one endpoint of uw is in C, and the second endpoint is outside C and unassigned to any cluster at the moment of creating C.
If an edge from E⁺_1 is cut, its endpoint not in C has a coordinate vector at distance 2 from (x, y, z), say (x′, y′, z). The number of E⁺_1 edges cut towards this coordinate is at most n_(x′,y′,z) · (n_(x′,y,z) + n_(x,y′,z)). We charge them to the E_2 pairs between (x, y, z) and (x′, y′, z) and the E_2 pairs between (x′, y, z) and (x, y′, z). Because all pairs between these coordinate classes lie in E_2, it suffices to show that

n_(x′,y′,z) · (n_(x′,y,z) + n_(x,y′,z)) ≤ n_(x,y,z) · n_(x′,y′,z) + n_(x′,y,z) · n_(x,y′,z).

This easily follows from the fact that n_(x,y,z) ≥ max{n_(x′,y′,z), n_(x′,y,z), n_(x,y′,z)}. This finishes the proof of the lemma.
Lemma 18 suggests bounding the total number of special pairs. This is how we will do it.
Lemma 19. Let E⁺_i and E⁻_i for i = 0, 1, 2, 3 be as in Lemma 18. Then, the following holds.
We show that if Ls1 and Ls2 are both not (2 − δ)-competitive, then C := Pivot(Opt, Ls1, Ls2) returns a clustering of cost strictly less than 8δ · Cost(Opt). This will contradict the optimality of Opt for δ = 1/8. That is, we will use Cx = Opt, Cy = Ls1, and Cz = Ls2. Combining Lemmata 17 and 19 immediately gives the following.
Lemma 20. If both Ls1 and Ls2 are not (2 − δ)-competitive, then the total weight of special pairs is less than 4δ · Cost(Opt).
The final lemma below is the first place where we use that w ≡ 1 in Theorem 13.This is necessary for Lemma 18 to be useful, as it considers the number of pairs, not their weight.
Lemma 21. If both Ls1 and Ls2 are not (2 − δ)-competitive, then Cost(C) < 8δ · Cost(Opt).
Proof. Recall that C is the clustering scheme resulting from Pivot(Opt, Ls1, Ls2). By Lemma 18, the number of normal pairs accounted for in Cost(C) is not larger than the number of special pairs. By Lemma 20 and the assumption w ≡ 1, the cost of C on normal pairs and on special pairs are both less than 4δ · Cost(Opt). Hence, Cost(C) < 8δ · Cost(Opt).
Since Opt is an optimum solution, Lemma 21 gives a contradiction for δ = 1/8. Hence, at least one of Ls1 and Ls2 is (2 − 1/8)-competitive. This completes the proof of Theorem 13.
A remark on a hard instance. We finish this section with an example showing that the Local-Search-Flip-Local-Search algorithm can be as bad as 14/9-competitive. Consider a graph on 3 × 5 × 5 vertices, where the vertices have coordinates from (1, 1, 1) to (3, 5, 5). There is an edge between two vertices if and only if their Hamming distance is at most 2. The optimal solution is to cluster vertices which have the same x-coordinate, and the cost is (3 − 1)(5 + 5 − 1)/2 = 9 per vertex. It can be checked that Ls1 and Ls2 may cluster vertices which have the same y-coordinate and the same z-coordinate respectively, and the cost is (5 − 1)(3 + 5 − 1)/2 = 14 per vertex. The ratio is 14/9 ≈ 1.56.
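The per-vertex costs on this instance can be checked mechanically; the short script below (illustrative, not from the paper) counts both costs on the 3 × 5 × 5 instance:

```python
from itertools import product, combinations

def clustering_cost(points, edge, key):
    """Correlation clustering cost when clustering `points` by `key`:
    edges across clusters plus missing edges within clusters."""
    cost = 0
    for u, v in combinations(points, 2):
        same = key(u) == key(v)
        if edge(u, v) and not same:
            cost += 1  # edge between clusters
        elif not edge(u, v) and same:
            cost += 1  # missing edge inside a cluster
    return cost

points = list(product(range(3), range(5), range(5)))        # 3 x 5 x 5 grid
edge = lambda u, v: sum(a != b for a, b in zip(u, v)) <= 2  # Hamming <= 2

n = len(points)
cost_x = clustering_cost(points, edge, key=lambda p: p[0])  # cluster by x
cost_y = clustering_cost(points, edge, key=lambda p: p[1])  # cluster by y
print(cost_x / n, cost_y / n)  # 9.0 and 14.0 per vertex
```

Within a same-x cluster all pairs differ in at most the two coordinates y, z, so no edges are missing inside; the same holds for same-y clusters, so both costs consist purely of cut edges.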

Iterative version of local search
In this section we present a more involved iterative local search algorithm that achieves an approximation ratio of (2 − 2/13 + ε) for any ε > 0. We also present it in full generality, where we are not able to find an exact local optimum, but only an approximate one. We start by defining what this "approximate" local optimum means.
As in the warm-up section, we will solve unweighted Correlation Clustering instances, but use local search on instances with some of the edge weights increased by the flips. The increase from the flips will be small: we will use a constant β > 0 (set to β = 0.5 in the end) and increase the weight of some edges by β at most twice. Hence, the following definitions need to handle general weight functions w : V(G)² → R₊. We say that the weight function w is normal if w(uv) = 1 for uv ∉ E(G) and w(uv) ≥ 1 for uv ∈ E(G).
Definition 22. Let G be an unweighted Correlation Clustering instance, let ε > 0 be an accuracy parameter, and let (K, E_adm) be a preclustered instance for G returned by Theorem 5 for (G, ε). That is, (K, E_adm) is ε-good; let C*_(K,E_adm) be the (unknown) ε-good clustering scheme in (K, E_adm) whose existence is promised by Theorem 5.
Let 0 < γ < ε^13/4 be a constant and let w : V(G)² → R₊ be a normal weight function. Then, a clustering scheme C is a γ-good local optimum for w if for every cluster C of C*_(K,E_adm) the stated inequality holds.
We are now ready to present our algorithm. It is parameterized by an accuracy parameter ε > 0, a constant 0 < γ < ε^13/4, a constant β > 0, and a number of iterations k. (We will set β = 0.5 and 0 < γ ≪ ε to be small constants in the end.) The input consists of an unweighted Correlation Clustering instance G, and the algorithm starts by computing an ε-good preclustered instance (K, E_adm) for G using Theorem 5.
The algorithm uses two operations that are worth recalling. First, it explicitly uses the Pivot operation described in the previous section. Second, for a weight function w and a clustering scheme C, it performs a flip by computing a new weight function w′ := w + β(E⁺ ∩ Ext(C)), which is shorthand for adding weight β to all edges of G that connect different clusters of C.
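A minimal sketch of the flip, assuming weights are stored in a dictionary keyed by edges (an illustrative representation, not the paper's):

```python
def flip(w, edges, clustering, beta=0.5):
    """w' := w + beta * (E+ ∩ Ext(C)): add beta to the weight of every
    edge whose endpoints lie in different clusters of `clustering`.
    `w` maps edges to weights (default 1); `clustering` maps vertex -> id."""
    return {e: w.get(e, 1) + (beta if clustering[e[0]] != clustering[e[1]] else 0)
            for e in edges}
```

Non-edges are untouched, so a normal weight function stays normal after any number of flips.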

Iterated-flipping Local Search(G)
• Let (K, E_adm) be the preclustering obtained via Theorem 5 on G and ε.
• Let w_0 ≡ 1 be the uniform weight function.
• Let C′_0 be a γ-good local optimum for w_0.
– Let C_i be a γ-good local optimum for w_i.
– …
• Output the best of …, C′′_k with respect to the original cost function.
(Footnote: Pivot can be implemented in O(n) time and space.)
We prove in this section the following theorem.
Recall that by Cost(C) we mean the cost of C with respect to the uniform weight w_0 ≡ 1. If we want to use a different weight function w, we write Cost_w(C).
We set α := (2/13 + α)/2. We will aim at a solution of cost (2 − α) · Cost(C*_(K,E_adm)). By the properties promised by Theorem 5 and by our choice of ε_0, this is enough to obtain a (2 − α)-approximation. Hence, for contradiction, we assume that the costs of all clustering schemes C_i, C′_i, and C′′_i generated by the algorithm are larger than (2 − α) · Cost(C*_(K,E_adm)). We will reach a contradiction.
We start with an analog of Lemma 14.
Lemma 24. Let C be a γ-good local optimum for G and a weight function w. Then, the following holds.
Proof. Let C be a cluster of C*_(K,E_adm). By the optimality of C, we have the inequality below; equivalently, it can be rewritten. The lemma follows from one further observation. Here, the first inequality uses the properties of (K, E_adm) promised by Theorem 5 while the last inequality uses the assumption γ < ε^13/4.
We now proceed to the analogs of Lemmata 15 and 16.
Lemma 25. Let C_1, …, C_ℓ be clustering schemes and let w := w_0 + β·Σ_i (E⁺ ∩ Ext(C_i)). (That is, for every 1 ≤ i ≤ ℓ, we add a weight of β to every edge connecting two distinct clusters of C_i.) Let C be a γ-good local optimum of G and w. If Cost(C) > (2 − α) · Cost(C*_(K,E_adm)), then the stated bound holds.
Rearranging this, we get the next inequality. Applying the definition of w(C*_(K,E_adm)), and then adding the inequality of Lemma 24 to both sides, we obtain the following estimate. The lemma now follows from the above and three simple observations.
Fix 1 ≤ i ≤ k. By Lemmata 18 and 19, we have (7). Recall β = 0.5. Lemma 25 applied to C_i gives (8); Lemma 25 applied to the next scheme gives (9). To bound the right-hand side of (7), we add twice (8) to twice (9), obtaining the following.
Combining the above with (7), we obtain the next bound. Using the assumption Cost(C_i) > (2 − α) · Cost(C*_(K,E_adm)), the above estimate motivates us to define the value b_i for 0 ≤ i ≤ k. Rewriting (10), we obtain (11). Applying Lemma 25 to C′_0, we have (12). Since every b_i is nonnegative, by the choice of k there exists an index i_0 such that b_{i_0} is sufficiently small. By (12) and the definition of i_0, for every 0 ≤ i < i_0 the corresponding bound holds. Hence, by combining (11) with the definition of i_0, we have 6.5α + 6ε − 1 > α − 2/13.
This is a contradiction with the assumption α < 2/13 and the choice of ε_0. This finishes the proof of Theorem 23.

Polynomial-Time Implementation of Local Search
In this section we provide a polynomial-time implementation of the local search routine needed in Section 3.2. That is, given a Correlation Clustering instance G with a preclustering (K, E_adm) obtained via Theorem 5 (with the implicit ε-good clustering scheme C*_(K,E_adm)) and a weight function w, we are to find a γ-good local optimum for w.
The usage in Section 3.2 takes ε > 0 as a sufficiently small accuracy parameter and 0 < γ < ε^13/4 as a second parameter. The weight function w equals 1 on non-edges and takes values w(e) ∈ [1, 2] for edges e ∈ E⁺. It will be a bit cleaner to consider a slightly more general variant where w(e) ∈ [1, W] for a constant parameter W ≥ 1, but one can keep in mind that we actually only need the case W = 2. In our algorithm, the weight of a pair can be computed in constant time by checking whether the endpoints are neighbors and whether they are in the same cluster in some previous local search solution(s).
Lemma 26. Let w be a weight function that equals 1 on non-edges and takes values in [1, W] on edges. Consider two clusterings C and C′ and assume that (13) holds for each cluster of C′. Then, the conclusion below holds.
Proof. Equivalently to (13), we have the following. The lemma follows from summing the above inequality over every cluster C_i of C′ and observing the final estimate. Here, the penultimate inequality uses the properties of (K, E_adm) promised by Theorem 5 while the last inequality uses the assumption γ < ε^13/4.
We will need the following weight-adjusted variants of previously introduced graph notation.
Definition 27. Let d_w(v) = Σ_{u : (u,v) ∈ E⁺} w(u, v) be the weighted degree of v under w. Similarly, we define d_w(v, S) = Σ_{u ∈ S : (u,v) ∈ E⁺} w(u, v), where S is a multiset of vertices.
Observation 28. The following statements follow directly from the properties of (K, E_adm).
We define the neighborhood function N (v) as follows, which is slightly different from N v , the admissible neighborhood.
We define the following.
Estimates of set sizes. We will need a few estimates of the sizes of various neighborhood-like sets.
+ 1, we only need to upper bound |K| when v belongs to an atom K. According to Definition 3, for sufficiently small ε, at least half of the vertices in …
All clusters in the clustering scheme maintained by our algorithm will be almost good clusters in the following sense.
Definition 31 (Almost good clusters). We say a cluster C is almost good if there exists a vertex r ∈ V (not necessarily in C) such that C ⊆ N(r).
Lemma 32. For any vertex v and any almost good cluster C containing v, |C| ≤ 12ε^{-4} d(v).
Proof. Since C is almost good, there exists r ∈ V such that C ⊆ N(r). The bound then follows from Lemma 30.
Since every cluster considered by our local search algorithm will be an almost good cluster, we have the following corollary.
Corollary 33. Let C be a clustering maintained by the local search algorithm. Then for any …
The last estimate we need is the following.
Costs and estimated costs. As discussed in the introduction, we try to find a cluster that improves our local search solution with the help of sampling. We will sample a number of vertices from the sought cluster and then, for every other vertex v, try to guess whether v is in the sought cluster by looking at how many vertices from the sample are adjacent to v. To this end, we will need the following definitions.
Definition 35. Consider a clustering C of the graph, let v be a vertex, let C(v) ∈ C be its cluster, and let K be an arbitrary set of vertices. Under weight w, we define CostStays_w(K, v) to be the total weight of edges (u, v) that are violated in C + (K \ {v}) and CostMoves_w(K, v) to be the total weight of edges (u, v) that are violated in C + (K ∪ {v}). By definition,
Lemma 36. Let η_0 > 1. Consider a clustering C of the graph, let v be a vertex, C(v) be its cluster, and K be an arbitrary set of vertices of size s. Furthermore, let S = (u_1, …, u_{η_0}) be a sequence of uniform samples from K. Consider the clustering C + K. We define the following random variables, where [boolean expression] denotes the indicator function.
We have that
The next two lemmata say that the estimated costs indeed approximate the real cost well, even if we know the actual size of the cluster only approximately.
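As a hedged sketch of the sampling idea behind these estimators (the interface below is illustrative and does not reproduce the paper's exact EstCostStays/EstCostMoves definitions): estimate the total weight of violated pairs between v and an unseen cluster K of size s from a uniform sample of K, rescaling the sample average.

```python
def est_violations_to_set(v, K_sample, s, in_same_cluster, weight, is_edge):
    """Estimate the total weight of violated pairs between v and a cluster
    K of size s, given only a uniform sample K_sample of K.  A pair (u, v)
    is violated if it is an edge across clusters or a non-edge inside a
    cluster.  All callback names are illustrative assumptions."""
    bad = 0.0
    for u in K_sample:
        same = in_same_cluster(u, v)
        if is_edge(u, v) and not same:
            bad += weight(u, v)
        elif not is_edge(u, v) and same:
            bad += 1.0  # non-edges have weight 1
    return bad * s / len(K_sample)  # rescale sample average to all of K
```

When the sample is the whole cluster the estimate is exact; for smaller samples the concentration bounds of Lemmata 37 and 38 control the error.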
Lemma 37. Let η_0 = η^5 for η > 0. Consider the setting of Lemma 36, a vertex v in cluster C(v), an arbitrary cluster K of size s, and a sequence S of uniform random samples from K of length η_0. Then, with probability at least 1 − 4 exp(−η/2), both inequalities (14) and (15) hold.
Proof. Let X_i be the corresponding random variable; we have X_i ∈ [−s/η_0, (s/η_0)·W]. Equation (14) holds if and only if the corresponding deviation bound holds; according to Hoeffding's inequality, this happens with probability at least 1 − 2 exp(−η/2). Similarly, (15) holds if and only if its deviation bound holds, which, again by Hoeffding's inequality, happens with probability at least 1 − 2 exp(−η/2). By a union bound, both (14) and (15) hold with probability at least 1 − 4 exp(−η/2).
Lemma 38. Let η_0 = η^5 for η > 0. Consider the setting of Lemma 36, a vertex v in cluster C(v), an arbitrary cluster K of size s, and a sequence S of uniform random samples from K of length η_0. Let s̃ ∈ (1 ± ε′)s be an estimate of s. Then, with probability at least 1 − 4 exp(−η/2), the following two inequalities hold.
Proof.
Corollary 39. In the setting of Lemma 38, let ε′ = 1/η². Then, with probability at least 1 − 4 exp(−η/2), the following two inequalities hold.
In the remainder of this section, we fix ε′ = 1/η² for simplicity.
The algorithm. The algorithm maintains a tentative solution C and iteratively tries to find an improving cluster. It is parameterized by an integer parameter η > 0, which we fix later. In one step, the algorithm invokes the following GenerateCluster routine for every choice of r ∈ V(G), multisets (S_i)_{i=1}^{η} of vertices in N(r) of size η^5 each, and integers (s̃_i)_{i=1}^{η}. If any of the runs finds a cluster S′ such that Cost_w(C + S′) < Cost_w(C), it picks the S′ that minimizes Cost_w(C + S′), replaces C := C + S′, and restarts. If no such S′ is found, the algorithm terminates and returns the current solution C as the final output.
Polynomial-Time-Local-Search(η)
• Compute an atomic pre-clustering (K, E_adm) using the Precluster algorithm.
Note that in Section 3.2 we only use weights in {1, 1.5, 2}, and therefore there are only polynomially many possible costs for a clustering scheme. Hence, Polynomial-Time-Local-Search runs for polynomially many iterations and takes polynomial time, as η and ε are constants. In the next sections, we will remove this assumption and consider arbitrary bounded weight functions. The following lemma shows that for a sufficiently large integer η, the above algorithm indeed finds a γ-good local optimum. Note that for the polynomial-time implementation, we would only need a weaker statement saying that there exists a choice of (S_i)_{i=1}^{η} and (s̃_i)_{i=1}^{η} for which the statement holds, as we iterate over all possible choices. However, the fact that a vast majority of choices leads to generating an improving cluster S′ is crucial for the subsequent running time improvements, where we will only sample the sets S_i and the integers s̃_i.
Lemma 40. Consider a clustering C maintained by the local search algorithm and assume that there exists a vertex r and an (ε, ε/2)-good cluster C with K(r) ⊆ C ⊆ N(r) such that |C| > 1 and the improvement condition below holds. Then there exists a collection of sets C*_1, …, C*_η of vertices with the following properties. First, …
Then, with probability at least 1 − 2η exp(−η), the cluster S′ output by GenerateCluster(C, r, K(r), S, s̃) satisfies the stated bound.
Proof. Fix r as in the lemma statement and let C* be the most improving cluster in N(r) containing K(r). Since C is an (ε, ε/2)-good cluster containing r, and each (ε, ε/2)-good cluster contains at most one atom, we have that C ⊆ N(r). Thus, we have the following.
Proof. We first prove the first inequality. If … The second inequality follows from the first and the condition above.
Let s* = |C*|. Next, we will prove a lower bound on s*. To this end, we show a rather brute-force estimate: adding one new cluster to a clustering scheme cannot improve the cost by too much.
Claim 42. Let C be a clustering maintained by the local search algorithm. Then the improvement from adding one new cluster is bounded as follows.
Proof. We split the cost improvement into two terms. The first term is the improvement on E⁺, which is at most the quantity below. The second term is the improvement on E⁻, which is at most the maximum amount of cost we paid before: let v be an arbitrary vertex in C and let C_i be the cluster in C containing v; the minus cost we pay in C at v is at most the following. Together with the improvement in the first part, the claim holds.
Observe that by Lemma 34 we have the estimate below. Together with Claim 42 applied to C*, it follows that s* is bounded from below. Let v be a vertex in D(r); according to the optimality of C*, we have the following.
To analyze the algorithm, let S′^1, …, S′^η be defined so that S′^i is the set of elements of S′ at the beginning of the i-th iteration of the for loop. We define …
Let S_i be a uniform sample of the cluster C*_i and s̃_i = |C*_i|. We want to show the following claim.
By the definition of C*_{i+1}, we have the identity below, so it is sufficient to prove the stated inequality. We observe the following, where ∆ denotes the symmetric difference of two sets. Fix a vertex v ∈ Q_i ∆ C*_i and let C ∈ C be the cluster containing v. We have the bound below, where Cost_w(C, v) is the total cost paid by the clustering scheme C for all edges or non-edges incident to v. According to Corollary 33, v can be one of the following three types.
In the following, we split the cost difference over vertices of the three types and bound each type separately.
Missing nodes. According to Corollary 39, the relevant event fails with probability at most 4 exp(−η/2); therefore v ∉ Q_i with probability at most 4 exp(−η/2). Similarly as before, more than 4 exp(−η)·η^{-1}·|D(r)| core nodes are missing with probability at most exp(−η/2).
Non-core nodes. If v ∈ C*_i is a non-core node, then the bound below holds; in the inequalities, we use the fact stated afterwards.
Summing up the total cost difference over all vertices of the three types, we obtain the final bound. The last inequality holds since exp(η) > 400η·ε^{-6}. This finishes the proof of the claim.
This finishes the proof of the lemma.
By the pigeonhole principle, there exists a cluster as required. According to Lemma 40, in this case, with probability at least 1 − 2η exp(−η), the cluster S′ output by GenerateCluster(C, r, K(r), S, s̃) in the last iteration satisfies the improvement bound, so the iteration will continue, which leads to a contradiction.

Faster Implementation of Local Search
In this section, we show a faster implementation of the local search algorithm, using sampling to find new improving clusters instead of exhaustive enumeration. This is not the final algorithm, but both the algorithm and the analysis are important preparation for the improvements that follow. We define d_adm(C) = Σ_{u∈C} d_adm(u). We will consider the following local search algorithm, with parameters η, γ and s.
• Repeat the following rounds until C does not change in Θ(n log n) consecutive rounds:
– Sample a vertex r′ with probability 1/n and try to improve the clustering by the singleton {r′}.
– Sample a vertex r with probability 1/(n·d(r)), or instantly finish this round with probability 1 − Σ_{r∈V} 1/(n·d(r)).
– Uniformly sample η subsets T_1, …, T_η of size s from N(r).
– For each collection of η (not necessarily disjoint) multisets S = S_1, …, S_η of η^5 vertices in T_i respectively, and each collection of η integers s̃ = s̃_1, …, s̃_η ∈ {εd(r)(1 …
The number of rounds. We know that the cost of our clustering improves by at least γ|E_adm|/n per Θ(n log n) rounds. On the other hand, the following lemma provides an upper bound on the initial cost.
Lemma 46. The cost of our initial solution, which consists of atoms and singletons, is bounded as follows.
Proof. First, the minus cost paid by our initial solution comes from atoms, so it is also paid by Opt. The plus cost we pay is induced by edges between atoms, or incident to a singleton in Opt, and is again also paid by Opt. So our extra cost is at most the number of edges incident to vertices which do not belong to any atom and are not singletons in Opt. For any such vertex v, since there is a good cluster containing it, d_adm(v) + 1 ≥ εd(v). Hence the additional cost is bounded as claimed.
So, taking e.g. γ = ε^20, the algorithm stops in O_ε(n² log n) rounds.
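The pivot-sampling step of a round can be sketched as follows (illustrative names; note that Σ_r 1/(n·d(r)) ≤ 1 since d(r) ≥ 1, so the leftover probability mass finishes the round with no pivot):

```python
import random

def sample_pivot(vertices, degree, n):
    """Sketch of one round's pivot sampling: vertex r is chosen with
    probability 1/(n * d(r)); with the remaining probability the round
    finishes instantly (returns None)."""
    x = random.random()
    acc = 0.0
    for r in vertices:
        acc += 1.0 / (n * degree[r])
        if x < acc:
            return r
    return None  # finish this round with no pivot
```

Low-degree vertices are thus sampled more often, which compensates for the O_ε(d(r)²) work spent per sampled pivot in the time-complexity analysis below.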
Time complexity. The number of rounds is O_ε(n² log n). In each round, each vertex r is sampled with probability 1/(n·d(r)). If r is sampled, each call to GenerateCluster goes through its neighbors, and for each neighbor u it takes O(d(u)) time to estimate the costs of moving and staying, so it takes O_ε(d(r)²) time in total. There are only constantly many choices of S and s̃. So the total time complexity is O_ε(n² log n …
Correctness. In each round, we sample each vertex r with probability 1/(n·d(r)). Suppose the optimal good clustering is Opt = {C_1, …, C_k}. For each cluster C_i, if it contains an atom K such that the stated condition (with constant 576) holds, we split the cluster into K and C_i \ K. Let Opt′ be the new clustering we obtain. By the following lemmas, it suffices to compare our solution with Opt′ instead of Opt.
Proof. Consider any cluster C_i ∈ Opt. If C_i is not split, then it is (ε, ε)-good and hence (ε, ε/2)-good. Otherwise it is split into an atom K and C_i \ K, where K is (ε, ε/2)-good since it is an atom, and …
Proof.
The lemma simply follows.
We summarize our conclusion in the following theorem.
Theorem 50. With high probability, the clustering C returned by our faster implementation is a (3/2)γ-good local optimum.
Further improvements. In the remaining sections, we show how to implement our local search algorithm in the MPC, sublinear, and streaming models. In each round, our algorithm can be viewed as two steps:
• Randomly sample at most one pivot vertex;
• Try to improve the current clustering by sampling a constant number of (possibly dependent) clusters containing the pivot.
In Section 6, we show how to sample multiple pivots in one round, reducing the number of rounds to a constant, while an analogue of Lemma 40 still holds.
In Section 7, we show another way of sampling clusters for each pivot and estimating the costs, which works in the sublinear and streaming models.

MPC implementation of local search
In this section we show how to implement the local search within a constant number of rounds in the MPC model. Note that we are not aware of a way to implement the pivot step of the algorithm of Section 3.2 efficiently in the MPC model, but we can implement the algorithm of Section 3.1 and achieve a (2 − 1/8 + ε)-approximation, as long as we can implement the local search efficiently. … was selected.
Lemma 51. Let G be a graph with an ε-good preclustered instance (K, E_adm). Let each vertex r be selected with probability ε⁴γ′/(24 d(r)). For each u ∈ V(G), let X_u be the number of neighbours of u in D(u) that have a neighbour in E_adm that was selected. Then for all u ∈ V(G) the stated bound holds.
Proof. First, we calculate the expected value of X_u, using a union bound and degree similarity of neighbours in E_adm. Then the lemma follows from Markov's inequality.
The consequence of this lemma is that we stop due to having too many selected neighbours with probability at most 1/2.
To argue that this algorithm works in the MPC model, we first note that each vertex only needs to broadcast information to its neighbors, or aggregate information from its neighbors. When each machine has memory O(m^δ), we can have O(m^{1−δ} + n) machines, each taking care of O(m^δ) edges incident to the same vertex. For each vertex with degree larger than m^δ, we can build an O(1/δ)-level B-tree with fan-out O(m^δ) to connect all machines related to this vertex. Then the broadcasting and aggregating can be done in O(1/δ) rounds.
Lemma 52. Let C be a clustering, let C be the optimal improvement for the clustering, and let S′ be the improving cluster computed by the near-linear time local search with r as the starting vertex. Let S′_mpc be the improving cluster computed by the MPC local search with r as the selected vertex. Let γ′ ≤ ε⁸γ/2304 and let η and γ be as defined in Lemma 40. Then with probability at least 1/4 − 2η exp(−η) the stated guarantee holds.
Proof. The setting of this algorithm is the same as in Lemma 40, except that we have possibly lost a γ′ fraction of the optimal cluster. We therefore compare our solution to the optimal one selected in Lemma 40. We only continue if we have lost less than a γ′ fraction of the neighbors. Using this, we can determine how much additional cost this introduces.
Here we use the fact that S′_mpc is constructed from the same initial vertex r and so shares the same atoms as S′. Furthermore, S′_mpc is contained in N(r) while S′ \ S′_mpc is contained in D(r). Applying Lemma 34, we get the desired bound by the definition of γ′. As for the probability of this happening, observe that the events of getting enough samples, as described in Lemma 45 and Lemma 51, are independent and each holds with probability 1/2. By a union bound with the probability of success from Lemma 40, we get a success probability of 1/4 − 2η exp(−η). From this we see that each time we select a vertex to construct a cluster from, with constant probability we obtain a constant fraction of the possible improvement for each cluster. Since we can hit an Ω_ε(1) fraction (instead of Ω_ε(1/n) in previous sections) of the clusters in Opt′ in one round, a single round in expectation achieves a constant fraction of the improvement. This means that among the Θ(log n) parallel executions, at least one will, with high probability, achieve the improvement if it is possible.
Theorem 53. When ∆ ≤ γ|E_adm|, the clustering C returned by the MPC implementation of local search is a (3/2)γ-good local optimum with high probability.
Proof. Since the setting is essentially the same as in Theorem 50, the proof is the same too.

Sublinear and Streaming Implementations of Local Search
Recall Lemmas 9 and 10 and Theorem 11, which allow us to sample neighbors, (globally) sample vertices, query admissible pairs, and traverse admissible neighbors, in both the sublinear and streaming models.
We start from our MPC implementation, which has only O(log n) parallel runs in a constant number of rounds; in each run, we may create multiple disjoint new clusters from multiple selected "pivots". There are two operations we need to change to make the algorithm work in the sublinear and streaming models:
• When we estimate CostStays and CostMoves, we cannot enumerate all neighbors of v. Instead, we need to sample a constant-size set of its neighbors and use one more Chernoff bound in Lemma 38;
• For each cluster output by GenerateCluster, we need to estimate the improvement and then choose the most (estimatedly) improving one. We need a concentration bound for the improvement of each round in Section 5.
Estimating CostStays and CostMoves. Here we rewrite the subroutine that estimates CostStays. Let N_G(v) = {u ∈ V | (u, v) ∈ E⁺} be the neighborhood of v in the original graph G.
EstCostStays′_w(S = (u_1, …, u_{η_0}), s̃, v)
• Draw η′ i.i.d. uniformly random samples x_1, …, x_{η′} from N_G(v). …
In our algorithm, we only care about the difference between EstCostStays and EstCostMoves, so instead of computing them individually, we can implement the following function that computes the difference between the two, avoiding the computation of d_w(v).
• return CostStays − CostMoves
To compute w(x_i, v), we only need to check whether x_i and v are in the same cluster in previous local search solution(s), which we can store in O(n) space. Note that since x_i is sampled from N_G(v), we do not need to query whether it is a neighbor of v. Thus we can use Lemma 9 to sample the x_i's. In each round, since each vertex only samples a constant number of times, we only need a constant number of realizations of Lemma 9.
It is a little more tricky to sample the u_i's. When computing d_w(v, {u_i}), we need to query whether v and u_i are neighbors, hence we need access to the neighborhood of u_i. We do the sampling as follows. We first use Lemma 10 to sample some vertices; call them special vertices. For the pivot vertex r from whose admissible neighborhood N(r) we want to sample the u_i's, we know that with high probability Ω(log n) vertices in N(r) are special, since |N(r)| = Ω(d(r)) (when an improving cluster exists) and vertices in N(r) are degree-similar to r. Notice that vertices in N(r) are sampled with similar but not exactly the same probabilities in Lemma 10, so we further discard each vertex with some (constant) probability to get independent samples. In the end, we sample the u_i's from these remaining special vertices. By Lemma 10, we can store the entire neighborhood of all special vertices, so we can answer each neighbor query in O(1) time. In each round, since each pivot only samples a constant number of times and different pivots sample from disjoint sets, we only need a constant number of realizations of Lemma 10.
Note that we do not need to store the $\eta'$ samples in the previous algorithms, since we can process them one by one, so we introduce no extra space cost here. The running time of this estimation is now $O_\epsilon(1)$.
Lemma 54. Consider the setting of Lemma 38 and let $\eta' > W^2 \eta^5 / (4\epsilon^4)$. With probability $1 - 6\exp(-\frac{1}{2}\eta)$, the following two inequalities hold: $\mathrm{EstCostMoves}'_w(S, s, v) \in \mathrm{ExpCostMoves}_w(s, v) \pm \ldots$

If the lemma holds, we can replace EstCostMoves by EstCostMoves′ in GenerateCluster: they introduce the same asymptotic bounds on the failure probability and the error under the setting of Lemma 38, so the same proof still holds.

• return $I \cdot |S' \,\Delta\, C(r)| / \eta'$

Here we also need to query the neighbors of $u_j$. Since the $u_j$'s are sampled from $S' \,\Delta\, C(r)$, which also has size $\Omega(d(r))$, we can sample them the same way we previously sampled the $u_i$'s.
We will prove the following lemma. Then, by a union bound, all the costs are estimated within a small error if we set $\zeta = 2\log n$. By changing the constant in the definition of $\Delta$ (in Section 5), the same algorithm also works.
is, the set of edges and non-edges of $G$. Let $d(v)$ denote the degree of vertex $v$, and for any subset $C$ of vertices let $d(v, C)$ be the number of neighbors of $v$ in $C$. Some of the graphs will be accompanied by a weight function $w : \binom{V(G)}{2} \to \mathbb{R}_+$. The unweighted setting corresponds to $w \equiv 1$. For any set $D \subseteq \binom{V(G)}{2}$, $w(D)$ denotes the total weight of pairs in $D$; when the graph is unweighted, $w(D) = |D|$.
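For concreteness, the notation $d(v, C)$ and $w(D)$ translates directly into code; a minimal sketch over an adjacency-list graph (function names are ours):

```python
def degree_in(adj, v, C):
    # d(v, C): number of neighbors of v inside the vertex set C
    return sum(1 for u in adj[v] if u in C)

def weight(w, D):
    # w(D): total weight of a set of pairs; with w ≡ 1 this is just |D|
    return sum(w(u, v) for (u, v) in D)
```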

Theorem 8 (Preclustering in the MPC and sublinear time models; [CLM+21], Theorem 1). For any constant $\delta > 0$, Preclustering$(G, \epsilon)$ can be implemented in the MPC model in $O(1)$ rounds. Letting $n = |V|$, this algorithm succeeds with probability at least $1 - 1/n$ and requires $O(n^\delta)$ memory per machine. Moreover, the algorithm uses $O(|E| \log n)$ total memory.

3.1 Warm-up: a $(2 - \frac{1}{8})$-approximation

As a warm-up, to showcase our techniques, we present an analysis of a local optimum of a simple local search.

Definition 12. A clustering scheme Ls is a local optimum if for every $C \subseteq V(G)$ we have $\mathrm{Cost}(\mathrm{Ls} + C) \ge \mathrm{Cost}(\mathrm{Ls})$.
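A minimal sketch of the objective and of the $\mathrm{Ls} + C$ move, assuming (our reading of Definition 12) that $\mathrm{Ls} + C$ carves the set $C$ out into its own cluster:

```python
from itertools import combinations

def cc_cost(edges, clusters):
    """Correlation clustering objective: edges between clusters plus
    missing edges within clusters (unweighted case)."""
    E = {frozenset(e) for e in edges}
    cost = 0
    for cl in clusters:  # non-edges inside each cluster
        cost += sum(1 for u, v in combinations(sorted(cl), 2)
                    if frozenset((u, v)) not in E)
    where = {v: i for i, cl in enumerate(clusters) for v in cl}
    # edges whose endpoints lie in different clusters
    cost += sum(1 for e in E if len({where[v] for v in e}) == 2)
    return cost

def add_cluster(clusters, C):
    """Ls + C: pull the vertices of C out into a new cluster of their own."""
    rest = [set(cl) - C for cl in clusters]
    return [cl for cl in rest if cl] + [set(C)]
```

A clustering is then a local optimum exactly when no set $C$ makes `cc_cost(edges, add_cluster(Ls, C))` smaller than `cc_cost(edges, Ls)`.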

Proof. Since $v \in N(r)$, either $v$ and $r$ are in the same cluster, or $v$ is admissible to $r$. In the former case, $N(r) = N(v)$, and by Lemma 29, $|N(r)| = |N(v)| \le 6\epsilon^{-3} d(v)$. In the latter case, we have degree similarity between $r$ and $v$, so by Lemma 29, $|N(r)| \le 6\epsilon^{-3} d(r) \le 12\epsilon^{-4} d(v)$.
$(r, S_1, \ldots, S_\eta, s_1, \ldots, s_\eta)$.

– Let $S^*, s^*$ be the pair such that the cost of $C(S^*, s^*)$ is minimized.
– If the cost of $C(S^*, s^*)$ is at least $\gamma |E_{adm}|/n$ less than the cost of $C$:
  * $C \leftarrow C(S^*, s^*)$;

The size of the sampled set $S$. For GenerateCluster to work, we need $S_i$ to be a uniform sample of $C^*_i$. When $|T_i \cap C^*_i| \ge \eta^5$ for all $i$, we have at least one valid sample $S$. Since each $T_i$ is uniformly sampled from a superset of $C^*_i$, the samples we get are uniform.

Lemma 45. If $s > 10^6 \eta^6 \epsilon^{-27}$, then $|S \cap C^*_i| \ge \eta^5$ for all $i$ with probability at least $\frac{1}{2}$.

Proof. By Lemma 40, $|C^*_i| \ge \frac{\epsilon^{27}}{451584} |N(r)|$. The claim follows by a Chernoff bound for each $i$ and a union bound over $i$.
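The acceptance rule above (take the cheapest candidate, switch only on a sufficiently large improvement) can be sketched as follows; the names are ours, with `threshold` standing for $\gamma|E_{adm}|/n$ and `cost` for the clustering cost:

```python
def try_improve(cost, current, candidates, threshold):
    """Pick the cheapest candidate clustering C(S_i, s_i) and adopt it
    only if it beats the current clustering by at least `threshold`.
    (Illustrative sketch of the update step, not the paper's full routine.)"""
    best = min(candidates, key=cost)
    if cost(best) <= cost(current) - threshold:
        return best          # improving step: C <- C(S*, s*)
    return current           # no candidate clears the threshold
```

The threshold guarantees that every accepted step makes measurable progress, which bounds the number of improving rounds.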
Consider any cluster $C_i \in \mathrm{Opt}$. If we split $C_i$ into $K$ and $C_i \setminus K$, we can only increase the cost by at most $|K| \cdot |C_i \setminus K| \le \frac{\epsilon^{21}}{576} |C_i| \cdot |C_i \setminus K| \le \frac{\epsilon^{21}}{576} |N(K)| \cdot |D(K)|$, since $C_i \subseteq N(K)$. Then by Lemma 34, this is at most $\frac{\epsilon^{13}}{4} d_{adm}(C_i)$. Summing over all clusters, the lemma follows from Theorem 5.

Define $\Delta_i = \max\{\mathrm{Cost}(C) - \mathrm{Cost}(C + C'_i) - \gamma d_{adm}(C'_i),\, 0\}$ for $0 < \gamma < \frac{\epsilon^{13}}{4}$, and $\Delta = \sum_i \Delta_i$. If we hit $C'_i$, by Lemma 40 we can improve the cost by $\Delta_i$ with constant probability. So if some $\Delta_i$ is at least $\gamma |E_{adm}|/n$, then in the following $\Theta_\epsilon(n \log n)$ rounds, with high probability we observe one of them and improve $C$. Hence with high probability we will not stop with some $\Delta_i \ge \gamma |E_{adm}|/n$ at any round $t$. By a union bound over (a polynomial number of) rounds, with high probability we end with a clustering with $\Delta \le \gamma |E_{adm}|$.

Lemma 49. When $\Delta \le \gamma |E_{adm}|$, the clustering we have is a $\frac{3}{2}\gamma$-good local optimum (with respect to $\mathrm{Opt}'$).

Proof. By the definition of $\Delta$, \ldots Consider any cluster $C'_i \in \mathrm{Opt}'$ of size larger than 1. If it contains an atom $K$, we hit $K$ with probability at least $\frac{\epsilon |K|}{n |C'_i|} \ge \frac{\epsilon^{22}}{576 n}$; otherwise we hit some vertex in $C'_i$ with probability at least $\frac{\epsilon}{n}$. For $|C'_i| = 1$, we hit it with probability $\frac{1}{n}$.