Settling Time vs. Accuracy Tradeoffs for Clustering Big Data

We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for reducing the number of points - while random sampling runs in sublinear time and coresets provide theoretical guarantees, the former does not enforce accuracy while the latter is too slow as the numbers of points and clusters grow. Indeed, it has been conjectured that any sensitivity-based coreset construction requires super-linear time in the dataset size. We examine this relationship by first showing that there does exist an algorithm that obtains coresets via sensitivity sampling in effectively linear time - within log-factors of the time it takes to read the data. Any approach that significantly improves on this must then resort to practical heuristics, leading us to consider the spectrum of sampling strategies across both real and artificial datasets in the static and streaming settings. Through this, we show the conditions in which coresets are necessary for preserving cluster validity as well as the settings in which faster, cruder sampling strategies are sufficient. As a result, we provide a comprehensive theoretical and practical blueprint for effective clustering regardless of data size. Our code is publicly available and has scripts to recreate the experiments.


Introduction
The modern data analyst has no shortage of clustering algorithms to choose from but, given the ever-increasing size of relevant datasets, many are often too slow to be practically useful. This is particularly relevant for big-data pipelines, where clustering algorithms are commonly used for compression. The goal is to replace a very large dataset by a smaller, more manageable one for downstream tasks, with the hope that it represents the original input well. Lloyd's algorithm [49] was introduced for precisely this reason and minimizes the quantization error - the sum of squared distances from each input point to its representative in the compressed dataset. Arguably the most popular clustering algorithm, Lloyd's runs for multiple iterations until convergence, with every iteration requiring O(ndk) time, where n is the number of points, d is the number of features and k is the number of clusters - or the size of the targeted compression. For such applications, the number of points can easily be hundreds of millions and, since the quality of compression increases with k, standard objectives can have k in the thousands [41, 4]. In such settings, any O(ndk) algorithm is prohibitively slow.
Examples like these have prompted the rise of big data algorithms that provide both theoretical and practical runtime improvements. The perspectives of theoretical soundness and practical efficacy are, however, often at odds with one another. On the one hand, theoretical guarantees provide assurance that the algorithm will work regardless of whatever unlucky inputs it receives. On the other hand, it may be difficult to convince oneself to implement the theoretically optimal algorithm when there are cruder methods that are faster to get running and perform well in practice.
Since datasets can be large in the number of points n and/or the number of features d, big-data methods must mitigate the effects of both. With respect to the feature space, the question is effectively closed as random projections are fast (running in effectively linear time), practical to implement [50], and provide tight guarantees on the embedding's size and quality. The outlook is less clear when reducing the number of points n, and there are two separate paradigms that each provide distinct advantages. On the one hand, we have uniform sampling, which runs in sublinear time but may miss important subsets of the data and therefore can only guarantee accuracy under certain strong assumptions on the data [45]. On the other hand, the most accurate sampling strategies provide the strong coreset guarantee, wherein the cost of any solution on the compressed data is within an ε-factor of that solution's cost on the original dataset [25].
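To make the feature-space side concrete, here is a minimal sketch of a Gaussian random projection; the function name and the target-dimension constant are our own choices, while the distance-preservation guarantee is the standard Johnson-Lindenstrauss bound:

```python
import numpy as np

def jl_project(P, eps=0.5, seed=0):
    """Project an (n x d) dataset to O(log n / eps^2) dimensions with a
    random Gaussian matrix; pairwise distances are preserved up to a
    (1 +/- eps) factor with high probability."""
    n, d = P.shape
    target = min(d, max(1, int(np.ceil(8 * np.log(n) / eps ** 2))))
    rng = np.random.default_rng(seed)
    # Scale the entries so that squared norms are preserved in expectation.
    G = rng.normal(0.0, 1.0 / np.sqrt(target), size=(d, target))
    return P @ G
```

Computing the projection takes O(nd · log n / ε²) time, i.e. effectively linear in the input size.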
Our contributions We study both paradigms (uniform sampling and strong coresets) with respect to a classic problem - compression for the k-means and k-median objectives. Whereas uniform sampling provides optimal speed but no worst-case accuracy guarantee, all available coreset constructions have a running time of at least Ω(nd + nk) when yielding tight bounds on the minimum number of samples required for accurate compression.
It is easy to show that any algorithm that achieves a compression guarantee must read the entire dataset. Thus, a clear open question is what guarantees are achievable in linear or nearly-linear time. Indeed, currently available fast sampling algorithms for clustering [6, 5] cannot achieve strong coreset guarantees. Recently, [31] proposed a method for strong coresets that runs in time Õ(nd + nk) and conjectured this to be optimal for k-median and k-means.
While this bound is effectively optimal for small values of k, there are many applications such as computer vision [34] or algorithmic fairness [18] where the number of clusters can be larger than the number of features by several orders of magnitude. In such settings, the question of time-optimal coresets remains open. Since the issue of determining a coreset of optimal size has recently been closed [25, 28, 44], this is arguably the main open problem in coreset research for center-based clustering. We resolve this by showing that there exists an easy-to-implement algorithm that constructs coresets in Õ(nd) time - only logarithmic factors away from the time it takes to read in the dataset.
Nonetheless, this does not fully illuminate the landscape among sampling algorithms for clustering in practice. Although our algorithm achieves both an optimal runtime and an optimal compression, it is certainly possible that other, cruder methods may be just as viable for all practical purposes. We state this formally in the following question: When are optimal k-means and k-median coresets necessary, and what is the practical tradeoff between coreset speed and accuracy?
To answer this, we perform a thorough comparison across the full span of sampling algorithms that are faster than our proposed method. Through this we verify that, while these faster methods are sufficiently accurate on many real-world datasets, there exist data distributions that cause catastrophic failure for each of them. Indeed, these cases can only be avoided with a strong-coreset method. Thus, while many practical settings do not require the full coreset guarantee, one cannot cut corners if one wants to be confident in their compression. We verify that this extends to the streaming paradigm and applies to downstream clustering approaches.
In summary, our contributions are as follows:
• We show that one can obtain strong coresets for k-means and k-median in Õ(nd) time. This resolves a conjecture on the necessary runtime for k-means coresets [31] and is theoretically optimal up to log-factors.
• Through a comprehensive analysis across datasets, tasks, and streaming/non-streaming paradigms, we verify that there is a necessary tradeoff between speed and accuracy among the linear- and sublinear-time sampling methods. This gives the clustering practitioner a blueprint on when to use each compression algorithm for effective clustering results in the fastest possible time.
Preliminaries and Related Work

On Sampling Strategies.
As discussed, we focus our study on linear- and sublinear-time sampling strategies. Generally speaking, we consider compression algorithms through the lens of three requirements:
• Finding the compression should not take much longer than the time to read in the dataset.
• The size of the compressed data should not depend on the size of the original dataset.
• Candidate solutions on the compressed data are provably good on the original data.
If these requirements are satisfied then, when analyzing runtimes on large datasets, it is always preferable to compress the dataset and then perform the task in question on the compressed representation. Specifically, given a dataset P ∈ R^{n×d}, we concern ourselves with sampling a subset Ω ⊂ P of m points (such that m ≪ n) along with a weight vector w ∈ R^m. The goal is then that for any candidate solution C, Ω provides us with an idea of the solution's quality with respect to the original dataset, i.e. ∑_{p∈Ω} w_p cost(p, C) ≈ ∑_{p∈P} cost(p, C) for a problem-specific cost function. The quickest sampling strategy, running in sublinear time, is uniform sampling. It is clear, however, that this cannot provide any cost-preservation guarantee, as missing a single extreme outlier will cause the sampling strategy to fail. Thus, any approach that outperforms uniform sampling must read in the entire dataset and therefore run in at least linear time. Among these more sophisticated sampling strategies, coresets offer the strongest compression guarantee. Going forward, we will discuss this in the context of the k-median and k-means cost functions: for a dataset P ⊂ R^d with weights w : P → R_+ and any k-tuple C ⊂ R^d, cost_z(P, C) = ∑_{p∈P} w(p) · dist(p, C)^z, with z = 1 for k-median and z = 2 for k-means. We use OPT to denote min_C cost_z(P, C) and will denote an α-approximation as any candidate solution C such that cost(C) ≤ α · OPT.
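The cost function above can be written down directly. A minimal sketch (the function name is ours, and distances are computed naively rather than with any acceleration structure):

```python
import numpy as np

def cost_z(P, C, z=2, w=None):
    """Weighted clustering cost sum_p w(p) * dist(p, C)^z, with z = 1 for
    k-median and z = 2 for k-means. P is (n x d), C is (k x d)."""
    if w is None:
        w = np.ones(len(P))
    sq = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)  # (n x k)
    dists = np.sqrt(sq.min(axis=1))  # distance to the closest center
    return float((w * dists ** z).sum())
```

A weighted sample Ω then estimates the full cost via cost_z applied to (Ω, C) with the sample weights, which is the left-hand side of the approximation above.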
Recently, sampling with respect to sensitivity values has grown to prominence due to its simple-to-obtain coreset guarantee. True sensitivity values are defined as sup_C dist^z(p, C) / cost_z(P, C), where the supremum is taken over all possible solutions C. Intuitively, this is a measure of the maximum impact a point can have on a solution and is difficult to evaluate directly. Thus, the approximate sensitivity-sampling algorithm we consider is the following (as introduced in [37]). Given an α-approximate solution C to a clustering problem, importance scores are defined as σ(p) := cost(p, C)/cost(P, C) + 1/|C_p|, where C_p is the cluster that p belongs to. This is always an upper-bound on the sensitivity values [37]. In essence, sampling enough points proportionately to these values guarantees an accurate compression. The points must then be re-weighted to guarantee the coreset property: the coreset Ω consists of m points sampled proportionately to σ, with weights defined as follows: for any sampled point p, define w_p := ∑_{q∈P} σ(q) / (m · σ(p)). The weights ensure that the cost estimator is unbiased: in expectation, for any solution C, the cost evaluated on the sample should be equal to the original cost. It was shown in [42] that, when C is an O(1)-approximation, sampling m = Õ(k ε^{−2z−2}) many points was enough to ensure concentration, namely, Ω is a coreset with probability at least 2/3 - the conventional success probability for Monte-Carlo algorithms.
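A minimal sketch of this sampling scheme for k-means (z = 2), assuming the approximate solution is given as a center array plus a label array (the function name is ours, and the data is assumed to have non-zero cost under C):

```python
import numpy as np

def sensitivity_sample(P, centers, labels, m, seed=0):
    """Sensitivity sampling given an approximate solution (centers, labels).
    Importance scores: sigma(p) = cost(p, C)/cost(P, C) + 1/|C_p|.
    Weights w_p = sum(sigma) / (m * sigma(p)) make the estimator unbiased."""
    costs = ((P - centers[labels]) ** 2).sum(axis=1)  # k-means costs (z = 2)
    sizes = np.bincount(labels, minlength=len(centers))
    sigma = costs / costs.sum() + 1.0 / sizes[labels]
    probs = sigma / sigma.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(P), size=m, replace=True, p=probs)
    weights = sigma.sum() / (m * sigma[idx])
    return P[idx], weights
```

Note that the scores sum to k + 1, so outliers (large cost share) and points in small clusters are both sampled aggressively, while the weights down-weight exactly those over-sampled points.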
To perform this algorithm, the bottleneck in the running time lies in computing the solution C as well as then obtaining the cost of every point to its assigned center in C. This takes Õ(nk + nd) time when using a bicriteria approximation algorithm such as the standard k-means++ algorithm [2] combined with dimension reduction techniques (see for example [9, 19, 50]). This is precisely what was conjectured as the necessary runtime for obtaining k-means and k-median coresets, as merely assigning points to their centers from the bicriteria solution seems to require Ω(nk) running time [31].

Other Coreset Strategies
Many of the advancements regarding coresets have sought the smallest coreset possible across metric spaces and downstream objectives [42, 27], with much of the focus on obtaining the optimal size in the k-means and k-median setting. Recently, a lower bound [44] showed that the group sampling algorithm developed in [25, 27, 28] is optimal.
Although optimal coresets have size Õ(k · ε^{−2} · min(k^{z/(z+2)}, ε^{−z})) [28] and are theoretically smaller than those obtained by sensitivity sampling, the experiments of [57] showed that the latter is always more efficient in practice.
Beyond sensitivity sampling, the only other linear-time method we are aware of is the lightweight coresets approach [6], wherein one samples with respect to the candidate solution C = {µ}, i.e. the mean of the dataset, instead of a candidate solution with k centers. This runs in O(nd) time but provides a weaker guarantee - one incurs an additive error of ε · cost(P, {µ}). We note that this can be generalized to performing sensitivity sampling using a C that has fewer than k centers. We discuss this in more depth in Section 5.2.
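A sketch of the lightweight coreset sampler, following the half-uniform, half-distance-based mixture of [6] (the function name is ours; the data is assumed not to be entirely concentrated at its mean):

```python
import numpy as np

def lightweight_coreset(P, m, seed=0):
    """Lightweight coreset: sample w.r.t. the single-center solution
    C = {mean}, trading the strong coreset guarantee for an additive
    eps * cost(P, {mean}) error. Runs in O(nd) time."""
    n = len(P)
    mu = P.mean(axis=0)
    dists = ((P - mu) ** 2).sum(axis=1)
    # Half uniform, half proportional to squared distance from the mean.
    q = 0.5 / n + 0.5 * dists / dists.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])
    return P[idx], weights
```

The uniform half of the mixture caps every weight at 2n/m, which is what bounds the estimator's variance.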
Lastly, the BICO coreset algorithm [38] utilizes the SIGMOD test-of-time winning BIRCH [58] clustering method to obtain k-means coresets, while the Streamkm++ method [1] uses k-means++ [2] to obtain a coreset whose size depends on n and is exponential in d. While both were developed to perform quickly in a data stream, we show in Section 5 that they do not provide strong coreset guarantees in practice for reasonable coreset sizes.
All efficient coreset constructions are probabilistic, making coresets difficult to evaluate. For example, it is co-NP-hard to check whether a candidate compression is a weak coreset [57]. Therefore, although coreset algorithms succeed with some high probability, it is unclear how to computationally verify this. We refer to [57] for further discussion on this topic and discuss our evaluation metrics in Section 5.

Coresets for Database Applications
MapReduce is one of the most popular architectures for large-scale data analysis (see for example [17, 29, 33] and references therein for MapReduce algorithms for clustering). Within this context, strong coresets are 'embarrassingly parallel' and have a natural synergy with the MapReduce framework due to the following two properties. First, if two coresets are computed on subsets of data, then their union is a coreset for the union of the data subsets. Second, many coreset constructions produce a compression with size completely independent of n, allowing the coreset to be stored in memory-constrained environments.
Using these two properties, one can compress the data in a single round of MapReduce as follows [36]. The data is partitioned randomly among the m entities, each of which computes a coreset and sends it to a central host. By the first property, the union of these coresets is a coreset for the full dataset. Thus, the host now possesses a small summary of the full dataset with strong coreset guarantees. By the second property, this summary's size does not depend on the size of the original dataset - in particular, the total size of the messages received by the host is independent of n. Lastly, by the coreset guarantee, any solution that is good w.r.t. the summary is good w.r.t. the original data (and vice versa).
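This one-round scheme can be sketched in a few lines; the driver below is illustrative only (names are ours), with any strong-coreset routine plugged in as `coreset_fn`:

```python
import numpy as np

def merge_coresets(parts):
    """Composability: the union of coresets of disjoint data subsets is a
    coreset for the union of the subsets. `parts` is a list of
    (points, weights) pairs."""
    pts = np.vstack([p for p, _ in parts])
    wts = np.concatenate([w for _, w in parts])
    return pts, wts

def mapreduce_compress(P, shards, coreset_fn, seed=0):
    """One MapReduce round: randomly partition P, build a coreset on each
    shard (the map step), and union the results at the host (the reduce
    step). The host's summary has size independent of n."""
    rng = np.random.default_rng(seed)
    part = rng.integers(0, shards, size=len(P))
    parts = [coreset_fn(P[part == j]) for j in range(shards)]
    return merge_coresets(parts)
```

Because weighted coresets simply concatenate, the reduce step is a trivial union; all the computational work happens in parallel inside `coreset_fn`.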
This idea allows one to compute an approximate solution to k-means and k-median, and to do so efficiently, in MapReduce: each computation entity needs to communicate merely O(k) bits, where k is the number of clusters. The computational burden can therefore be completely parallelized up to the time required to compute a coreset in each entity - precisely the focus of this paper. We explore similar aggregation strategies experimentally in Section 5.4.

Quadtree Embeddings
A common technique for designing fast algorithms is to embed the Euclidean space into a tree metric using quadtree embeddings. The central idea is that any hypercube in the input space can be split into 2^d sub-cubes. We can represent this in a tree structure, where each sub-cube has the original hypercube as its parent. Randomly shifting the original hypercube and appropriately setting the weight of each branch then preserves the expected distance between points in different sub-cubes within an O(√d log ∆) factor. Here, ∆ is the spread of the point set and is equal to the ratio of the largest distance over the smallest non-zero distance. Given this context, we now introduce quadtree embeddings more formally:

Formal Overview Given a set of points in R^{n×d}, we want to return a tree structure that roughly preserves their pairwise distances. To do this, our first step is to obtain a box enclosing all input points, centered at zero, with all side lengths equal to ∆. This can be done as follows: select an arbitrary input point, and translate the dataset so that this point is at the origin. Then, using O(nd) time, set ∆ to be the maximum distance from any point to the origin. Note that, up to rescaling the points so that the smallest distance equals 1, this is equivalent to the spread as described in the previous paragraph.
Having obtained this box, add a shift s (picked uniformly at random in [0, ∆]) to all the points' coordinates so that the input now lies in the box [−2∆, 2∆]^d. This transformation does not change any distances and therefore preserves the k-median and k-means costs. The i-th level of the tree (for i ∈ {0, ..., log ∆}) is constructed by centering a grid of side length 2^{−i} · 2∆ at 0, making each grid cell a node in the tree. The parent of a cell c is simply the cell at level i − 1 that contains c, and the distance between c and its parent is set to 2^{−i} · 2∆ · √d (the length of the cell's diagonal). This embedding takes O(nd log ∆) time to construct, where log ∆ is the depth of the tree. The linearity in the log ∆ term comes from the fact that this is the maximum depth of the tree.
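The grid-cell computation underlying each level can be sketched as follows (a toy helper with our own name; real implementations store only the non-empty cells in a dictionary):

```python
import numpy as np

def cell_ids(P, delta, level):
    """Cell id of every point at a given quadtree level: cells have side
    length 2^-level * 2 * delta, so a point's cell is the tuple of its
    floored coordinates. Assumes P already lies in [-2*delta, 2*delta]^d."""
    side = 2.0 ** (-level) * 2.0 * delta
    return [tuple(c) for c in np.floor(P / side).astype(int)]
```

Two points share a cell at every level at least as coarse as the level where they first separate, which is exactly the structure the tree metric encodes.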
The distortion of this embedding is at most O(d log ∆), as stated in the following lemma:

Lemma 2.2 (Lemma 11.9 in [39]). The distances in the tree metric satisfy ∥p − q∥ ≤ E[dist_T(p, q)] ≤ O(d log ∆) · ∥p − q∥, where the expectation is taken over the random shift s of the decomposition.
A simple proof (and further intuition on quadtree embeddings) can be found in [39]. The result follows from combining linearity of expectation with the fact that two points p and q are separated at level i with probability at most √d · ∥p − q∥ · 2^i / ∆ (as in the proof of Lemma 4.3).

Fast-Coresets
Technical Preview. We start by giving an overview of the arguments in this section.
There exists a strong relationship between computing coresets and approximate solutions - one can quickly find a coreset given a reasonable solution and vice-versa. Thus, the general blueprint is as follows: we very quickly find a rough solution which, in turn, facilitates finding a coreset that approximates all solutions. Importantly, the coreset size depends linearly on the quality of the rough solution. Put simply, the coreset can compensate for a bad initial solution by oversampling. Our primary focus is therefore on finding a sufficiently good coarse solution quickly since, once this has been done, sampling the coreset requires linear time in n. Our theoretical contribution shows how one can find this solution on Euclidean data using dimension reduction and quadtree embeddings.

Algorithm 1 Fast-Coreset(P, k, ε, m)
1: Input: data P, number of clusters k, precision ε and target size m
2: Use a Johnson-Lindenstrauss embedding to embed P into a dataset P̃ of dimension O(log k)
3: Find an approximate solution C = {c_1, ..., c_k} on P̃ and an assignment σ : P → C by Fast-kmeans++
4: Compute importance scores s from C and σ as described in Section 2
5: Compute a set Ω of m points randomly sampled from P proportionate to s
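Algorithm 1 can be sketched end-to-end as follows. This is an illustrative sketch only: we substitute plain k-means++ seeding for Fast-kmeans++, so the seeding step runs in O(ndk) rather than Õ(nd log ∆), the projection-dimension constant is arbitrary, and the precision ε enters only through the target size m, which we take as given.

```python
import numpy as np

def kmeanspp_seed(P, k, rng):
    """Standard k-means++ seeding (stand-in for Fast-kmeans++)."""
    centers = [P[rng.integers(len(P))]]
    d2 = ((P - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        centers.append(P[rng.choice(len(P), p=d2 / d2.sum())])
        d2 = np.minimum(d2, ((P - centers[-1]) ** 2).sum(axis=1))
    return np.array(centers)

def fast_coreset(P, k, m, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: Johnson-Lindenstrauss projection to O(log k) dimensions.
    t = min(P.shape[1], max(2, int(np.ceil(4 * np.log(k + 1)))))
    G = rng.normal(0, 1 / np.sqrt(t), size=(P.shape[1], t))
    Q = P @ G
    # Step 3: approximate solution and assignment on the projected data.
    centers = kmeanspp_seed(Q, k, rng)
    labels = ((Q[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    # Steps 4-5: importance scores and proportional sampling.
    costs = ((Q - centers[labels]) ** 2).sum(axis=1)
    sigma = costs / costs.sum() + 1 / np.bincount(labels, minlength=k)[labels]
    idx = rng.choice(len(P), size=m, replace=True, p=sigma / sigma.sum())
    return P[idx], sigma.sum() / (m * sigma[idx])
```

Note that the sample is drawn from the original points P, while the solution and scores are computed in the projected space.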
Formal Results. In this section, we first combine two existing results to produce a strong coreset in time Õ(nd log ∆), where ∆ is the spread of the input. We show afterwards how to reduce the log ∆ dependency to log log ∆, giving the desired nearly-linear runtime.
Our method is based on the following observations about the group sampling [25] and sensitivity sampling [37] coreset construction algorithms. Both start by computing a solution C. When C is a c-approximation, they compute a coreset with distortion 1 ± cε of size Õ(k ε^{−z−2}) and Õ(k ε^{−2z−2}), respectively. Hence, by rescaling ε, they provide an ε-coreset of size Õ(k (ε/c)^{−z−2}) and Õ(k (ε/c)^{−2z−2}). This leads to the following fact:

Fact 3.1. Given a c-approximate solution C along with its assignment, sensitivity sampling computes an ε-coreset of size Õ(k (ε/c)^{−2z−2}) in time linear in n.

To turn Fact 3.1 into an algorithm, we use the quadtree-based Fast-kmeans++ approximation algorithm from [23], which has two key properties: (1) Fast-kmeans++ runs in Õ(nd log ∆) time (Corollary 4.3 in [23]), and (2) Fast-kmeans++ computes an assignment from input points to centers that is an O(d^z log k) approximation to k-median (z = 1) and k-means (z = 2) (see Lemma 3.1 in [23] for z = 2 and the discussion above that lemma for z = 1). Applying dimension reduction techniques [50], the dimension d may be replaced by O(log k) in time Õ(nd). This results in an O(log^{z+1} k) approximation.
The second property is crucial for us: the algorithm does not only compute centers, but also assignments, in Õ(nd log ∆) time. Thus, it satisfies the requirements of Fact 3.1 as a sufficiently good initial solution. We describe how to combine Fast-kmeans++ with sensitivity sampling in Algorithm 1 and prove in Section 8.1 that this computes an ε-coreset in time Õ(nd log ∆):

Corollary 3.2. Algorithm 1 runs in time Õ(nd log ∆) and computes an ε-coreset for k-means.
Furthermore, we generalize Algorithm 1 to other fast k-median approaches in Section 8.4. Thus, one can combine existing results to obtain an ε-coreset without an O(nk) time-dependency. However, this has only replaced the O(nd + nk) runtime by O(nd log ∆). Indeed, the spirit of the issue remains - this is not near-linear in the input size.
We verify that log ∆ can be on the same order as n by devising a dataset that has n − n′ points distributed uniformly in the [−1, 1]^2 square. Then, for r ∈ Z_+, we produce a sequence of points at (0, 1), (0, 0.5), ..., (0, 0.5^r) and copy this sequence n′/r times, each time with a different x coordinate. The result is a dataset of size n where log ∆ grows linearly with r ∈ o(n). The resulting linear time-dependency can be seen in Table 1.

Reducing the Impact of the Spread

We now show how one can replace the linear time-dependency on log ∆ with a logarithmic one (i.e., log ∆ → log log ∆).
Overview of the Approach Without loss of generality, we assume that the smallest pairwise distance is at least 1, and that ∆ is a (known) upper-bound on the diameter of the input. To remove the log ∆ dependency, we proceed by producing a substitute dataset P′ that has a lower-bound on its minimum distance and an upper-bound on its maximum distance. We then show that, with overwhelming likelihood, reasonable solutions on P′ have the same cost as solutions on P up to an additive error.
In order to produce the substitute dataset P′, we first find a crude upper-bound on the cost of the optimal solution to the clustering task on P. We then create a grid such that, for every cluster in the optimal solution, the cluster is entirely contained in a grid cell with overwhelming likelihood - in some sense, points in different grid cells do not interact. We can then freely move these grid cells around without affecting the cost of the solution, as long as they stay far enough away from one another that points in different cells still do not interact. Thus, we can constrain the diameter of P′ to be small with respect to the quality of the approximate solution. We show later how to constrain the minimum distance to be comparable to the quality of this approximate solution as well.
We will focus this section on the simpler k-median problem but show how to reduce k-means to this case in Section 8.2.

Computing a crude upper-bound
As described, we start by computing an approximate solution U such that OPT ≤ U ≤ poly(n) · OPT. For this, the first step is to embed the input into a quadtree. This embedding has two key properties. First, distances are preserved up to a multiplicative factor O(d log ∆), and therefore the k-median cost is preserved up to this factor as well. Second, the metric is a hierarchically separated tree: it can be represented with a tree, where the points of P are the leaves. The distance between two points is then given by the depth of their lowest common ancestor - if it is at depth ℓ, their distance is √d∆2^{−ℓ}. Our first lemma shows that finding the first level of the tree for which the input lies in at least k + 1 disjoint subtrees provides us with the desired approximation.

Lemma 4.1. [Proof in Section 8.3] Let ℓ be the first level of the tree with at least k + 1 non-empty subtrees. Then, √d∆2^{−ℓ} ≤ OPT_T ≤ O(n) · √d∆2^{−ℓ}, where OPT_T is the optimal k-median solution in the tree metric.
We prove this in Section 8.3. A direct consequence is that the first level of the tree for which at least k + 1 cells are non-empty allows us to compute an O(n)-approximation for k-median on the tree metric. Since the tree metric approximates the original Euclidean metric up to O(d log ∆), this is therefore an O(nd log ∆)-approximation to k-median in the Euclidean space.
To turn this observation into an algorithm, one needs to count the number of non-empty cells at a given level ℓ: for each point, we identify the cell that contains it using modulo operations. Furthermore, we count the number of distinct non-empty cells using a standard dictionary data structure. This is done in time Õ(nd), and pseudo-code is given in Algorithm 2. Using a binary search on the O(log ∆) many levels then gives the following result:

Lemma 4.2. [Proof in Section 8.3] There is an algorithm running in time Õ(nd log log ∆) that computes an O(nd log ∆)-approximation to k-median, and an O(n^3 d^2 log^2 ∆)-approximation to k-means.
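This counting-plus-binary-search procedure can be sketched directly (toy helpers with our own names; the binary search is valid because the count of non-empty cells is non-decreasing as the grid gets finer):

```python
import numpy as np

def nonempty_cells(P, delta, level):
    """Number of distinct non-empty grid cells at a level, via a dictionary
    (here, a set of coordinate tuples) in O(nd) time."""
    side = 2.0 ** (-level) * 2.0 * delta
    return len({tuple(c) for c in np.floor(P / side).astype(int)})

def first_level_with_k_plus_one(P, delta, k, max_level):
    """Binary search over the O(log Delta) levels for the first level with
    at least k + 1 non-empty cells."""
    lo, hi = 0, max_level
    while lo < hi:
        mid = (lo + hi) // 2
        if nonempty_cells(P, delta, mid) >= k + 1:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Each probe of the binary search costs Õ(nd), and there are O(log log ∆) probes, matching the Õ(nd log log ∆) bound of Lemma 4.2.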
Given this crude approximate solution, it remains to create a substitute dataset P′ that satisfies two requirements: (1) First, P′ must have spread linear in the quality of the approximate solution. If this holds, Algorithm 1 on P′ will take Õ(nd log log ∆) time.
(2) Second, any reasonable solution on P′ should be roughly equivalent to a corresponding solution on P. This would mean that running Algorithm 1 on P′ gives us a valid coreset for P.

From Approximate Solution to Reduced Spread
Let U be an upper-bound on the optimal cost, computed via Lemma 4.2. We place a grid with side length r := √d n^2 · U, centered at a random point in {0, ..., r}^d. The following lemma ensures that, with high probability (over the randomness of the center of the grid), no cluster of the optimal solution is spread over several grid cells.
Lemma 4.3. The probability over the center's location that two points p and q are in different grid cells is at most √d · ∥p − q∥ / r.

Proof. We first bound the probability that there is a grid line along the i-th dimension between p and q. Let p_i, q_i be the i-th coordinates of p and q, assume p_i ≤ q_i, and let ℓ ∈ Z be such that p_i − ℓr ∈ [0, r). Then, p and q are separated along dimension i if and only if the i-th coordinate of the center is in [p_i − ℓr, q_i − ℓr]. This happens with probability |p_i − q_i|/r. Finally, a union-bound over all coordinates shows that p and q are in different grid cells with probability at most ∑_i |p_i − q_i|/r ≤ √d · ∥p − q∥ / r.

Since U is larger than the distance between any input point and its center in the optimal solution, a union-bound ensures that with probability 1 − 1/n, no cluster of this solution is split among different cells. In particular, there are at most k non-empty cells, which we can identify using a dictionary. We call these "boxes". Each box j has a middle point, which we call m_j.
From this input, we build a new set of points P′ as follows. For each dimension i ∈ {1, ..., d}, sort the k boxes according to their value in dimension i. Then, for each j ∈ {1, ..., k}, let m_j^i be the i-th coordinate of the middle point of the j-th box. If m_{j+1}^i − m_j^i ≥ 2r, then for all boxes j′ with j′ > j, shift the points of j′ by m_{j+1}^i − m_j^i − 2r in the i-th dimension. This can be implemented in near-linear time, as described in Algorithm 3 (presented in Section 8). In essence, we take boxes that are far apart and bring them closer together. The dataset P′ obtained after these transformations has the following properties:

Proposition 4.4. Let P′ be the dataset produced by Algorithm 3. It holds that: (1) in P′, the diameter of the input is O(dn^2 · U · k) with probability 1 − 1/n over the randomness of the grid, and (2) two boxes that are adjacent (respectively non-adjacent) in P are still adjacent (resp. non-adjacent) in P′.
Proof. By construction, the maximum distance between the middle points of any two consecutive boxes in P′ is 2r = 2√d n^2 · U. Lemma 4.3 ensures that, with probability 1 − 1/n^2, any two points in the same cluster of the optimal solution (i.e., at distance at most U from each other) are in the same box. Therefore, a union-bound ensures that, with probability 1 − 1/n, there are at most k boxes. It then follows that the total distance along a coordinate is at most 2kr, and the diameter of the whole point set is √d · 2kr. If two boxes were adjacent in P, then along any dimension their middle points have distance at most r. Thus, if one of the adjacent boxes moves along any dimension, the other must as well, and they remain adjacent. If they are not adjacent, there is at least one dimension where their middle points are at distance at least 2r from one another. Along each such dimension, the shift cannot bring them closer than to within 2r of one another, and they stay non-adjacent.
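The shifting step can be sketched as follows (our own minimal rendering of the idea behind Algorithm 3; middle points are approximated by per-box means, and the cumulative shift along each dimension is applied to every later box in sorted order):

```python
import numpy as np

def reduce_diameter(P, box_of, r):
    """Pull boxes together: along each dimension, sort boxes by their middle
    point; whenever consecutive middles are >= 2r apart, shift all later
    boxes so the gap shrinks to exactly 2r. Smaller gaps are untouched, so
    adjacency and non-adjacency are preserved."""
    P = P.copy()
    labels = np.asarray(box_of)
    ids = np.unique(labels)
    for dim in range(P.shape[1]):
        mids = np.array([P[labels == b, dim].mean() for b in ids])
        order = np.argsort(mids)
        shift = 0.0
        for prev, cur in zip(order[:-1], order[1:]):
            gap = mids[cur] - mids[prev]
            if gap >= 2 * r:
                shift += gap - 2 * r
            P[labels == ids[cur], dim] -= shift
    return P
```

Because shifts only ever remove empty space between boxes, intra-box distances (and hence intra-cluster costs) are untouched.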
The first property allows us to reduce the spread to (nd log ∆)^{O(1)}. Indeed, one can round the coordinates of every point in P′ to the closest multiple of g := U / (n^4 d^2 log ∆). Combined with the diameter reduction, this ensures that the spread of the dataset obtained is at most (nd log ∆)^{O(1)}. Furthermore, the second property of Proposition 4.4 combined with the choice of g ensures that the cost of any reasonable solution is the same before and after the transformation, as stated in the following lemma:

Lemma 4.5. Let P′ be the outcome of the diameter reduction and rounding steps. With probability 1 − 1/n over the randomness of the grid, P′ has spread (nd log ∆)^{O(1)}.
Suppose U is such that OPT ≤ U ≤ dn^2 OPT, and let S′ be a solution for k-median (resp. k-means) on P′ with cost at most dn^2 OPT (resp. d^2 n^4 OPT for k-means). Then, one can compute a k-median solution for P with the same cost as S′ on P′, up to an additive error OPT/n, in time O(nd). This also works with the roles of P′ and P exchanged.
Proof. First, in rounding points to the closest multiple of g, the distance between any point and its rounding is at most √d · g ≤ OPT/n^2. Summing over all points, any solution computed on the gridded data has cost within an additive factor ±OPT/n of the true cost. Let S be the solution obtained from S′ by reversing the construction of P′, namely re-adding the shift that was subtracted from every box. Since adjacency is preserved by Proposition 4.4, all points that are in the same cluster have the same shift, and therefore all intra-cluster distances are the same in P and P′. Therefore, the costs are equal and, by extension, cost(S) ∈ cost(S′) ± OPT/n, where the additive OPT/n comes from the rounding.
Finally, the smallest non-zero distance is g = U / (n^4 d^2 log ∆) and, with probability 1 − 1/n, the diameter is O(dn^2 · U · k) (by Proposition 4.4), implying that the spread of P′ is (nd log ∆)^{O(1)}.
Combining the algorithm of Lemma 4.2, which gives a bound on U, with Lemma 4.5 brings us to the following theorem:

Theorem 4.6. Given P ⊂ R^d with spread ∆, there is an algorithm running in time O(nd log log ∆) that computes a set P′ such that (1) with probability 1 − 1/n, the spread of P′ is poly(n, d, log ∆) and (2) any solution with cost at most dn^2 OPT for k-median (resp. d^2 n^4 OPT for k-means) on P′ can be converted in time O(nd) into a solution with the same cost on P, up to an additive error O(OPT/n).
To summarize, we have shown that one can, in Õ(nd) time, find a modified dataset P′ that preserves the cost of corresponding k-means and k-median solutions from P. Importantly, this P′ has spread polynomial in n, d, and log ∆. As a result, one can apply Algorithm 1 to P′ in Õ(nd) time without compromising the compression quality with respect to P. Lastly, this compression on P′ can be re-formatted onto P in Õ(nd) time.

Fast Compression in Practice
Despite the near-linear time algorithm described in Sections 3 and 4, the coreset construction of Algorithm 1 nonetheless requires a bounded approximation to the clustering task before the sampling can occur. Although theoretically justified, it is unclear how necessary this is in practice - would a method that cuts corners still obtain good practical compression?

Metrics To answer this question, we analyze the sampling methods along two metrics - compression accuracy and construction time. Although measuring runtime is standard, it is unclear how to confirm that a subset of points satisfies the coreset property over all solutions. To this end, we use the distortion measure introduced in [57]: max(cost(P, C_Ω)/cost(Ω, C_Ω), cost(Ω, C_Ω)/cost(P, C_Ω)), where C_Ω is a candidate solution computed over the coreset Ω. This will be within 1 + ε if the coreset guarantee is satisfied but may be unbounded otherwise. We refer to this as the coreset distortion.
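Computing this measure is straightforward once a weighted solver is available; a sketch for k-means (the `solver` argument is a stand-in for any weighted k-means routine, and the helper name is ours):

```python
import numpy as np

def coreset_distortion(P, omega, weights, k, solver):
    """Coreset distortion: solve on the compression, then compare that
    solution's cost on the full data against its cost on the compression.
    A valid eps-coreset keeps this ratio within 1 + eps."""
    C = solver(omega, weights, k)  # k centers fit on the weighted sample
    full = ((P[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(1).sum()
    core = (weights * ((omega[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(1)).sum()
    return max(full / core, core / full)
```

Since the maximum is taken over only the solutions the solver produces (rather than all solutions), this is a lower bound on the true worst-case distortion, which is what makes it computable.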

Goal and Scope of the Empirical Analysis
To motivate the upcoming experiments, we begin by asking "how do other sampling strategies compare to standard sensitivity sampling?" For this preliminary experiment, we focus on uniform sampling and the Fast-Coreset algorithm (Algorithm 1). For each, we evaluate its distortion across the following real datasets: Adult [32], MNIST [48], Taxi [52], Star [55], Song [12], Census [53], and Cover Type [13].
The dataset characteristics are summarized in Table 3.
The resulting comparisons can be found in Table 2. Since sensitivity sampling is the recommended coreset method [57], we use it to obtain a baseline distortion for each dataset 8 . We then compare uniform sampling and Fast-Coresets against this baseline by showing the ratio of their distortion to the distortion obtained by sensitivity sampling. As guaranteed by Section 4, we see that Fast-Coresets obtain distortions consistent with standard sensitivity sampling. However, the question is more subtle for uniform sampling: it performs well on most real-world datasets but catastrophically fails on a few (for example, it is ∼600× worse on the Taxi dataset when compared with standard sensitivity sampling).
This confirms that uniform sampling is not unequivocally reliable as an aggregation method: although it is fast, it has the potential to miss important data points. On the other end of the runtime vs. quality spectrum, Fast-Coresets consistently provide an accurate compression but, despite being the fastest coreset method, are still significantly slower than uniform sampling. Thus, the fundamental question is: when is one safe to use fast, inaccurate sampling methods, and when is the full coreset guarantee necessary?
We focus the remainder of the experimental section on this question. Specifically, we define a suite of compression methods that interpolate between uniform sampling and Fast-Coresets and evaluate these methods across a set of synthetic datasets. This allows us to experimentally verify the full spectrum of speed vs. accuracy tradeoffs and provide insight into which dataset characteristics are necessary before one can apply suboptimal sampling methods.
We complement this with a comprehensive study of related topics, such as additional analysis on real datasets, comparing against other state-of-the-art coreset methods, and verifying that these results extend to the streaming setting.

Experimental Setup
Algorithms We compare Fast-Coresets (Algorithm 1) against 4 different benchmark sampling strategies that span the space between optimal time and optimal accuracy, as well as state-of-the-art competitors BICO [38] and Streamkm++ [1].
-Standard uniform sampling. Each point is sampled with equal probability and weights are set to n/m, where m is the size of the sample.
-Lightweight coresets. Each point is sampled via sensitivity sampling with respect to the candidate 1-means solution, i.e., the mean of the dataset.
-Welterweight coresets. For a given j ∈ {1, ..., k}, we compute a coreset using sensitivity sampling with respect to a candidate j-means solution.
We do not compare against group sampling [25] as it uses sensitivity sampling as a subroutine and is merely a preprocessing algorithm designed to facilitate the theoretical analysis; by the authors' own admission, it should not be implemented. We use j going forward to describe the number of centers in the candidate j-means solution. Thus, lightweight coresets have j = 1 while Fast-Coresets have j = k.

We take a moment here to motivate the welterweight coreset algorithm. Consider that lightweight coresets use the 1-means solution to obtain the sensitivities that dictate the sampling distribution, whereas sensitivity sampling uses the full k-means solution. In effect, as we change the value of j, the cluster sizes |C_p| in our approximate solution change. Referring back to equation 1, one can see that setting j between 1 and k acts as an interpolation between uniform and sensitivity sampling. We default to j = log k unless stated otherwise. We use the term 'accelerated sampling methods' when referring to uniform, lightweight and welterweight coresets as a group.
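To make the interpolation concrete, here is a hedged numpy sketch of sensitivity sampling against a candidate j-means solution. It uses the common sensitivity upper bound s(p) ∝ cost(p, A)/cost(P, A) + 1/|C_p| as a stand-in for the paper's equation 1; with the dataset mean as the single center (j = 1) this reduces to lightweight-style sampling, while a full k-means solution (j = k) recovers standard sensitivity sampling.

```python
import numpy as np

def sensitivity_sample(P, centers, m, rng):
    """Sensitivity sampling w.r.t. a candidate j-means solution `centers`.

    Uses the common sensitivity upper bound
        s(p) ~ cost(p, A) / cost(P, A) + 1 / |C_p|,
    an assumption standing in for the paper's equation 1."""
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)       # cluster of each point in the candidate solution
    point_cost = d2.min(axis=1)
    sizes = np.bincount(assign, minlength=len(centers))
    s = point_cost / max(point_cost.sum(), 1e-12) + 1.0 / sizes[assign]
    prob = s / s.sum()
    idx = rng.choice(len(P), size=m, replace=True, p=prob)
    return P[idx], 1.0 / (m * prob[idx])  # sampled points and unbiased weights
```

Varying the number of centers passed in plays the role of j: more centers spread the probability mass over more clusters, pulling the distribution away from (near-)uniform and toward full sensitivity sampling.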
Datasets. We complement our suite of real datasets with the following artificial datasets. We default to n = 50 000 and d = 50 unless stated otherwise.
-c-outlier. Place n − c points in a single location and c points a large distance away.
-Gaussian mixture. A set of scattered Gaussian clusters of varying density.
These clusters are sequentially defined, with the size of the first cluster given by (n/κ) exp(γ · ρ_0), where κ is the number of Gaussian clusters, ρ_0 is uniformly chosen from [−0.5, 0.5], and γ is a hyperparameter that affects the distribution of cluster sizes. Then, given clusters {c_1, ..., c_i}, we obtain the size of the (i + 1)-st cluster analogously. This has the property that all clusters have size n/k when γ = 0 and, as γ grows, the cluster sizes diverge at an exponential rate. We note that this is a well-clusterable instance with respect to cost-stability conditions.
-Benchmark. A specific distribution of points introduced in [57] as a testbed for coreset algorithms. It has the property that all reasonable k-means solutions are of equal quality but are maximally far apart in the solution space. Thus, the dataset is fully determined by the number of centers k. As suggested, we produce three benchmark datasets of varying size before applying random offsets to each. Note that the benchmark dataset being difficult for sensitivity sampling does not imply that it should be equally difficult for other sampling methods.
The artificial datasets are constructed to emphasize strengths and weaknesses of the various sampling schemas. For example, the c-outlier problem contains very little information and, as such, should be simple for any sampling strategy that builds a reasonable representation of its input. The geometric dataset then increases the difficulty by having more regions of interest that must be sampled. The Gaussian mixture dataset is harder still, as it incorporates uneven inter-cluster distances and inconsistent cluster sizes. Lastly, the benchmark dataset is devised to be a worst-case example for sensitivity sampling.
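For reproducibility, a sketch of the Gaussian mixture generator described above might look as follows. The normalization so that sizes sum to n and the scale of the cluster means are assumptions; the text only specifies cluster sizes proportional to exp(γ · ρ_i) with ρ_i uniform on [−0.5, 0.5].

```python
import numpy as np

def gaussian_mixture(n=50_000, d=50, kappa=50, gamma=0.0, seed=0):
    """Sketch of the Gaussian-mixture generator: cluster sizes proportional
    to exp(gamma * rho_i) with rho_i ~ U[-0.5, 0.5]. Normalizing the sizes
    to sum to n and the spread of the cluster means are assumptions."""
    rng = np.random.default_rng(seed)
    rho = rng.uniform(-0.5, 0.5, size=kappa)
    raw = np.exp(gamma * rho)
    sizes = np.maximum(1, (n * raw / raw.sum()).astype(int))
    sizes[-1] += n - sizes.sum()                     # fix integer rounding
    means = rng.normal(scale=10.0, size=(kappa, d))  # scattered cluster centers
    X = np.vstack([rng.normal(loc=mu, size=(s, d)) for mu, s in zip(means, sizes)])
    y = np.repeat(np.arange(kappa), sizes)
    return X, y
```

With γ = 0 all clusters have n/κ points; increasing γ makes the sizes diverge exponentially, which is exactly the imbalance swept over in Table 7.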
Data Parameters. In all real and artificial datasets, we add random uniform noise η with 0 ≤ η_i ≤ 0.001 in each dimension in order to make all points unique. Unless specifically varying these parameters, we default all algorithms in Section 5.2 to k = 100 for the Adult, MNIST, Star, and artificial datasets and k = 500 for the Song, Cover Type, Taxi, and Census datasets. Our default coreset size is then m = 40k. We refer to the coreset size scalar (the "40" in the previous sentence) as the m-scalar. We only run the dimension-reduction step on the MNIST dataset, as the remaining datasets already have sufficiently low dimensionality. We run our experiments on an Intel Core i9 10940X 3.3GHz 14-Core processor.

Evaluating Sampling Strategies
Theoretically guaranteed methods. We first round out the comparison between the Fast-Coreset algorithm and standard sensitivity sampling. Specifically, the last columns of Tables 4 and 5 show that the Fast-Coreset method produces compressions of consistently low distortion and that this

Figure 2: Top: the effect of the m-scalar on coreset distortion for real-world datasets. This is a visualization of the data in Table 4. Bottom: the effect of the m-scalar on the algorithm runtime for real-world datasets. All values are the mean over 5 runs. The three bars represent samples of size m = 40k, 80k.
holds across datasets, m-scalar values and in the streaming setting. Despite this, Figure 1 shows that varying k from 50 to 400 causes a linear slowdown in sensitivity sampling but only a logarithmic one for the Fast-Coreset method. This analysis confirms the theory in Section 4: Fast-Coresets obtain equivalent compression to sensitivity sampling but do not have a linear runtime dependence on k. We therefore do not add traditional sensitivity sampling to the remaining experiments.
Speed vs. Accuracy. We now refer the reader to the remaining columns of Table 4 and to Figure 2, where we show the effect of coreset size across datasets by sweeping over m-scalar values. Despite the suboptimal theoretical guarantees of the accelerated sampling methods, we see that they obtain competitive distortions on most of the real-world datasets while running faster than Fast-Coresets in practice. However, uniform sampling breaks on the Taxi and Star datasets: Taxi corresponds to the 2D start locations of taxi rides in Porto and has many clusters of varied size, while Star contains the pixel values of an image of a shooting star (most pixels are black except for a small cluster of white pixels). Thus, it seems that uniform sampling requires well-behaved datasets, with few outliers and consistent class sizes.
To verify this, consider the results of these sampling strategies on the artificial datasets in Table 4 and Figure 2: as disparity in cluster sizes and distributions grows, the accelerated sampling methods have difficulty capturing all of the outlying points in the dataset. Thus, Figure 2 shows a clear interplay between runtime and sample quality: the faster the method, the more brittle its compression.
While uniform sampling is expected to be brittle, it may be less obvious what causes light- and welterweight coresets to break. The explanation is simple for lightweight coresets: they sample according to the 1-means solution and are therefore biased towards points that are far from the mean. Thus, as a simple counterexample, lightweight coresets are likely to miss a small cluster that is close to the center-of-mass of the dataset. This can be seen in Figure 3, where we show an example where the lightweight coreset construction fails on the Gaussian mixture dataset. Since the small circled cluster is close to the center-of-mass of the dataset, it is missed when sampling according to distance from the mean.
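This failure mode is easy to reproduce. In the sketch below, a small cluster at the center of mass receives less sampling mass per point than two large outer clusters under a lightweight-style distribution; the half-uniform, half-distance-based mix is the usual convention and an assumption here.

```python
import numpy as np

def lightweight_probs(P):
    """Lightweight-coreset sampling distribution: half the mass is uniform,
    half is proportional to squared distance from the dataset mean.
    (The 1/2-1/2 split is the usual convention and an assumption here.)"""
    mu = P.mean(axis=0)
    d2 = ((P - mu) ** 2).sum(axis=1)
    return 0.5 / len(P) + 0.5 * d2 / d2.sum()

# Tiny version of the Figure 3 failure mode: a small cluster sitting at the
# center of mass is sampled no more aggressively than the uniform baseline.
rng = np.random.default_rng(0)
left = rng.normal(loc=[-20.0, 0.0], size=(990, 2))
right = rng.normal(loc=[20.0, 0.0], size=(990, 2))
small = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(20, 2))
P = np.vstack([left, right, small])
q = lightweight_probs(P)  # the last 20 entries (the small cluster) get the least mass
```

Since the small cluster sits near the mean, its per-point probability stays near the uniform floor 1/(2n), while points in the outer clusters receive roughly double that, so a coreset of a few hundred points will often miss the small cluster entirely.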
Generalizing this reasoning also explains the brittleness of welterweight coresets when j < k. To see this, let C_j be the approximation obtained during welterweight coreset construction and observe that the sum of importance values concentrates on the points belonging to the centers of C_j. Thus, our probability mass is distributed across the clusters that have been found in the approximate solution. Naturally, if j < k and we missed a cluster from OPT, there is some set of points that have not received an appropriate probability mass and may therefore be missed.
We evaluate the full extent of this relationship in Table 7, where we show the interplay between the welterweight coreset's j parameter (number of centers in the approximate solution) and the Gaussian mixture dataset's γ parameter (higher γ leads to higher class imbalance). We can consider this as answering the question "How good must our approximate solution be before sensitivity sampling can handle class imbalance?" To this end, all the methods have low distortion for small values of γ but, as γ grows, only Fast-Coresets (and, to a lesser extent, welterweight coresets for larger values of j) are guaranteed to have low distortion. For completeness, we verify that these results also hold for the k-median task in Figure 4. There, we see that k-median distortions across datasets are consistent with k-means distortions. We show one of five runs to emphasize the random nature of compression quality when using various sampling schemas.
To round out the dataset analysis, we note that BICO performs consistently poorly on the coreset distortion metric9 , as can be seen in Table 6. We also analyze the Streamkm++ method across the artificial datasets in Table 9 with m = 40k and see that it obtains poor distortions compared to sensitivity sampling. This is due to Streamkm++'s required coreset size (logarithmic in n and exponential in d) being much larger than that of sensitivity sampling, whose coresets depend on neither parameter. We did not include Streamkm++ in Tables 4 and 5 due to its suboptimal coreset size, distortion and runtime.
Lastly, we point out that every sampling method performs well on the benchmark dataset, which is designed to explicitly punish sensitivity sampling's reliance on the initial solution. Thus, we verify that there is no setting that breaks sensitivity sampling.
We lastly show how well these compression schemas facilitate fast clustering on large datasets in Table 8. Consider that a large coreset distortion means that the centers obtained on the coreset poorly represent the full dataset. However, among sampling methods with small distortion, it may be the case that one consistently leads to the 'best' solutions. To test this, we compare the solution quality across all fast methods on the real-world datasets, where coreset distortions are consistent. Indeed, Table 8 shows that no sampling method leads to solutions with consistently minimal costs.

Streaming Setting
One of the most common use-cases for big-data algorithms is the streaming setting, where one receives input in batches and must maintain a compression that is representative of the dataset. Although there is a wealth of sampling and coreset methods in the streaming paradigm, we require consistency across algorithms and therefore assume a black-box sampling procedure. Since the coreset property is preserved under composition, we utilize the merge-&-reduce strategy originally proposed by [11] and first applied to maintaining clustering coresets in a stream by [40]. The idea is to first partition the input into b blocks and then perform sampling and composition along them until a single compression is obtained. Specifically, we start by obtaining a coreset on each block. Then, combining samples using a complete binary tree, we (1) recursively re-sample from the children until there is at most one coreset for each level10 in the tree and then (2) concatenate these samples and obtain one final coreset from the composition. Since we are composing coresets from coresets, the errors stack and, in theory, we should require more samples to obtain a similar accuracy.
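The merge-&-reduce loop can be sketched with a black-box compression step; uniform sampling stands in for any coreset routine here, and the binary-counter bookkeeping follows the standard construction rather than a specific implementation from the paper.

```python
import numpy as np

def compress(P, m, rng):
    """Black-box compression step; uniform sampling stands in for any coreset."""
    idx = rng.choice(len(P), size=min(m, len(P)), replace=False)
    return P[idx]

def merge_and_reduce(blocks, m, rng):
    """Merge-&-reduce over a stream of blocks: keep at most one compressed
    bucket per level of an implicit complete binary tree, merging two
    equal-level buckets (binary-counter style), then compose at the end."""
    levels = {}  # level -> compressed bucket
    for block in blocks:
        bucket, lvl = compress(block, m, rng), 0
        while lvl in levels:  # two buckets at this level: merge and promote
            bucket = compress(np.vstack([levels.pop(lvl), bucket]), m, rng)
            lvl += 1
        levels[lvl] = bucket
    # final composition: concatenate the per-level buckets and compress once
    return compress(np.vstack(list(levels.values())), m, rng)
```

Each point participates in O(log b) compression steps, which is why errors stack and the theory asks for somewhat larger samples in stream.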
Despite this, we see the opposite result for many of our sampling strategies. Surprisingly, Table 5 and Figure 5 show that the accelerated sampling methods all perform at least as well under composition on the artificial datasets and do not suffer significant drops in accuracy, variance or runtime on the real datasets. Although inconsistent with the prevailing intuition, we must therefore conclude that the accelerated sampling methods are equally feasible in the streaming setting. We suspect that this is due to the non-uniformity imposed by the merge-&-reduce algorithm. To see this, consider uniform sampling on the c-outlier dataset during the final step of the composition, where we are composing the samples corresponding to each layer of the tree. Assume first that our outlier points happened to fall in the first block. Then we have taken a sample of size m from this block and immediately use it for the final composition. Thus, in this case the outliers are more likely to be in our final sample than in the non-streaming setting. In the alternate setting, where the outliers are less likely to be sampled, our expected error is already high and missing the outliers 'more' cannot worsen it.

Takeaways
To summarize the comparisons between sampling strategies and datasets, the practical guideline is that uniform sampling usually works well, so an optimistic user can default to it (while accepting that it might fail). This was evidenced in Sections 5.1 and 5.3. However, it was also shown there that the accelerated sampling methods all have a chance of catastrophically failing. Thus, in accordance with Table 7, a cautious user may wish to verify whether uniform sampling will work by checking how balanced the dataset's clusters are. By our theoretical contribution (Corollary 3.2 and Theorem 4.6), performing this verification is just as expensive as computing a coreset. Thus, the cautious user may as well create a Fast-Coreset in that amount of time. Importantly, we do not claim that there is a 'best' algorithm among the ones that have been discussed.

Conclusion
In this work, we discussed the theoretical and practical limits of compression algorithms for center-based clustering. We proposed the first nearly-linear time coreset algorithm for k-median and k-means. Moreover, the algorithm can be parameterized to achieve an asymptotically optimal coreset size. Subsequently, we conducted a thorough experimental analysis comparing this algorithm with fast sampling heuristics. In doing so, we found that although the Fast-Coreset algorithm achieves the best compression guarantees among its competitors, naive uniform sampling already provides a sufficient compression for downstream clustering tasks on well-behaved datasets. Furthermore, we found that intermediate heuristics interpolating between uniform sampling and coresets play an important role in balancing efficiency and accuracy.
Although this closes the door on the highly-studied problem of optimally small and fast coresets for k-median and k-means, open questions of wider scope still remain. For example, when does sensitivity sampling guarantee accurate compression with optimal space in linear time, and can these conditions be formalized? Furthermore, sensitivity sampling is incompatible with paradigms such as fair clustering [8,15,21,43,56], and it is unclear whether one can expect a linear-time method to optimally compress a dataset while adhering to the fairness constraints.
Algorithm 2 Crude-Approx(P)
1: procedure Count-Distinct-Cells(P, c, ℓ) ▷ data P; c is the center of the quadtree grid at level ℓ
2:   Let D be a dictionary, and count = 0.
3:   for each point p ∈ P do
4:     let c_p be the center of the cell containing p; the i-th coordinate of c_p is ⌊(p_i − c_i)/2^ℓ⌋ · 2^ℓ + 2^ℓ/2.
5:     if c_p is not a key of D then
6:       insert c_p in D and set count ← count + 1.

for each point p ∈ P do
  ...
Identify the non-empty dictionary keys c_1, ..., c_k, and let C_1, ..., C_k be the corresponding cells. Sort the cells according to the i-th coordinate of their centers. Let δ = 0.
  ...
Subtract δ from the i-th coordinate of all points in the j-th cell.
12: Output: the dataset P′ consisting of all shifted points.

14: Round each coordinate of the points to the closest multiple of g := U/(n^4 d^2 log ∆).
15: Output: the dataset P′ with all points after rounding.
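For concreteness, the cell-center computation and the counting loop of Count-Distinct-Cells translate directly into numpy; the function names are ours.

```python
import numpy as np

def cell_center(p, c, ell):
    """Center of the level-`ell` quadtree cell containing p, for a grid
    anchored at c: the i-th coordinate is
    floor((p_i - c_i) / 2^ell) * 2^ell + 2^ell / 2."""
    side = 2.0 ** ell
    return np.floor((p - c) / side) * side + side / 2

def count_distinct_cells(P, c, ell):
    """Counting loop of Count-Distinct-Cells: the number of distinct
    non-empty cells at level `ell`, via a set of cell centers."""
    return len({tuple(cell_center(p, c, ell)) for p in P})
```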
Proofs, Pseudo-Code, and Extensions

We put the proofs, pseudo-code and algorithmic extensions towards the end of the paper for improved readability of the primary text. Algorithm 2 corresponds to the discussion in Section 4.1 and Algorithm 3 corresponds to the discussion in Section 4.2.

Proof of Corollary 3.2
Recall that Corollary 3.2 states that Algorithm 1 produces an ε-coreset in time Õ(nd log ∆).
On the projected dataset, the algorithm Fast-kmeans++ runs in time Õ(n log ∆), and its solution has an approximation ratio O(d^z log k) = O(log^(z+1) k) for P. The guarantee offered by the embedding ensures that the clustering {C_1, ..., C_k} still has approximation ratio O(log^(z+1) k) for P [50].
For k-means, computing the 1-mean solution for each C_i takes time O(nd) (the 1-mean is simply the mean). For k-median, computing the 1-median solution can likewise be done in time O(nd) [20]. We note that both may be approximated to a factor 2 in constant time by sampling uniformly at random a few points from each cluster [26].
Provided the c i and the partition C i , computing |C i | and cost(C i , c i ) for all i also takes time O(nd).
Since the solution consisting of assigning each p ∈ C_i to c_i is an O(log^(z+1) k)-approximation, the values s(p) defined in Algorithm 1 can be used to perform the coreset construction algorithm. In HST metrics, solving k-median can be done in linear time using dedicated algorithms (see e.g. [24]). Using the solution from the HST metric, one can compute a coreset, and iterate using the previous argument. This embedding into an HST is very similar to what is done by the Fast-kmeans++ algorithm, but it can actually be performed in any metric space, not only Euclidean. For instance, in a metric described by a graph with m edges, the running time of this construction would be near-linear, Õ(m).

Figure 1: Mean runtime over five runs as we vary k for sensitivity sampling and Fast-Coresets. Bars are k = 50, 100, 200, 400; the y-axis is log-scale.

Figure 3: The results of lightweight and fast-coreset constructions on a 2D Gaussian mixture dataset of n = 100K points with clusters of varying size. The circled cluster has ∼400 points and coresets have 200 points. Left: the original multivariate-Gaussian dataset. Middle: lightweight coresets fail to capture the cluster of ∼400 points. Right: sensitivity sampling with j = k identifies all of the clusters.

Figure 5: Top: coreset distortion on the k-means task in the streaming and non-streaming settings. This is a visualization of the data in Table 5. Bottom: coreset construction runtimes in the streaming and non-streaming settings for the linear and sub-linear complexity coreset algorithms. Bars are [Streaming, Non-Streaming].

5: let c_p be the center of the cell containing p; the i-th coordinate of c_p is ⌊(p_i − c_i)/r⌋ · r + r/2. Add p to D[c_p].

Table 3: Description of the real-world datasets. Unless stated otherwise, our experimental results focus on the k-means task.

Table 4 :
Distortion means and variances for different sample sizes across datasets for k-means; taken over 5 runs. Failure cases (distortion > 5) are bolded. Catastrophic failures (distortion > 10) are underlined.

Table 5 :
Distortion means and variances in the streaming and non-streaming settings for k-means; taken over 5 runs. Failure cases (distortion > 5) are bolded. Catastrophic failures (distortion > 10) are underlined.

Table 6 :
Distortion values for the BICO algorithm in the static and streaming settings, taken over five runs. Failure cases (distortion > 5) are bolded. Catastrophic failures (distortion > 10) are underlined.

Table 7 :
The effect of γ in the Gaussian mixture dataset on the coreset distortion. We report the means over 5 random dataset generations. Each generation had 50 000 points in 50 dimensions, with 50 Gaussian clusters and coresets of size 4 000. We set k = 100.

Table 8 :
cost(P, C_S), where P is the whole dataset and C_S is found via k-means++ [2, 23] (k = 50) and Lloyd's algorithm on the coreset. Sample sizes are m = 4 000 for the first two rows and m = 20 000 for the remaining ones. Initializations are identical within each row. We show the first 3 digits of the cost for readability.

Table 9 :
Distortions for Streamkm++ on artificial datasets.