Abstract
We investigate variants of Lloyd's heuristic for clustering high-dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd's heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd's method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.
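The abstract's main algorithmic idea, a probabilistic seeding step for the starting centers followed by Lloyd-type iterations, can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: it assumes a distance-squared sampling rule for the seeds (one common instantiation of probabilistic seeding) and plain Lloyd iterations afterward.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def seed_centers(points, k, rng):
    """Probabilistic seeding (sketch): pick the first center uniformly at
    random, then pick each subsequent center with probability proportional
    to its squared distance from the nearest center chosen so far."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        weights = [min(dist2(p, c) for c in centers) for p in points]
        r = rng.random() * sum(weights)
        acc = 0.0
        for p, w in zip(points, weights):
            acc += w
            if acc >= r:
                centers.append(list(p))
                break
    return centers

def lloyd(points, centers, iterations=10):
    """Lloyd iterations: assign each point to its nearest center, then
    move each center to the centroid of its assigned points."""
    k = len(centers)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centers[j]))
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # empty clusters keep their old center
                d = len(cluster[0])
                centers[i] = [sum(p[j] for p in cluster) / len(cluster)
                              for j in range(d)]
    return centers
```

On a well-clusterable instance (two tight, well-separated groups), this seeding places the initial centers in distinct clusters with high probability, after which Lloyd's iteration converges quickly to a near-optimal solution, which is the behavior the paper's analysis makes precise for its own seeding variants.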
References
- Aggarwal, A., Deshpande, A., and Kannan, R. 2009. Adaptive sampling for k-means clustering. In Proceedings of the 12th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX 2009). 15--28.
- Ailon, N., Jaiswal, R., and Monteleoni, C. 2009. Streaming k-means approximation. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS). 10--18.
- Alsabti, K., Ranka, S., and Singh, V. 1998. An efficient k-means clustering algorithm. In Proceedings of the 1st Workshop on High Performance Data Mining.
- Arthur, D., Manthey, B., and Röglin, H. 2011. Smoothed analysis of the k-means method. J. ACM 58, 5, 19.
- Arthur, D. and Vassilvitskii, S. 2006. How slow is the k-means method? In Proceedings of the 22nd Annual Symposium on Computational Geometry (SoCG). 144--153.
- Arthur, D. and Vassilvitskii, S. 2007. k-means++: the advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 1027--1035.
- Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., and Pandit, V. 2004. Local search heuristics for k-median and facility location problems. SIAM J. Comput. 33, 544--562.
- Awasthi, P., Blum, A., and Sheffet, O. 2010. Stability yields a PTAS for k-median and k-means clustering. In Proceedings of the 51st Annual Symposium on Foundations of Computer Science (FOCS). 309--318.
- Badoiu, M., Har-Peled, S., and Indyk, P. 2002. Approximate clustering via core-sets. In Proceedings of the 34th Annual Symposium on Theory of Computing (STOC). 250--257.
- Balcan, M., Blum, A., and Gupta, A. 2009. Approximate clustering without the approximation. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 1068--1077.
- Bradley, P. S. and Fayyad, U. 1998. Refining initial points for K-means clustering. In Proceedings of the 15th International Conference on Machine Learning (ICML). 91--99.
- Bunn, P. and Ostrovsky, R. 2007. Secure two-party k-means clustering. In Proceedings of the ACM Conference on Computer and Communications Security. 486--497.
- Charikar, M. and Guha, S. 1999. Improved combinatorial algorithms for the facility location and k-median problems. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS). 378--388.
- Charikar, M., Guha, S., Tardos, É., and Shmoys, D. B. 2002. A constant-factor approximation algorithm for the k-median problem. J. Comput. Syst. Sci. 65, 129--149.
- Chrobak, M., Kenyon, C., and Young, N. 2006. The reverse greedy algorithm for the metric k-median problem. Info. Proc. Lett. 97, 68--72.
- Cox, D. R. 1957. Note on grouping. J. Amer. Stat. Assoc. 52, 543--547.
- Dasgupta, S. 2003. How fast is k-means? In Proceedings of the 16th Annual Conference on Computational Learning Theory (COLT). 735.
- de la Vega, W. F., Karpinski, M., Kenyon, C., and Rabani, Y. 2003. Approximation schemes for clustering problems. In Proceedings of the 35th Annual Symposium on Theory of Computing (STOC). 50--58.
- Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Stat. Soc. B 39, 1--38.
- Drineas, P., Frieze, A., Kannan, R., Vempala, S., and Vinay, V. 2004. Clustering large graphs via the Singular Value Decomposition. Mach. Learn. 56, 9--33.
- Duda, R. O., Hart, P. E., and Stork, D. G. 2000. Pattern Classification. Wiley-Interscience.
- Effros, M. and Schulman, L. J. 2004a. Deterministic clustering with data nets. Tech. rep. TR04-050, Electronic Colloquium on Computational Complexity (ECCC).
- Effros, M. and Schulman, L. J. 2004b. Deterministic clustering with data nets. In Proceedings of the International Symposium on Information Theory (ISIT).
- Fisher, D. 1996. Iterative optimization and simplification of hierarchical clusterings. J. Artif. Intell. Res. 4, 147--178.
- Forgey, E. 1965. Cluster analysis of multivariate data: Efficiency vs. interpretability of classification. Biometrics 21, 768.
- Gersho, A. and Gray, R. M. 1992. Vector Quantization and Signal Compression. Kluwer.
- Gray, R. M. and Neuhoff, D. L. 1998. Quantization. IEEE Trans. Inform. Theory 44, 6, 2325--2384.
- Har-Peled, S. and Mazumdar, S. 2004. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual Symposium on Theory of Computing (STOC). 291--300.
- Har-Peled, S. and Sadri, B. 2005. How fast is the k-means method? Algorithmica 41, 185--202.
- Higgs, R. E., Bemis, K. G., Watson, I. A., and Wikel, J. H. 1997. Experimental designs for selecting molecules from large chemical databases. J. Chem. Inf. Comp. Sci. 37, 861--870.
- Jain, A. K., Murty, M. N., and Flynn, P. J. 1999. Data clustering: A review. ACM Comput. Surv. 31, 3.
- Jain, K., Mahdian, M., Markakis, E., Saberi, A., and Vazirani, V. 2003. Greedy facility location algorithms analyzed using dual-fitting with factor-revealing LP. J. ACM 50, 795--824.
- Jain, K. and Vazirani, V. 2001. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. J. ACM 48, 274--296.
- Kanungo, T., Mount, D., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A. 2004. A local search approximation algorithm for k-means clustering. Comput. Geom. 28, 89--112.
- Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., and Wu, A. Y. 2002. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 24, 881--892.
- Kaufman, L. and Rousseeuw, P. J. 1990. Finding Groups in Data. An Introduction to Cluster Analysis. Wiley.
- Kumar, A., Sabharwal, Y., and Sen, S. 2004. A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual Symposium on Foundations of Computer Science (FOCS). 454--462.
- Linde, Y., Buzo, A., and Gray, R. M. 1980. An algorithm for vector quantization design. IEEE Trans. Commun. COM-28, 84--95.
- Lloyd, S. P. 1982. Least squares quantization in PCM. Special issue on quantization, IEEE Trans. Inform. Theory 28, 129--137.
- MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 281--297.
- Matousek, J. 2000. On approximate geometric k-clustering. Disc. Comput. Geom. 24, 61--84.
- Max, J. 1960. Quantizing for minimum distortion. IEEE Trans. Inform. Theory IT-6, 1, 7--12.
- Meila, M. and Heckerman, D. 1998. An experimental comparison of several clustering and initialization methods. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI). 386--395.
- Mettu, R. R. and Plaxton, C. G. 2004. Optimal time bounds for approximate clustering. Mach. Learn. 56, 35--60.
- Milligan, G. 1980. An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika 45, 325--342.
- Motwani, R. and Raghavan, P. 1995. Randomized Algorithms. Cambridge Univ. Press.
- Nissim, K., Raskhodnikova, S., and Smith, A. 2007. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual Symposium on Theory of Computing (STOC). 75--84.
- Ostrovsky, R. and Rabani, Y. 2002. Polynomial time approximation schemes for geometric clustering problems. J. ACM 49, 2, 139--156.
- Ostrovsky, R., Rabani, Y., Schulman, L., and Swamy, C. 2006. The effectiveness of Lloyd-type methods for the k-means problem. In Proceedings of the 47th Annual Symposium on Foundations of Computer Science (FOCS). 165--174.
- Papadimitriou, C., Raghavan, P., Tamaki, H., and Vempala, S. 2000. Latent semantic indexing: A probabilistic analysis. J. Comput. Syst. Sci. 61, 217--235.
- Pelleg, D. and Moore, A. 1999. Accelerating exact k-means algorithms with geometric reasoning. In Proceedings of the 5th Annual ACM Conference on Knowledge Discovery and Data Mining (KDD). 277--281.
- Pena, J. M., Lozano, J. A., and Larranaga, P. 1999. An empirical comparison of four initialization methods for the k-means algorithm. Patt. Rec. Lett. 20, 1027--1040.
- Phillips, S. J. 2002. Acceleration of k-means and related clustering problems. In Proceedings of the 4th Workshop on Algorithm Engineering and Experiments (ALENEX).
- Schulman, L. J. 2000. Clustering for edge-cost minimization. In Proceedings of the 32nd Annual Symposium on Theory of Computing (STOC). 547--555.
- Snarey, M., Terrett, N. K., Willet, P., and Wilton, D. J. 1997. Comparison of algorithms for dissimilarity-based compound selection. J. Mol. Graph. Model. 15, 372--385.
- Spielman, D. and Teng, S. 2001. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. In Proceedings of the 33rd Annual Symposium on Theory of Computing (STOC). 296--305.
- Steinhaus, H. 1956. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci. Cl. III, vol. IV, 801--804.
- Tryon, R. C. and Bailey, D. E. 1970. Cluster Analysis. McGraw-Hill. 147--150.
- Vattani, A. 2011. k-means requires exponentially many iterations even in the plane. Disc. Comput. Geom. 45, 4, 596--616.