ABSTRACT
We present two modifications to the popular k-means clustering algorithm to address the extreme requirements for latency, scalability, and sparsity encountered in user-facing web applications. First, we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent. Second, we achieve sparsity with projected gradient descent, and give a fast ε-accurate projection onto the L1-ball. Source code is freely available: http://code.google.com/p/sofia-ml
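To make the first modification concrete, here is a minimal NumPy sketch of mini-batch k-means: each iteration samples a small batch, caches the nearest center for each sampled example, and then moves each center toward its assigned examples with a per-center learning rate of 1/count. The function name, parameter defaults, and the dense-array representation are illustrative assumptions; the released sofia-ml code is C++ and is engineered for sparse, high-dimensional data.

```python
import numpy as np

def mini_batch_kmeans(X, k, batch_size=100, n_iters=100, seed=0):
    """Sketch of mini-batch k-means (names and defaults are assumptions).

    X          : (n, d) array of examples
    k          : number of clusters
    batch_size : examples sampled per iteration
    n_iters    : number of mini-batch iterations
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialize each center from a randomly chosen example.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # per-center update counts

    for _ in range(n_iters):
        batch = X[rng.choice(n, size=batch_size, replace=False)]
        # Cache the nearest center for every example in the batch.
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        # Per-center gradient step with decaying learning rate 1 / count.
        for x, c in zip(batch, nearest):
            counts[c] += 1
            eta = 1.0 / counts[c]
            centers[c] = (1.0 - eta) * centers[c] + eta * x
    return centers
```

The abstract does not spell out the ε-accurate L1 projection, so the sketch below shows one plausible realization: bisect on the soft threshold θ until the thresholded L1 norm is within ε of the target radius (the exact, sort-based projection is the Duchi et al. method). The function name and tolerance are assumptions.

```python
def project_l1_ball(w, radius, eps=1e-6):
    """Sketch: epsilon-accurate Euclidean projection of w onto the
    L1-ball of the given radius, via bisection on the soft threshold.
    """
    a = np.abs(w)
    if a.sum() <= radius:
        return w.copy()  # already inside the ball
    lo, hi = 0.0, a.max()  # theta = 0 shrinks nothing; theta = max(|w|) zeroes w
    while hi - lo > eps:
        theta = 0.5 * (lo + hi)
        if np.maximum(a - theta, 0.0).sum() > radius:
            lo = theta  # still outside the ball: threshold harder
        else:
            hi = theta  # inside the ball: threshold less
    return np.sign(w) * np.maximum(a - hi, 0.0)  # hi keeps the result feasible

# One way to realize the sparsity modification: after each mini-batch step,
# project every center back onto a small L1-ball, e.g.
#   centers[c] = project_l1_ball(centers[c], radius=1.0)
```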