ABSTRACT
Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used data clustering method for unsupervised learning tasks. Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. New lower bounds for the K-means objective function are derived, given by the total variance minus the eigenvalues of the data covariance matrix. These results indicate that unsupervised dimension reduction is closely related to unsupervised learning. Several implications are discussed. On dimension reduction, the result provides new insights into the observed effectiveness of PCA-based data reductions, beyond the conventional noise-reduction explanation that PCA, via singular value decomposition, provides the best low-dimensional linear approximation of the data. On learning, the result suggests effective techniques for K-means data clustering. DNA gene expression and Internet newsgroup data are analyzed to illustrate our results. Experiments indicate that the new bounds are within 0.5-1.5% of the optimal values.
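The lower bound described in the abstract can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes the K-means objective is the unnormalized within-cluster sum of squares, so the eigenvalues used are those of the scatter matrix of the centered data (the covariance matrix scaled by the number of points), and that the K-1 largest eigenvalues are the ones subtracted. The function names `kmeans_objective`, `pca_lower_bound`, and `pc1_split` are hypothetical, and the synthetic two-blob data is for illustration only.

```python
import numpy as np

def kmeans_objective(X, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

def pca_lower_bound(X, n_clusters):
    """Total scatter minus the (K-1) largest eigenvalues of the scatter matrix.

    Assumption: the scatter matrix is Y^T Y for centered data Y, i.e. the
    covariance matrix without the 1/n factor, matching the objective above.
    """
    Y = X - X.mean(axis=0)                       # center the data
    scatter = Y.T @ Y                            # d x d scatter matrix
    eigvals = np.linalg.eigvalsh(scatter)[::-1]  # eigenvalues, descending
    return (Y ** 2).sum() - eigvals[: n_clusters - 1].sum()

def pc1_split(X):
    """For K = 2, assign points by the sign of their first principal component score."""
    Y = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return (Y @ Vt[0] > 0).astype(int)

# Synthetic data (illustrative only): two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-3.0, 1.0, size=(50, 5)),
    rng.normal(+3.0, 1.0, size=(50, 5)),
])

labels = pc1_split(X)
J = kmeans_objective(X, labels)
bound = pca_lower_bound(X, n_clusters=2)
print(f"K-means objective of PC1 split: {J:.2f}")
print(f"PCA lower bound:                {bound:.2f}")
```

Under these assumptions the bound holds for any assignment of points to clusters, so it can be used to judge how far a given K-means solution is from optimal without knowing the optimum itself.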
Index Terms
(auto-classified) K-means clustering via principal component analysis



