ABSTRACT
Information-theoretic measures form a fundamental class of similarity measures for comparing clusterings, alongside pair-counting and set-matching measures. In this paper, we discuss the need to correct information-theoretic measures for chance when comparing clusterings. We observe that the baseline for such measures, i.e., the average value between random partitions of a data set, is not constant, and that it tends to vary more when the ratio of the number of data points to the number of clusters is small. A similar effect occurs in some non-information-theoretic measures, such as the well-known Rand Index. Assuming a hypergeometric model of randomness, we derive an analytical formula for the expected mutual information between a pair of clusterings, and then propose adjusted versions of several popular information-theoretic measures. Examples are given to demonstrate the need for, and the usefulness of, the adjusted measures.
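To make the chance correction concrete, the following is a minimal sketch of how the expected mutual information under a hypergeometric (fixed cluster sizes, random assignment) model could be computed and used to adjust the raw mutual information. The function names are illustrative, and the choice of max(H(U), H(V)) as the upper bound in the adjustment is an assumption made here for the example; the abstract does not state which normalization the adjusted measures use.

```python
from collections import Counter
from math import exp, lgamma, log


def log_factorial(n):
    """log(n!) via the log-gamma function, to avoid overflow for large n."""
    return lgamma(n + 1)


def entropy(labels):
    """Shannon entropy (in nats) of the cluster-size distribution."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information (in nats) between two labelings of the same points."""
    n = len(u)
    a, b = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum((nij / n) * log(n * nij / (a[i] * b[j]))
               for (i, j), nij in joint.items())


def expected_mutual_information(u, v):
    """Expected MI when cluster sizes are fixed and points are assigned at random.

    Each contingency-table cell n_ij is then hypergeometric with parameters
    (N, a_i, b_j); the n_ij = 0 term contributes nothing, so the sum starts at 1.
    """
    n = len(u)
    total = 0.0
    for ai in Counter(u).values():
        for bj in Counter(v).values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # log of the hypergeometric probability of observing n_ij
                log_p = (log_factorial(ai) + log_factorial(bj)
                         + log_factorial(n - ai) + log_factorial(n - bj)
                         - log_factorial(n) - log_factorial(nij)
                         - log_factorial(ai - nij) - log_factorial(bj - nij)
                         - log_factorial(n - ai - bj + nij))
                total += (nij / n) * log(n * nij / (ai * bj)) * exp(log_p)
    return total


def adjusted_mutual_information(u, v):
    """(MI - E[MI]) / (upper bound - E[MI]); upper bound assumed to be max(H(U), H(V))."""
    emi = expected_mutual_information(u, v)
    denom = max(entropy(u), entropy(v)) - emi
    if denom <= 0:
        raise ValueError("adjustment is undefined for these clusterings")
    return (mutual_information(u, v) - emi) / denom


u = [0, 0, 1, 1]
print(round(adjusted_mutual_information(u, [0, 0, 1, 1]), 6))  # 1.0
print(round(adjusted_mutual_information(u, [0, 1, 0, 1]), 6))  # -0.5
```

With 20 points, the expected value computed this way is far larger for ten balanced clusters than for two, which illustrates the non-constant baseline: the raw measure rewards clusterings with many clusters even when they are unrelated.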
Index Terms
Information theoretic measures for clusterings comparison


