DOI: 10.1145/1553374.1553511 · ICML '09 Conference Proceedings · Research article

Information theoretic measures for clusterings comparison: is a correction for chance necessary?

ABSTRACT

Information-theoretic measures form a fundamental class of similarity measures for comparing clusterings, alongside the classes of pair-counting and set-matching based measures. In this paper, we discuss the necessity of a correction for chance for information-theoretic clustering comparison measures. We observe that the baseline for such measures, i.e. the average value between random partitions of a data set, does not take on a constant value, and tends to vary more when the ratio between the number of data points and the number of clusters is small. A similar effect appears in some non-information-theoretic measures such as the well-known Rand Index. Assuming a hypergeometric model of randomness, we derive the analytical formula for the expected mutual information between a pair of clusterings, and then propose adjusted versions of several popular information-theoretic measures. Examples are given to demonstrate the need for, and usefulness of, the adjusted measures.
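The correction described in the abstract can be sketched in code. The following is a minimal, illustrative implementation (not the authors' own code): it computes the expected mutual information between two clusterings with fixed cluster sizes under the hypergeometric model, then adjusts the observed MI so that the corrected measure has an expected value of zero for random partitions. The max-entropy normalization used in `adjusted_mutual_information` is one of several possible variants and is an assumption here; the function names are hypothetical.

```python
import math
from collections import Counter

def expected_mutual_information(a, b, n):
    """E[MI] over random contingency tables with row sums a, column sums b,
    and total n, under the hypergeometric model of randomness."""
    emi = 0.0
    for ai in a:
        for bj in b:
            # nij = 0 contributes nothing, so start at max(lower bound, 1)
            for nij in range(max(ai + bj - n, 1), min(ai, bj) + 1):
                # log of the hypergeometric probability of observing nij
                # points shared between a cluster of size ai and one of size bj
                log_p = (math.lgamma(ai + 1) + math.lgamma(bj + 1)
                         + math.lgamma(n - ai + 1) + math.lgamma(n - bj + 1)
                         - math.lgamma(n + 1) - math.lgamma(nij + 1)
                         - math.lgamma(ai - nij + 1) - math.lgamma(bj - nij + 1)
                         - math.lgamma(n - ai - bj + nij + 1))
                emi += math.exp(log_p) * (nij / n) * math.log(n * nij / (ai * bj))
    return emi

def mutual_information(u, v):
    """Observed MI between two label assignments of the same data points."""
    n = len(u)
    cu, cv = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum((c / n) * math.log(n * c / (cu[i] * cv[j]))
               for (i, j), c in joint.items())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def adjusted_mutual_information(u, v):
    """MI corrected for chance: 0 in expectation for random partitions,
    1 for identical partitions (max-entropy normalization assumed)."""
    n = len(u)
    a, b = list(Counter(u).values()), list(Counter(v).values())
    mi = mutual_information(u, v)
    emi = expected_mutual_information(a, b, n)
    return (mi - emi) / (max(entropy(u), entropy(v)) - emi)
```

For identical clusterings the adjusted value is 1 regardless of the number of clusters, whereas the unadjusted MI equals the (cluster-count-dependent) entropy; this is the baseline effect the paper corrects for.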


