
Selecting the Right Correlation Measure for Binary Data

Abstract

Finding the most interesting correlations among items is essential for problems in many commercial, medical, and scientific domains. Although numerous measures are available for evaluating correlations, different measures can produce drastically different results. Piatetsky-Shapiro proposed three mandatory properties for any reasonable correlation measure, and Tan et al. proposed several properties for categorizing correlation measures; even so, it remains hard for users to choose a correlation measure suited to their needs. To address this problem, we explore measure effectiveness from three directions. First, we propose two desirable properties and two optional properties for correlation measure selection and study which properties different correlation measures satisfy. Second, we study techniques for adjusting correlation measures and propose two new measures: the Simplified χ² with Continuity Correction and the Simplified χ² with Support. Third, we analyze the upper and lower bounds of different measures and categorize them by their bound differences. Combining these three directions, we provide guidelines for users to choose the proper measure according to their needs.
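To make concrete the abstract's claim that different correlation measures can produce drastically different results, the following self-contained sketch (illustrative only, not code from the paper) compares two common measures, lift and the φ coefficient, on the same binary data. All function names here are our own.

```python
# Illustrative sketch: two correlation measures can rank the same pair
# of binary items very differently.
from collections import Counter

def contingency(x, y):
    """Return the 2x2 contingency counts (n11, n10, n01, n00) for two 0/1 lists."""
    c = Counter(zip(x, y))
    return c[(1, 1)], c[(1, 0)], c[(0, 1)], c[(0, 0)]

def lift(x, y):
    """P(x=1, y=1) / (P(x=1) * P(y=1)); 1.0 means independence."""
    n11, n10, n01, n00 = contingency(x, y)
    n = n11 + n10 + n01 + n00
    px, py = (n11 + n10) / n, (n11 + n01) / n
    return (n11 / n) / (px * py)

def phi(x, y):
    """Pearson's phi coefficient for a 2x2 table; in [-1, 1]."""
    n11, n10, n01, n00 = contingency(x, y)
    num = n11 * n00 - n10 * n01
    den = ((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)) ** 0.5
    return num / den

# A rare pair (co-occurring once in ten transactions) versus a common
# pair (co-occurring eight times in ten transactions).
rare = [1] + [0] * 9
common = [1] * 8 + [0] * 2
```

Here lift rates the rare pair far above the common one (10.0 versus 1.25), while φ scores both as perfectly correlated (1.0), so the two measures rank the same pairs in opposite orders of interest.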

References

  1. R. Agrawal, T. Imieliński, and A. Swami. 1993. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’93). ACM, New York, NY, 207--216. DOI: http://dx.doi.org/10.1145/170035.170072
  2. A. Bate, M. Lindquist, I. R. Edwards, S. Olsson, R. Orre, A. Lansner, and R. M. De Freitas. 1998. A Bayesian neural network method for adverse drug reaction signal generation. European Journal of Clinical Pharmacology 54, 4 (1998), 315--321.
  3. S. Brin, R. Motwani, and C. Silverstein. 1997a. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’97). ACM, New York, NY, 265--276. DOI: http://dx.doi.org/10.1145/253260.253327
  4. S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. 1997b. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’97). ACM, New York, NY, 255--264. DOI: http://dx.doi.org/10.1145/253260.253325
  5. A. Clauset, M. E. J. Newman, and C. Moore. 2004. Finding community structure in very large networks. Physical Review E 70, 6 (Dec. 2004), 066111. DOI: http://dx.doi.org/10.1103/PhysRevE.70.066111
  6. L. Duan and W. Nick Street. 2009. Finding maximal fully-correlated itemsets in large databases. In Proceedings of the International Conference on Data Mining (ICDM’09). 770--775.
  7. L. Duan, W. Nick Street, and Y. Liu. 2013. Speeding up correlation search for binary data. Pattern Recognition Letters 34, 13 (2013), 1499--1507. DOI: http://dx.doi.org/10.1016/j.patrec.2013.05.027
  8. L. Duan, W. Nick Street, Y. Liu, and H. Lu. 2014. Community detection in graphs through correlation. In Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (ACM SIGKDD’14).
  9. W. Dumouchel. 1999. Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system. American Statistician 53, 3 (1999), 177--202.
  10. T. Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19, 1 (1993), 61--74.
  11. H. Everett. 1957. ‘Relative State’ formulation of quantum mechanics. Reviews of Modern Physics 29 (1957), 454--462.
  12. L. Geng and H. J. Hamilton. 2006. Interestingness measures for data mining: A survey. Computing Surveys 38, 3 (2006), 9. DOI: http://dx.doi.org/10.1145/1132960.1132963
  13. C. Jermaine. 2005. Finding the most interesting correlations in a database: How hard can it be? Information Systems 30, 1 (2005), 21--46. DOI: http://dx.doi.org/10.1016/j.is.2003.08.004
  14. R. H. Johnson and D. W. Wichern. 2001. Applied Multivariate Statistical Analysis. Prentice Hall.
  15. M. Liu, E. R. M. Hinz, M. E. Matheny, J. C. Denny, J. S. Schildcrout, R. A. Miller, and H. Xu. 2013. Comparative analysis of pharmacovigilance methods in the detection of adverse drug reactions using electronic medical records. Journal of the American Medical Informatics Association 20, 3 (2013), 420--426. DOI: http://dx.doi.org/10.1136/amiajnl-2012-001119
  16. F. Mosteller. 1968. Association and estimation in contingency tables. Journal of the American Statistical Association 63, 321 (1968), 1--28.
  17. P.-N. Tan and V. Kumar. 2000. Interestingness measures for association patterns: A perspective. In Proceedings of the KDD 2000 Workshop on Postprocessing in Machine Learning and Data Mining.
  18. G. N. Norén, A. Bate, J. Hopstadius, K. Star, and I. R. Edwards. 2008. Temporal pattern discovery for trends and transient effects: Its application to patient records. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08). ACM, New York, NY, 963--971. DOI: http://dx.doi.org/10.1145/1401890.1402005
  19. E. R. Omiecinski. 2003. Alternative interest measures for mining associations in databases. IEEE Transactions on Knowledge and Data Engineering 15, 1 (2003), 57--69. DOI: http://dx.doi.org/10.1109/TKDE.2003.1161582
  20. OMOP. 2010. Methods section for the disproportionality paper. (September 2010). http://omop.fnih.org/MethodsLibrary
  21. G. Piatetsky-Shapiro. 1991. Discovery, Analysis, and Presentation of Strong Rules. AAAI/MIT Press, 229--248.
  22. H. T. Reynold. 1977. The Analysis of Cross-Classifications. Free Press.
  23. C. L. Sistrom and C. W. Garvan. 2004. Proportions, odds, and risk. Radiology 230, 1 (2004), 12--19.
  24. P.-N. Tan, V. Kumar, and J. Srivastava. 2004. Selecting the right objective measure for association analysis. Information Systems 29, 4 (2004), 293--313. DOI: http://dx.doi.org/10.1016/S0306-4379(03)00072-3
  25. P.-N. Tan, M. Steinbach, and V. Kumar. 2005. Introduction to Data Mining. Addison Wesley.
  26. C. Tew, C. Giraud-Carrier, K. Tanner, and S. Burton. 2013. Behavior-based clustering and analysis of interestingness measures for association rule mining. Data Mining and Knowledge Discovery (2013), 1--42. DOI: http://dx.doi.org/10.1007/s10618-013-0326-x
  27. H. Xiong, M. Brodie, and S. Ma. 2006a. TOP-COP: Mining top-K strongly correlated pairs in large databases. In Proceedings of the International Conference on Data Mining (ICDM’06). Washington, DC, 1162--1166. DOI: http://dx.doi.org/10.1109/ICDM.2006.161
  28. H. Xiong, S. Shekhar, P.-N. Tan, and V. Kumar. 2006b. TAPER: A two-step approach for all-strong-pairs correlation query in large databases. IEEE Transactions on Knowledge and Data Engineering 18, 4 (2006), 493--508. DOI: http://dx.doi.org/10.1109/TKDE.2006.68
  29. J. Zhang and J. Feigenbaum. 2006. Finding highly correlated pairs efficiently with powerful pruning. In Proceedings of the ACM CIKM International Conference on Information and Knowledge Management (CIKM’06). ACM, New York, NY, 152--161. DOI: http://dx.doi.org/10.1145/1183614.1183640
  30. N. Zhong, C. Liu, and S. Ohsuga. 2001. Dynamically organizing KDD processes. International Journal of Pattern Recognition and Artificial Intelligence 15, 3 (2001), 451--473.
  31. N. Zhong, Y. Y. Yao, and S. Ohsuga. 1999. Peculiarity oriented multi-database mining. In Proceedings of the 3rd European Conference on Principles of Data Mining and Knowledge Discovery (PKDD’99). Springer-Verlag, London, UK, 136--146.


    Reviews

    Amos O Olagunju

Intelligent data mining algorithms call for reliable indicators of relationships in massive datasets. How should correlation measures be selected for the precise analysis of binary data from different problem areas? Duan et al. critique the strengths and weaknesses of numerous correlation statistics. They offer properties that: guarantee the existence of association patterns beyond any doubt and make extremely related item sets noticeable in binary data investigations; validate the accurate estimation of negative correlations; and provide confidence about computed correlations, irrespective of any sample size increase. Computational statisticians ought to read this astounding paper.

    Correlation support is the percentage of item coincidences. Let SDFES be the squared deviation of the fixed support from the expected support. Simplified chi-square, an overall gauge of association among items, is the absolute value of the fixed item set size times SDFES divided by the expected support. The authors present two correlation test statistics: the product of the simplified chi-square and the fixed support, and the product of the fixed item set size and SDFES divided by the sum of the expected support and a continuity correction, which diminishes its instability. They provide formulas for the exact upper and lower bounds of 18 correlation measures and graphically illuminate the bounds for various supports and data sizes.

    Correlated pair and item set search experiments were performed with synthetic patient and Facebook datasets. The average correlation support of the topmost pairs retrieved and the mean average precision were computed. The two proposed test statistics produced reliable search results. However, in correlated item set searches with simulated and Netflix datasets of items, movies, and transactions, the simplified chi-square with fixed support was less accurate in retrieval performance, owing to the uncertainty of the data patterns.

    The authors credibly articulate the germane properties and correlation measures for skillfully probing different binary datasets. Online Computing Reviews Service
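As a rough illustration of the review's verbal formulas, they might be sketched in Python as below. The function names, the reading of SDFES as the squared difference between observed and expected support, and the size and placement of the continuity correction are assumptions based on the review's paraphrase, not the paper's exact definitions.

```python
# Sketch of the measures as paraphrased by the review; all names and the
# exact form of the continuity correction are assumptions, not the
# paper's definitions.

def simplified_chi2(n, observed_support, expected_support):
    """n * SDFES / expected support, where SDFES is the squared
    deviation of the observed support from the expected support."""
    sdfes = (observed_support - expected_support) ** 2
    return n * sdfes / expected_support

def simplified_chi2_with_support(n, observed_support, expected_support):
    # First proposed measure per the review: the product of the
    # simplified chi-square and the observed support.
    return simplified_chi2(n, observed_support, expected_support) * observed_support

def simplified_chi2_with_cc(n, observed_support, expected_support, correction=0.5):
    # Second proposed measure per the review: divide by the expected
    # support plus a continuity correction (value assumed here), which
    # damps the score when the expected support is tiny.
    sdfes = (observed_support - expected_support) ** 2
    return n * sdfes / (expected_support + correction / n)
```

For example, with n = 1000 transactions, observed support 0.02, and expected support 0.01 under independence, the sketch gives a simplified chi-square of 10.0; multiplying by the observed support shrinks it, and the continuity-corrected variant lies just below the uncorrected score, as intended.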
