DOI: 10.1145/1559795.1559814
research-article

An efficient rigorous approach for identifying statistically significant frequent itemsets

Published: 29 June 2009

ABSTRACT

As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s* for a dataset, such that the number of itemsets with support at least s* represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate.
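The random-dataset baseline described above can be illustrated with a small sketch. It assumes that each item is placed in each transaction independently with probability equal to its observed frequency; the abstract fixes only the number of transactions and the item frequencies, so the independence assumption and the function names `random_dataset` and `count_itemsets_with_support` are illustrative choices here, not taken from the paper.

```python
import random
from itertools import combinations


def random_dataset(num_transactions, item_freqs, rng):
    """Sample a dataset in which each item joins each transaction
    independently with its own frequency (assumed null model)."""
    return [
        frozenset(item for item, freq in item_freqs.items() if rng.random() < freq)
        for _ in range(num_transactions)
    ]


def count_itemsets_with_support(dataset, k, s):
    """Count the k-itemsets that are contained in at least s transactions."""
    support = {}
    for transaction in dataset:
        # Sort so that each itemset has one canonical tuple representation.
        for itemset in combinations(sorted(transaction), k):
            support[itemset] = support.get(itemset, 0) + 1
    return sum(1 for count in support.values() if count >= s)


if __name__ == "__main__":
    rng = random.Random(0)
    freqs = {"a": 0.5, "b": 0.4, "c": 0.1}  # hypothetical item frequencies
    null_data = random_dataset(1000, freqs, rng)
    print(count_itemsets_with_support(null_data, k=2, s=150))
```

Repeating the sampling many times would give an empirical picture of how many itemsets reach a given support purely by chance, which is the quantity the real dataset is compared against.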

Our methodology hinges on a Poisson approximation to the distribution of the number of itemsets in a random dataset with support at least s, for any s greater than or equal to a minimum threshold s_min. We obtain this result through a novel application of the Chen-Stein approximation method, which is of independent interest. Based on this approximation, we develop an efficient parametric multi-hypothesis test for identifying the desired threshold s*. A crucial feature of our approach is that, unlike most previous work, it takes into account the entire dataset rather than individual discoveries. It is therefore better able to distinguish between significant observations and random fluctuations. We present extensive experimental results to substantiate the effectiveness of our methodology.
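As a rough illustration of how a Poisson approximation can be used, the sketch below computes the expected number of k-itemsets with support at least s under an assumed independent-items null model, then the Poisson tail probability of observing at least a given count. This is a simplified single-threshold calculation, not the paper's multi-hypothesis procedure, and it does not check the conditions (such as s being at least the minimum threshold) under which the approximation is valid. All function names are hypothetical.

```python
import math
from itertools import combinations


def binomial_tail(n, p, s):
    """P(X >= s) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(s, n + 1))


def expected_count(num_transactions, item_freqs, k, s):
    """Expected number of k-itemsets with support >= s, assuming items
    occur independently with the given frequencies."""
    total = 0.0
    for itemset in combinations(item_freqs, k):
        # Probability that one transaction contains the whole itemset.
        p = math.prod(item_freqs[item] for item in itemset)
        total += binomial_tail(num_transactions, p, s)
    return total


def poisson_tail(lam, q):
    """P(Y >= q) for Y ~ Poisson(lam)."""
    if q <= 0:
        return 1.0
    term = math.exp(-lam)  # P(Y = 0)
    cdf = 0.0
    for j in range(q):
        cdf += term
        term *= lam / (j + 1)  # P(Y = j + 1) from P(Y = j)
    return max(0.0, 1.0 - cdf)


if __name__ == "__main__":
    freqs = {"a": 0.5, "b": 0.4, "c": 0.1}  # hypothetical item frequencies
    lam = expected_count(100, freqs, k=2, s=30)
    observed = 3  # hypothetical count from the real dataset
    print(lam, poisson_tail(lam, observed))
```

A small tail probability suggests that the observed number of itemsets at support s is unlikely under the null model; scanning over several values of s would additionally require a multiple-testing correction.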

