ABSTRACT
As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s* for a dataset, such that the number of itemsets with support at least s* represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate.
Our methodology hinges on a Poisson approximation to the distribution of the number of itemsets in a random dataset with support at least s, for any s greater than or equal to a minimum threshold s_min. We obtain this result through a novel application of the Chen-Stein approximation method, which is of independent interest. Based on this approximation, we develop an efficient parametric multi-hypothesis test for identifying the desired threshold s*. A crucial feature of our approach is that, unlike most previous work, it takes into account the entire dataset rather than individual discoveries. It is therefore better able to distinguish between significant observations and random fluctuations. We present extensive experimental results to substantiate the effectiveness of our methodology.
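To make the role of the Poisson approximation concrete, the following sketch shows how a Poisson tail probability could be used to scan candidate thresholds. It is a simplified illustration under stated assumptions, not the test developed in this work: the expected counts are taken as given, a plain Bonferroni correction stands in for the multi-hypothesis procedure, and the names `poisson_tail` and `find_threshold` are hypothetical.

```python
import math


def poisson_tail(lam, q):
    """P(X >= q) for X ~ Poisson(lam), computed as 1 - P(X <= q - 1).

    Loses precision for extremely small tail probabilities, which is
    acceptable for a sketch.
    """
    if q <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for j in range(1, q):
        term *= lam / j  # P(X = j) from P(X = j - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def find_threshold(observed, expected, alpha=0.05):
    """Return the smallest candidate s whose observed itemset count is
    significantly above the Poisson mean, or None if no candidate qualifies.

    observed[s]: number of itemsets with support >= s in the real dataset.
    expected[s]: assumed Poisson mean of that count in a random dataset.
    Bonferroni correction (an assumption of this sketch): each candidate
    is tested at level alpha / number of candidates.
    """
    candidates = sorted(observed)
    level = alpha / len(candidates)
    for s in candidates:
        if poisson_tail(expected[s], observed[s]) <= level:
            return s
    return None


# Usage with made-up counts: at s = 10 the observed count matches the mean,
# while at s = 20 it is far above it.
s_star = find_threshold({10: 5, 20: 30}, {10: 5.0, 20: 1.0})
```

Testing the count of itemsets at each threshold, rather than each itemset separately, reflects the dataset-level view described above: one hypothesis per candidate s instead of one per discovered pattern.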
An efficient rigorous approach for identifying statistically significant frequent itemsets