Robust and distributed top-n frequent-pattern mining with SAP BW accelerator

Published: 01 August 2009

Abstract

Mining for association rules and frequent patterns is a central activity in data mining. However, most existing algorithms are only moderately suitable for real-world scenarios. Most strategies rely on parameters such as minimum support, for which a suitable value can be very difficult to define on unknown datasets. Since most untrained users are unable or unwilling to set such technical parameters, we address the problem of replacing the minimum-support parameter with top-n strategies. In our paper, we start by extending a top-n implementation of the ECLAT algorithm to improve its performance through heuristic search-strategy optimizations. In addition, real-world datasets are often distributed, and modern database architectures are switching from expensive SMPs to cheaper shared-nothing blade servers. Thus, most mining queries require distribution handling. Since partitioning can be dictated by user-defined semantics, transforming the data is often not permitted. Therefore, we developed an adaptive top-n frequent-pattern mining algorithm that simplifies the mining process on real distributions by relaxing some requirements on the results. We first combine the PARTITION and the TPUT algorithms to handle distributed top-n frequent-pattern mining. Then, we extend this new algorithm for distributions with real-world data characteristics. For frequent-pattern mining algorithms, an even distribution of the data is an important precondition, and tiny partitions can cause performance bottlenecks. Hence, we implemented an approach called MAST that defines a minimum absolute-support threshold. MAST prunes patterns that have a low chance of reaching the global top-n result set and high computing costs. Overall, our approach simplifies the process of frequent-pattern mining for real customer scenarios and datasets. This may make frequent-pattern mining accessible to entirely new user groups.
Finally, we present results of our algorithms when run on the SAP NetWeaver BW Accelerator with standard and real business datasets.
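
To illustrate the idea of replacing a fixed minimum-support parameter with a top-n strategy, the following is a minimal single-node sketch in Python. It is not the paper's implementation: the function name `top_n_itemsets`, the vertical tid-set layout, the min-heap bookkeeping, and the level-first ordering are all assumptions chosen for illustration. The sketch keeps the n best patterns in a min-heap and uses the smallest support in that heap as a threshold that rises during the search, so no support value has to be supplied by the user.

```python
import heapq
from itertools import count


def top_n_itemsets(transactions, n):
    """Return the n most frequent itemsets as (support, itemset) pairs.

    Illustrative sketch only: ECLAT-style tid-set intersection with a
    support threshold that rises as the top-n heap fills up.
    """
    if n <= 0:
        return []

    # Vertical layout: item -> set of ids of the transactions containing it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    heap = []           # min-heap of (support, tiebreak, itemset)
    tiebreak = count()  # keeps heap entries comparable on equal support

    def offer(itemset, support):
        # Keep the pattern only if it belongs to the current top n.
        if len(heap) < n:
            heapq.heappush(heap, (support, next(tiebreak), itemset))
        elif support > heap[0][0]:
            heapq.heapreplace(heap, (support, next(tiebreak), itemset))

    def extend(prefix, candidates):
        # candidates: (item, tidset) pairs sorted by descending support.
        # Offer the whole level first so the threshold rises early.
        for item, tids in candidates:
            offer(prefix + (item,), len(tids))
        for i, (item, tids) in enumerate(candidates):
            # Supersets never have higher support, so once the heap is full
            # and this pattern cannot beat its minimum, neither can the rest.
            if len(heap) == n and len(tids) <= heap[0][0]:
                break
            children = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if common:
                    children.append((other, common))
            children.sort(key=lambda kv: -len(kv[1]))
            if children:
                extend(prefix + (item,), children)

    singles = sorted(tidsets.items(), key=lambda kv: -len(kv[1]))
    extend((), singles)
    return sorted(((s, p) for s, _, p in heap), key=lambda sp: (-sp[0], sp[1]))


if __name__ == "__main__":
    baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}, {"b"}]
    print(top_n_itemsets(baskets, 2))
```

Patterns with equal support at the top-n boundary are resolved arbitrarily in this sketch; a distributed variant, as discussed in the abstract, would additionally need to merge per-partition results.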

Published in: Proceedings of the VLDB Endowment, Volume 2, Issue 2, August 2009, 367 pages
Publisher: VLDB Endowment
