research-article

A Support System for Clustering Data Streams with a Variable Number of Clusters

Published: 25 July 2016

Abstract

Many algorithms for clustering data streams that are based on the widely used k-Means have been proposed in the literature. Most of these algorithms assume that the number of clusters, k, is known and fixed a priori by the user. To relax this assumption, which is often unrealistic in practical applications, we propose a support system that not only estimates the number of clusters automatically from the data but also monitors the data-stream clustering process. We illustrate the potential of the proposed system by means of a prototype that implements eight algorithms for clustering data streams, namely, Stream LSearch-OMRk, Stream LSearch-BkM, Stream LSearch-IOMRk, Stream LSearch-IBkM, CluStream-OMRk, CluStream-BkM, StreamKM++-OMRk, and StreamKM++-BkM. These algorithms combine three state-of-the-art algorithms for clustering data streams with fixed k, namely, Stream LSearch, CluStream, and StreamKM++, with two algorithms for estimating the number of clusters: Ordered Multiple Runs of k-Means (OMRk) and Bisecting k-Means (BkM). We experimentally compare the performance of these algorithms using both synthetic and real-world data streams. Analyses of statistical significance suggest that the algorithms based on OMRk yield the best data partitions, while the algorithms based on BkM are more computationally efficient. Additionally, StreamKM++-OMRk and Stream LSearch-IBkM provide the best tradeoff between accuracy and efficiency.
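The general idea behind estimating k with ordered multiple runs of k-Means can be illustrated with a small sketch. It is not the prototype described above: it works on a static batch of points rather than a stream, and the function names (`kmeans`, `simplified_silhouette`, `estimate_k`), the parameters `k_max` and `runs`, and the use of the simplified silhouette as the validity criterion are all assumptions made for illustration, since the abstract does not specify these details. The sketch runs k-Means several times for each k from 2 to `k_max` and keeps the partition with the highest score.

```python
import math
import random


def nearest(point, centers):
    """Return the index of the center closest to point."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd-style k-means with random initial centers."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[j])  # empty cluster: keep old center
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels


def simplified_silhouette(points, centers, labels):
    """Mean of (b - a) / max(a, b), where a is the distance to the own
    center and b is the distance to the nearest other center."""
    total = 0.0
    for p, lab in zip(points, labels):
        a = math.dist(p, centers[lab])
        b = min(math.dist(p, c) for j, c in enumerate(centers) if j != lab)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


def estimate_k(points, k_max, runs=10, seed=0):
    """Try k = 2..k_max in increasing order, several runs per k, and
    return (score, k, centers, labels) of the best-scoring partition."""
    rng = random.Random(seed)
    best = (-math.inf, None, None, None)
    for k in range(2, k_max + 1):
        for _ in range(runs):
            centers, labels = kmeans(points, k, rng)
            score = simplified_silhouette(points, centers, labels)
            if score > best[0]:  # ties keep the smaller k found first
                best = (score, k, centers, labels)
    return best


# Toy data (hypothetical): three tight, well-separated groups of nine points.
offsets = (-0.5, 0.0, 0.5)
points = [(cx + dx, cy + dy)
          for cx, cy in [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
          for dx in offsets for dy in offsets]
score, k, centers, labels = estimate_k(points, k_max=5, runs=20)
print(f"estimated k = {k}, simplified silhouette = {score:.3f}")
```

Repeating k-Means for every candidate k is what makes this style of estimation comparatively expensive, which is consistent with the abstract's observation that the BkM-based variants are more computationally efficient.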

    • Published in

      ACM Transactions on Autonomous and Adaptive Systems  Volume 11, Issue 2
      Special Section on Best Papers from SASO 2014 and Regular Articles
      July 2016
      267 pages
      ISSN:1556-4665
      EISSN:1556-4703
      DOI:10.1145/2952298

      Copyright © 2016 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 25 July 2016
      • Accepted: 1 June 2014
      • Revised: 1 February 2014
      • Received: 1 September 2013

      Qualifiers

      • research-article
      • Research
      • Refereed
