Abstract
Many algorithms for clustering data streams that are based on the widely used k-Means have been proposed in the literature. Most of these algorithms assume that the number of clusters, k, is known and fixed a priori by the user. To relax this assumption, which is often unrealistic in practical applications, we propose a support system that not only estimates the number of clusters automatically from the data but also monitors the data-stream clustering process. We illustrate the potential of the proposed system by means of a prototype that implements eight algorithms for clustering data streams, namely, Stream LSearch-OMRk, Stream LSearch-BkM, Stream LSearch-IOMRk, Stream LSearch-IBkM, CluStream-OMRk, CluStream-BkM, StreamKM++-OMRk, and StreamKM++-BkM. These algorithms combine three state-of-the-art algorithms for clustering data streams with fixed k, namely, Stream LSearch, CluStream, and StreamKM++, with two algorithms for estimating the number of clusters, namely, Ordered Multiple Runs of k-Means (OMRk) and Bisecting k-Means (BkM). We experimentally compare the performance of these algorithms using both synthetic and real-world data streams. Analyses of statistical significance suggest that the algorithms based on OMRk yield the best data partitions, while the algorithms based on BkM are more computationally efficient. Additionally, StreamKM++-OMRk and Stream LSearch-IBkM provide the best trade-off between accuracy and efficiency.
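To make the idea of estimating k by ordered multiple runs of k-Means more concrete, the following is a minimal sketch and not the authors' implementation. It runs k-Means for each k in an ordered range, several times per k, and keeps the partition with the best validity score. The abstract does not state the validity criterion or the initialization scheme, so the simplified silhouette and the deterministic farthest-first seeding used here are assumptions, as are all function names and the toy data chunk.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def farthest_first(points, k, start):
    """Assumed seeding: begin at points[start], then repeatedly add the
    point that is farthest from the centroids chosen so far."""
    centroids = [points[start]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def k_means(points, centroids, max_iter=100):
    """Plain Lloyd iterations; returns (labels, centroids)."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def simplified_silhouette(points, labels, centroids):
    """Assumed validity criterion: silhouette computed from distances to
    centroids instead of distances to all other points (needs k >= 2)."""
    total = 0.0
    for p, lab in zip(points, labels):
        a = dist(p, centroids[lab])
        b = min(dist(p, c) for j, c in enumerate(centroids) if j != lab)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)


def estimate_k(points, k_min=2, k_max=5, n_runs=3):
    """Try every k in [k_min, k_max], n_runs times each, and return the
    (k, labels) pair with the highest validity score."""
    best_score, best_k, best_labels = -1.0, None, None
    for k in range(k_min, k_max + 1):
        for r in range(n_runs):
            start = (r * len(points)) // n_runs  # a different seed point per run
            labels, centroids = k_means(points, farthest_first(points, k, start))
            score = simplified_silhouette(points, labels, centroids)
            if score > best_score:
                best_score, best_k, best_labels = score, k, labels
    return best_k, best_labels


# Hypothetical chunk of a stream: three tight groups of five 2-D points.
offsets = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5), (0.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
chunk = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

k, labels = estimate_k(chunk)
print("estimated k:", k)
```

In a streaming setting, a procedure of this kind would presumably be applied to a summary of the stream (for example, a chunk or a set of micro-clusters) rather than to the raw data, which is where the fixed-k algorithms named in the abstract come in. The exhaustive loop over k also hints at why a bisecting strategy, which grows the partition by splitting one cluster at a time, can be cheaper.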
A Support System for Clustering Data Streams with a Variable Number of Clusters