BIRCH: an efficient data clustering method for very large databases

Published: 01 June 1996

Abstract

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs.

This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively.

We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.
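The abstract does not spell out how points are clustered incrementally in a single scan. As a rough illustration only, the sketch below assumes the additive "clustering feature" summary (point count, linear sum, sum of squared norms) described in the body of the BIRCH paper, and replaces the paper's height-balanced tree with a flat list for brevity. The names `ClusterFeature`, `single_scan`, and `threshold` are hypothetical, and noise handling, memory-driven rebuilding, and the later refinement scans are all omitted.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


def _distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


@dataclass
class ClusterFeature:
    """Additive subcluster summary: count, linear sum, sum of squared norms."""
    n: int
    linear_sum: List[float]
    square_sum: float

    @classmethod
    def from_point(cls, p: Sequence[float]) -> "ClusterFeature":
        return cls(1, list(p), sum(x * x for x in p))

    def merged(self, other: "ClusterFeature") -> "ClusterFeature":
        # Summaries add component-wise, so merging never revisits raw points.
        return ClusterFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.square_sum + other.square_sum,
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def radius(self) -> float:
        # RMS distance of members to the centroid: sqrt(SS/N - ||LS/N||^2).
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.square_sum / self.n - c2, 0.0))


def single_scan(points: Sequence[Sequence[float]],
                threshold: float) -> List[ClusterFeature]:
    """One pass over the data; a flat list stands in for the tree (assumption)."""
    clusters: List[ClusterFeature] = []
    for p in points:
        cf = ClusterFeature.from_point(p)
        if clusters:
            # Find the subcluster whose centroid is closest to the new point.
            i = min(range(len(clusters)),
                    key=lambda k: _distance(clusters[k].centroid(), p))
            candidate = clusters[i].merged(cf)
            # Absorb the point only if the subcluster stays tight enough.
            if candidate.radius() <= threshold:
                clusters[i] = candidate
                continue
        clusters.append(cf)
    return clusters
```

Because each summary is a fixed-size triple, memory use in this sketch depends on the number of subclusters rather than the number of points, which is the property that makes a single scan over a very large dataset plausible. The result still depends on the order in which points arrive, which is consistent with the abstract's note that input order sensitivity is evaluated.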

Published In

ACM SIGMOD Record, Volume 25, Issue 2
June 1996
557 pages
ISSN:0163-5808
DOI:10.1145/235968
  • SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data
    June 1996
    560 pages
    ISBN:0897917944
    DOI:10.1145/233269
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 1996
Published in SIGMOD Volume 25, Issue 2

Qualifiers

  • Article
