DOI: 10.1145/3448016.3457296
research-article
Open access

Fast Parallel Algorithms for Euclidean Minimum Spanning Tree and Hierarchical Spatial Clustering

Published: 18 June 2021

Abstract

This paper presents new parallel algorithms for generating Euclidean minimum spanning trees and spatial clustering hierarchies (known as HDBSCAN*). Our approach is based on generating a well-separated pair decomposition followed by using Kruskal's minimum spanning tree algorithm and bichromatic closest pair computations. We introduce a new notion of well-separation to reduce the work and space of our algorithm for HDBSCAN*. We also give a new parallel divide-and-conquer algorithm for computing the dendrogram and reachability plots, which are used in visualizing clusters of different scale that arise for both EMST and HDBSCAN*. We show that our algorithms are theoretically efficient: they have work (number of operations) matching their sequential counterparts, and polylogarithmic depth (parallel time).
We implement our algorithms and propose a memory optimization that requires only a subset of well-separated pairs to be computed and materialized, leading to savings in both space (up to 10x) and time (up to 8x). Our experiments on large real-world and synthetic data sets using a 48-core machine show that our fastest algorithms outperform the best serial algorithms for the problems by 11.13--55.89x, and existing parallel algorithms by at least an order of magnitude.
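
To make the Kruskal step mentioned above concrete, here is a minimal sequential sketch in Python. It is not the paper's algorithm: it enumerates all O(n^2) point pairs as candidate edges, which is exactly the cost that the well-separated pair decomposition and bichromatic closest pair computations are meant to avoid, and it runs on a single core. The function name `emst_kruskal` and the sample points are hypothetical, chosen only for illustration.

```python
import itertools
import math


def emst_kruskal(points):
    """Return Euclidean MST edges as (weight, i, j) for a list of coordinate tuples.

    Brute-force baseline (an assumption for illustration): every pair of
    points is a candidate edge, unlike the WSPD-based approach in the paper.
    """
    n = len(points)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Candidate edges: all pairs, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )

    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so Kruskal keeps it
            parent[ri] = rj
            tree.append((w, i, j))
            if len(tree) == n - 1:
                break
    return tree


if __name__ == "__main__":
    sample = [(0, 0), (1, 0), (3, 0), (3, 4)]
    for weight, i, j in emst_kruskal(sample):
        print(i, j, weight)
```

Because Kruskal's algorithm emits edges in nondecreasing weight order, the returned list also gives the merge order of a single-linkage dendrogram over the same points, which is the connection between the EMST and the clustering hierarchy.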

Supplementary Material

MP4 File (3448016.3457296.mp4)
This paper presents new parallel algorithms for generating Euclidean minimum spanning trees and spatial clustering hierarchies (known as HDBSCAN* in the literature). Our approach is based on generating a well-separated pair decomposition (WSPD) followed by using Kruskal's minimum spanning tree algorithm and bichromatic closest pair computations. We introduce a new notion of well-separation to reduce the work and space of our WSPD-based algorithm for HDBSCAN*. We also present a parallel approximate algorithm for HDBSCAN* (and OPTICS) based on a recent sequential algorithm by Gan and Tao. Finally, we give a new parallel divide-and-conquer algorithm for computing the dendrogram and reachability plots, which are used in visualizing clusters of different scale that arise for both EMST and HDBSCAN*.
We present highly-optimized implementations of our algorithms that achieve state-of-the-art performance. For the WSPD-based algorithms, we propose a memory optimization that requires only a subset of all possible well-separated pairs to be computed and materialized, leading to savings in both space (up to 10x) and time (up to 8x). Our experiments on large real-world and synthetic data sets using a 48-core machine show that our fastest algorithms outperform the best serial algorithms for the problems by 11.13--55.89x, and existing parallel algorithms by at least an order of magnitude.
Not only are our algorithms efficient in practice, but we also prove strong theoretical bounds on their work (number of operations) and depth (parallel time). We show that all of our algorithms have polylogarithmic depth, making them highly parallel, and have work bounds that match their sequential counterparts.

References

[1]
[n.d.]. CHEM Dataset. https://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+dynamic+gas+mixtures.
[2]
[n.d.]. GeoLife Dataset. https://www.microsoft.com/en-us/research/publication/geolife-gps-trajectory-dataset-user-guide/.
[3]
[n.d.]. Household Dataset. https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption.
[4]
[n.d.]. HT Dataset. https://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring.
[5]
[n.d.]. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.
[6]
Pankaj K. Agarwal, Herbert Edelsbrunner, Otfried Schwarzkopf, and Emo Welzl. 1991. Euclidean minimum spanning trees and bichromatic closest pairs. Discrete & Computational Geometry (1991), 407--422.
[7]
Mihael Ankerst, Markus Breunig, H. Kriegel, and Jörg Sander. 1999. OPTICS: Ordering Points to Identify the Clustering Structure. In ACM SIGMOD International Conference on Management of Data. 49--60.
[8]
Sunil Arya and David M. Mount. 2016. A Fast and Simple Algorithm for Computing Approximate Euclidean Minimum Spanning Trees. In ACM-SIAM Symposium on Discrete Algorithms. 1220--1233.
[9]
Bentley and Friedman. 1978. Fast Algorithms for Constructing Minimal Spanning Trees in Coordinate Spaces. IEEE Trans. Comput., Vol. C-27, 2 (Feb 1978), 97--105.
[10]
Richard P. Brent. 1974. The Parallel Evaluation of General Arithmetic Expressions. J. ACM, Vol. 21, 2 (April 1974), 201--206.
[11]
Paul B Callahan. 1993. Optimal parallel all-nearest-neighbors using the well-separated pair decomposition. In IEEE Symposium on Foundations of Computer Science (FOCS). 332--340.
[12]
Paul B. Callahan and S. Rao Kosaraju. 1993. Faster Algorithms for Some Geometric Graph Problems in Higher Dimensions. In ACM-SIAM Symposium on Discrete Algorithms. 291--300.
[13]
Paul B. Callahan and S. Rao Kosaraju. 1995. A Decomposition of Multidimensional Point Sets with Applications to k-Nearest-Neighbors and n-Body Potential Fields. J. ACM, Vol. 42, 1 (1995), 67--90.
[14]
Ricardo Campello, Davoud Moulavi, Arthur Zimek, and Jörg Sander. 2015. Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM Transactions on Knowledge Discovery from Data (TKDD), Article 5 (2015), 5:1--5:51 pages.
[15]
Samidh Chatterjee, Michael Connor, and Piyush Kumar. 2010. Geometric Minimum Spanning Trees with GeoFilterKruskal. In International Symposium on Experimental Algorithms (SEA), Vol. 6049. 486--500.
[16]
Danny Z. Chen, Michiel Smid, and Bin Xu. 2005. Geometric Algorithms for Density-Based Data Clustering. International Journal of Computational Geometry & Applications, Vol. 15, 03 (2005), 239--260.
[17]
Richard Cole. 1988. Parallel Merge Sort. SIAM J. Comput., Vol. 17, 4 (Aug. 1988), 770--785.
[18]
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms (3rd ed.). MIT Press.
[19]
Mark de Berg, Ade Gunawan, and Marcel Roeloffzen. 2019. Faster DB-scan and HDB-scan in Low-Dimensional Euclidean Spaces. In International Symposium on Algorithms and Computation (ISAAC). International Journal of Computational Geometry & Applications, Vol. 29, 01, 21--47.
[20]
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 226--231.
[21]
Jordi Fonollosa, Sadique Sheik, Ramón Huerta, and Santiago Marco. 2015. Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring. Sensors and Actuators B: Chemical, Vol. 215 (2015), 618--629.
[22]
Jerome H. Friedman, Jon Louis Bentley, and Raphael Ari Finkel. 1976. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software, Vol. 3, 3 (7 1976), 209--226.
[23]
Harold N. Gabow, Jon L. Bentley, and Robert E. Tarjan. 1984. Scaling and related techniques for geometry problems. In ACM Symposium on Theory of Computing (STOC). 135--143.
[24]
Junhao Gan and Yufei Tao. 2017. On the Hardness and Approximation of Euclidean DBSCAN. ACM Transactions on Database Systems (TODS), Vol. 42, 3 (2017), 14:1--14:45.
[25]
Junhao Gan and Yufei Tao. 2018. Fast Euclidean OPTICS with Bounded Precision in Low Dimensional Space. In ACM SIGMOD International Conference on Management of Data. 1067--1082.
[26]
J. Gil, Y. Matias, and U. Vishkin. 1991. Towards a theory of nearly constant time parallel algorithms. In IEEE Symposium on Foundations of Computer Science (FOCS). 698--710.
[27]
Markus Götz, Christian Bodenstein, and Morris Riedel. 2015. HPDBSCAN: Highly Parallel DBSCAN. In MLHPC. Article 2, 2:1--2:10 pages.
[28]
John C. Gower and Gavin J. S. Ross. 1969. Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), Vol. 18, 1 (1969), 54--64.
[29]
Yan Gu, Julian Shun, Yihan Sun, and Guy E. Blelloch. 2015. A Top-Down Parallel Semisort. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 24--34.
[30]
Ade Gunawan. 2013. A faster algorithm for DBSCAN. Master's thesis, Eindhoven University of Technology.
[31]
William Hendrix, Md Mostofa Ali Patwary, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary. 2012. Parallel hierarchical clustering on shared memory platforms. In International Conference on High Performance Computing. 1--9.
[32]
Xu Hu, Jun Huang, and Minghui Qiu. 2017. A Communication Efficient Parallel DBSCAN Algorithm Based on Parameter Server. In ACM Conference on Information and Knowledge Management (CIKM). 2107--2110.
[33]
Ramón Huerta, Thiago Schiavo Mosqueiro, Jordi Fonollosa, Nikolai F. Rulkov, and Irene Rodríguez-Luján. 2016. Online Humidity and Temperature Decorrelation of Chemical Sensors for Continuous Monitoring. Chemometrics and Intelligent Laboratory Systems, Vol. 157, 169--176.
[34]
Joseph Jaja. 1992. Introduction to Parallel Algorithms. Addison-Wesley Professional.
[35]
Richard M. Karp and Vijaya Ramachandran. 1990. Parallel Algorithms for Shared-Memory Machines. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity (A). MIT Press, 869--941.
[36]
Charles E. Leiserson. 2010. The Cilk++ concurrency platform. J. Supercomputing, Vol. 51, 3 (2010). Springer.
[37]
Alessandro Lulli, Matteo Dell'Amico, Pietro Michiardi, and Laura Ricci. 2016. NG-DBSCAN: Scalable Density-based Clustering for Arbitrary Data. Proc. VLDB Endow., Vol. 10, 3 (Nov. 2016), 157--168.
[38]
William B March, Parikshit Ram, and Alexander G Gray. 2010. Fast Euclidean minimum spanning tree: algorithm, analysis, and applications. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 603--612.
[39]
Leland McInnes and John Healy. 2017. Accelerated hierarchical density clustering. arXiv preprint arXiv:1705.07321 (2017).
[40]
Daniel Müllner. 2011. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378 (2011).
[41]
Giri Narasimhan and Martin Zachariasen. 2001. Geometric Minimum Spanning Trees via Well-Separated Pair Decompositions. ACM Journal of Experimental Algorithmics, Vol. 6 (2001), 6.
[42]
Clark F. Olson. 1995. Parallel algorithms for hierarchical clustering. Parallel Comput., Vol. 21, 8 (1995), 1313--1325.
[43]
Vitaly Osipov, Peter Sanders, and Johannes Singler. 2009. The Filter-Kruskal Minimum Spanning Tree Algorithm. In Workshop on Algorithm Engineering and Experiments (ALENEX). 52--61.
[44]
M. Patwary, D. Palsetia, A. Agrawal, W. K. Liao, F. Manne, and A. Choudhary. 2012. A new scalable parallel DBSCAN algorithm using the disjoint-set data structure. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 1--11.
[45]
M. Patwary, D. Palsetia, A. Agrawal, W. K. Liao, F. Manne, and A. Choudhary. 2013. Scalable parallel OPTICS data clustering using graph algorithmic techniques. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 1--12.
[46]
Jörg Sander, Xuejie Qin, Zhiyong Lu, Nan Niu, and Alex Kovarsky. 2003. Automatic extraction of clusters from hierarchical clustering representations. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. 75--87.
[47]
J. Santos, T. Syed, M. Coelho Naldi, R. J. G. B. Campello, and J. Sander. 2019. Hierarchical Density-Based Clustering using MapReduce. IEEE Transactions on Big Data (2019), 1--1.
[48]
Michael Ian Shamos and Dan Hoey. 1975. Closest-point problems. (1975), 151--162.
[49]
J. Shun and G. E. Blelloch. 2014. A Simple Parallel Cartesian Tree Algorithm and its Application to Parallel Suffix Tree Construction. ACM Transactions on Parallel Computing (TOPC), Vol. 1, 1, Article 8 (Oct. 2014), 8:1--8:20 pages.
[50]
Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, and Phillip B. Gibbons. 2013. Reducing Contention Through Priority Updates. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 152--163.
[51]
Hwanjun Song and J. Lee. 2018. RP-DBSCAN: A Superfast Parallel DBSCAN Algorithm Based on Random Partitioning. In ACM SIGMOD International Conference on Management of Data. 1173--1187.
[52]
Vijay V. Vazirani. 2010. Approximation Algorithms. Springer Publishing Company, Incorporated.
[53]
P. J. Wan, G. Călinescu, X. Y. Li, and O. Frieder. 2002. Minimum-Energy Broadcasting in Static Ad Hoc Wireless Networks. Wireless Networks, Vol. 8, 6 (2002), 607--617.
[54]
Yiqiu Wang, Yan Gu, and Julian Shun. 2020. Theoretically-efficient and practical parallel DBSCAN. In ACM SIGMOD International Conference on Management of Data. 2555--2571.
[55]
Ying Xu, Victor Olman, and Dong Xu. 2001. Minimum Spanning Trees for Gene Expression Data Clustering. Genome Informatics, Vol. 12 (02 2001), 24--33.
[56]
Andrew Chi-Chih. Yao. 1982. On Constructing Minimum Spanning Trees in k-Dimensional Spaces and Related Problems. SIAM J. Comput., Vol. 11, 4 (1982), 721--736.
[57]
Meichen Yu, Arjan Hillebrand, Prejaas Tewarie, Jil Meier, Bob van Dijk, Piet Van Mieghem, and Cornelis Jan Stam. 2015. Hierarchical clustering in minimum spanning trees. Chaos: An Interdisciplinary Journal of Nonlinear Science, Vol. 25, 2 (2015), 023107.
[58]
Yu Zheng, Like Liu, Longhao Wang, and Xing Xie. 2008. Learning Transportation Mode from Raw GPS Data for Geographic Applications on the Web. In International Conference on World Wide Web. 247--256.

Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN: 9781450383431
DOI: 10.1145/3448016
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. parallel algorithms
  3. shared memory algorithms

Qualifiers

  • Research-article

Funding Sources

  • NSF CAREER Award
  • Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA
  • DOE Early Career Award
  • Google Faculty Research Award
  • DARPA SDH Award

Conference

SIGMOD/PODS '21

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 384
  • Downloads (Last 6 weeks): 62
Reflects downloads up to 16 Nov 2024

Cited By

  • (2024) Network Models of BACE-1 Inhibitors: Exploring Structural and Biochemical Relationships. International Journal of Molecular Sciences, 25:13 (6890). DOI: 10.3390/ijms25136890. Online publication date: 23-Jun-2024
  • (2024) PANDORA: A Parallel Dendrogram Construction Algorithm for Single Linkage Clustering on GPU. Proceedings of the 53rd International Conference on Parallel Processing (908-918). DOI: 10.1145/3673038.3673148. Online publication date: 12-Aug-2024
  • (2024) Optimal Parallel Algorithms for Dendrogram Computation and Single-Linkage Clustering. Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures (233-245). DOI: 10.1145/3626183.3659973. Online publication date: 17-Jun-2024
  • (2024) Fast and Memory-Efficient Approximate Minimum Spanning Tree Generation for Large Datasets. Arabian Journal for Science and Engineering. DOI: 10.1007/s13369-024-08974-y. Online publication date: 21-Jun-2024
  • (2023) Fast Parallel Algorithms for Euclidean Minimum Spanning Tree and Hierarchical Spatial Clustering (Abstract). Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing (17-18). DOI: 10.1145/3597635.3598025. Online publication date: 18-Jul-2023
  • (2023) Parallel Filtered Graphs for Hierarchical Clustering. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (1967-1980). DOI: 10.1109/ICDE55515.2023.00153. Online publication date: Apr-2023
  • (2023) ARC–MOF: A Diverse Database of Metal-Organic Frameworks with DFT-Derived Partial Atomic Charges and Descriptors for Machine Learning. Chemistry of Materials, 35:3 (900-916). DOI: 10.1021/acs.chemmater.2c02485. Online publication date: 20-Jan-2023
  • (2023) A Survey on Large Datasets Minimum Spanning Trees. Artificial Intelligence (26-35). DOI: 10.1007/978-3-031-22485-0_3. Online publication date: 1-Jan-2023
  • (2022) A Review: Machine Learning for Combinatorial Optimization Problems in Energy Areas. Algorithms, 15:6 (205). DOI: 10.3390/a15060205. Online publication date: 13-Jun-2022
  • (2022) Advanced Clustering Techniques for Emotional Grouping in Learning Environments Using an AR-Sandbox. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 30:03 (427-442). DOI: 10.1142/S0218488522400141. Online publication date: 22-Jul-2022
