Abstract
An integrated framework for density-based cluster analysis, outlier detection, and data visualization is introduced in this article. The main module consists of an algorithm to compute hierarchical estimates of the level sets of a density, following Hartigan’s classic model of density-contour clusters and trees. Such an algorithm generalizes and improves existing density-based clustering techniques with respect to different aspects. It provides as a result a complete clustering hierarchy composed of all possible density-based clusters following the nonparametric model adopted, for an infinite range of density thresholds. The resulting hierarchy can be easily processed so as to provide multiple ways for data visualization and exploration. It can also be further postprocessed so that: (i) a normalized score of “outlierness” can be assigned to each data object, which unifies both the global and local perspectives of outliers into a single definition; and (ii) a “flat” (i.e., nonhierarchical) clustering solution composed of clusters extracted from local cuts through the cluster tree (possibly corresponding to different density thresholds) can be obtained, either in an unsupervised or in a semisupervised way. In the unsupervised scenario, the algorithm corresponding to this postprocessing module provides a global, optimal solution to the formal problem of maximizing the overall stability of the extracted clusters. If partially labeled objects or instance-level constraints are provided by the user, the algorithm can solve the problem by considering both constraints violations/satisfactions and cluster stability criteria. An asymptotic complexity analysis, both in terms of running time and memory space, is described. Experiments are reported that involve a variety of synthetic and real datasets, including comparisons with state-of-the-art, density-based clustering and (global and local) outlier detection methods.
- N. Abe, B. Zadrozny, and J. Langford. 2006. Outlier detection by active learning. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD).Philadelphia, PA. 504--509. DOI:http://dx.doi.org/10.1145/1150402.1150459 Google Scholar
Digital Library
- E. Achtert, H.-P. Kriegel, E. Schubert, and A. Zimek. 2013. Interactive data mining with 3D-parallel-coordinate-trees. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). New York City, NY. 1009--1012. DOI:http://dx.doi.org/10.1145/2463676.2463696 Google Scholar
Digital Library
- C. C. Aggarwal and P. S. Yu. 2001. Outlier detection for high dimensional data. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Santa Barbara, CA. 37--46. DOI:http://dx.doi.org/10.1145/375663.375668 Google Scholar
Digital Library
- M. Agyemang, K. Barker, and R. Alhajj. 2006. A comprehensive survey of numeric and symbolic outlier mining techniques. Intelligent Data Analysis 10 (2006), 521--538. Google Scholar
Digital Library
- F. Angiulli and F. Fassetti. 2009. DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets. ACM Transactions on Knowledge Discovery from Data (TKDD) 3, 1 (2009), 4:1--57. DOI:http://dx.doi.org/10.1145/1497577.1497581 Google Scholar
Digital Library
- F. Angiulli and C. Pizzuti. 2002. Fast outlier detection in high dimensional spaces. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD). Helsinki, Finland. 15--26. DOI:http://dx.doi.org/10.1007/3-540-45681-3_2 Google Scholar
Digital Library
- M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. 1999. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Philadelphia, PA. 49--60. DOI:http://dx.doi.org/10.1145/304182.304187. Google Scholar
Digital Library
- K. Bache and M. Lichman. 2013. UCI Machine Learning Repository. (2013). http://archive.ics.uci.edu/ml.Google Scholar
- V. Barnett. 1978. The study of outliers: Purpose and model. Applied Statistics 27, 3 (1978), 242--250.Google Scholar
Cross Ref
- V. Barnett and T. Lewis. 1994. Outliers in Statistical Data (3rd ed.). John Wiley & Sons.Google Scholar
- S. Basu, I. Davidson, and K. Wagstaff (Eds.). 2008. Constraint Clustering: Advances in Algorithms, Applications and Theory. CRC Press, Boca Raton, London, New York. Google Scholar
Digital Library
- S. D. Bay and M. Schwabacher. 2003. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Washington, DC. 29--38. DOI:http://dx.doi.org/10.1145/956750.956758 Google Scholar
Digital Library
- R. J. Beckman and R. D. Cook. 1983. Outlier..........s. Technometrics 25, 2 (1983), 119--149.Google Scholar
- M. Bilenko, S. Basu, and R. J. Mooney. 2004. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the 21st International Conference on Machine Learning (ICML). Banff, AB, Canada. Google Scholar
Digital Library
- C. M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer. Google Scholar
Digital Library
- S. Brecheisen, H.-P. Kriegel, P. Kröger, and M. Pfeifle. 2004. Visually mining through cluster hierarchies. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM). Lake Buena Vista, FL. 400--412.Google Scholar
- M. M. Breunig, H.-P. Kriegel, P. Kröger, and J. Sander. 2001. Data bubbles: Quality preserving performance boosting for hierarchical clustering. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Santa Barbara, CA. 79--90. DOI:http://dx.doi.org/10.1145/375663.375672 Google Scholar
Digital Library
- M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. 2000. LOF: Identifying density-based local outliers. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Dallas, TX. 93--104. DOI:http://dx.doi.org/10.1145/342009.335388 Google Scholar
Digital Library
- C. Böhm and C. Plant. 2008. HISSCLU: A hierarchical density-based method for semi-supervised clustering. In Proceedings of the 11th International Conference on Extending Database Technology (EDBT). Nantes, France. 440--451. DOI:http://dx.doi.org/10.1145/1353343.1353398 Google Scholar
Digital Library
- R. J. G. B. Campello, D. Moulavi, and J. Sander. 2013a. Density-based clustering based on hierarchical density estimates. In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Gold Coast, Australia. 160--172. DOI:http://dx.doi.org/10.1007/978-3-642-37456-2_14Google Scholar
- R. J. G. B. Campello, D. Moulavi, A. Zimek, and J. Sander. 2013b. A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Mining and Knowledge Discovery 27, 3 (2013), 344--371. DOI:http://dx.doi.org/10.1007/s10618-013-0311-4Google Scholar
Cross Ref
- M. H. Chehreghani, H. Abolhassani, and M. H. Chehreghani. 2008. Improving density-based methods for hierarchical clustering of web pages. Data & Knowledge Engineering 67, 1 (2008), 30--50. DOI:http://dx.doi.org/10.1016/j.datak.2008.06.006 Google Scholar
Digital Library
- D. Comaniciu and P. Meer. 1999. Distribution free decomposition of multivariate data. Pattern Analysis and Applications 2, 1 (1999), 22--30. Google Scholar
Digital Library
- D. Coomans and D. L. Massart. 1981. Potential methods in pattern recognition: Part 2. CLUPOT -- An unsupervised pattern recognition technique. Analytica Chimica Acta 133, 3 (1981), 225--239.Google Scholar
Cross Ref
- A. Cuevas, M. Febrero, and R. Fraiman. 2000. Estimating the number of clusters. Canadian Journal of Statistics 28, 2 (2000), 367--382.Google Scholar
Cross Ref
- A. Cuevas, M. Febrero, and R. Fraiman. 2001. Cluster analysis: A further approach based on density estimation. Computational Statistics and Data Analysis 36 (2001), 441--459. DOI:http://dx.doi.org/10.1016/j.csda.2005.10.012 Google Scholar
Digital Library
- X. H. Dang, I. Assent, R. T. Ng, A. Zimek, and E. Schubert. 2014. Discriminative features for identifying and interpreting outliers. In Proceedings of the 30th International Conference on Data Engineering (ICDE), Chicago, IL. 88--99. DOI:http://dx.doi.org/10.1109/ICDE.2014.6816642Google Scholar
- M. Daszykowski, B. Walczak, and D. L. Massart. 2001. Looking for natural patterns in data: Part 1. Density-based approach. Chemometrics and Intelligent Laboratory Systems 56, 2 (2001), 83--92.Google Scholar
Cross Ref
- I. Davidson, K. L. Wagstaff, and S. Basu. 2006. Measuring constraint-set utility for partitional clustering algorithms. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD). Berlin, Germany. 115--126. Google Scholar
Digital Library
- T. de Vries, S. Chawla, and M. E. Houle. 2010. Finding local anomalies in very high dimensional space. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM). Sydney, Australia. 128--137. DOI:http://dx.doi.org/10.1109/ICDM.2010.151 Google Scholar
Digital Library
- T. de Vries, S. Chawla, and M. E. Houle. 2012. Density-preserving projections for large-scale local anomaly detection. Knowledge and Information Systems (KAIS) 32, 1 (2012), 25--52. DOI:http://dx.doi.org/10.1007/s10115-011-0430-4 Google Scholar
Digital Library
- S. Dhandapani, G. Gupta, and J. Ghosh. 2010. Design and Implementation of Scalable Hierarchical Density Based Clustering. Technical Report. Technical Report IDEAL-2010-06, Dept. Electrical and Computer Eng., Univ. of Texas at Austin.Google Scholar
- L. Ertöz, M. Steinbach, and V. Kumar. 2002. A new shared nearest neighbor clustering algorithm and its applications. In Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Conference on Data Mining. 105--115.Google Scholar
- L. Ertöz, M. Steinbach, and V. Kumar. 2003. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proceedings of the 3rd SIAM International Conference on Data Mining (SDM), San Francisco, CA.Google Scholar
- M. Ester. 2009. Density-based Clustering. In Encyclopedia of Database Systems, L. Liu and M. T. Özsu (Eds.). Springer, 795--799. DOI:http://dx.doi.org/10.1007/978-0-387-39940-9_605Google Scholar
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD). Portland, OR. 226--231.Google Scholar
- B. S. Everitt, S. Landau, and M. Leese. 2001. Cluster Analysis (4th ed.). Arnold. Google Scholar
Digital Library
- A. Foss and O. R. Zaïane. 2002. A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM), Maebashi City. Japan. 179--186. Google Scholar
Digital Library
- A. L. N. Fred and A. K. Jain. 2005. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 6 (2005), 835--850. DOI:http://dx.doi.org/10.1109/TPAMI.2005.113 Google Scholar
Digital Library
- K. Fukunaga and L. Hostetler. 1975. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory 21, 1 (1975), 32--40. Google Scholar
Digital Library
- J. Gao and P.-N. Tan. 2006. Converting output scores from outlier detection algorithms into probability estimates. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM). Hong Kong, China. 212--221. DOI:http://dx.doi.org/10.1109/ICDM.2006.43 Google Scholar
Digital Library
- J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders. 2005. The amsterdam library of object images. International Journal of Computer Vision 61, 1 (2005), 103--112. DOI:http://dx.doi.org/10.1023/B:VISI.0000042993.50813.60 Google Scholar
Digital Library
- F. E. Grubbs. 1950. Sample criteria for testing outlying observations. The Annals of Mathematical Statistics 21, 1 (1950), 27--58.Google Scholar
Cross Ref
- G. Gupta, A. Liu, and J. Ghosh. 2006. Hierarchical density shaving: A clustering and visualization framework for large biological datasets. In IEEE ICDM Workshop on Data Mining in Bioinformatics (DMB). Hong Kong, China, 89--93. Google Scholar
Digital Library
- G. Gupta, A. Liu, and J. Ghosh. 2010. Automated hierarchical density shaving: A robust automated clustering and visualization framework for large biological data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 7, 2 (2010), 223--237. DOI:http://dx.doi.org/10.1145/1791396.1791402 Google Scholar
Digital Library
- A. S. Hadi, A. H. M. Rahmatullah Imon, and M. Werner. 2009. Detection of outliers. Wiley Interdisciplinary Reviews: Computational Statistics 1, 1 (2009), 57--70.Google Scholar
Digital Library
- S. Hang, Z. You, and L. Y. Chun. 2009. Incorporating biological knowledge into density-based clustering analysis of gene expression data. In 6th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Vol. 5. Tianjin, China, 52--56. Google Scholar
Digital Library
- J. A. Hanley and B. J. McNeil. 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143 (1982), 29--36.Google Scholar
Cross Ref
- J. A. Hartigan. 1975. Clustering Algorithms. John Wiley & Sons, New York, London, Sydney, Toronto. Google Scholar
Digital Library
- J. A. Hartigan. 1987. Estimation of a convex density contour in two dimensions. J. Amer. Statist. Assoc. 82, 397 (1987), 267--270.Google Scholar
Cross Ref
- D. Hawkins. 1980. Identification of Outliers. Chapman and Hall.Google Scholar
- M. Herbin, N. Bonnet, and P. Vautrot. 2001. Estimation of the number of clusters and influence zones. Pattern Recognition Letters 22, 14 (2001), 1557--1568. DOI:http://dx.doi.org/10.1016/S0167-8655(01)00103-9 Google Scholar
Digital Library
- A. Hinneburg and H. H. Gabriel. 2007. Denclue 2.0: Fast clustering based on kernel density estimation. In Proceedings of the 7th International Symposium on Intelligent Data Analysis (IDA). Ljubljana, Slovenia. 70--80. DOI:http://dx.doi.org/10.1007/978-3-540-74825-0_7 Google Scholar
Digital Library
- A. Hinneburg and D. A. Keim. 1998. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the 4th ACM International Conference on Knowledge Discovery and Data Mining (KDD). New York City, NY. 58--65.Google Scholar
- A. Hinneburg and D. A. Keim. 2003. A general approach to clustering in large databases with noise. Knowledge and Information Systems (KAIS) 5 (2003), 387--415. Issue 4.Google Scholar
Cross Ref
- V. J. Hodge and J. Austin. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22 (2004), 85--126. DOI:http://dx.doi.org/10.1023/B:AIRE.0000045502.10941.a9 Google Scholar
Digital Library
- D. Horta and R. J. G. B. Campello. 2012. Automatic aspect discrimination in data clustering. Pattern Recognition 45, 12 (2012), 4370--4388. DOI:http://dx.doi.org/10.1016/j.patcog.2012.05.011 Google Scholar
Digital Library
- L. Hubert and P. Arabie. 1985. Comparing partitions. Journal of Classification 2, 1 (1985), 193--218.Google Scholar
Cross Ref
- J.-N. Hwang, S.-R. Lay, and A. Lippman. 1994. Nonparametric multivariate density estimation: A comparative study. IEEE Transactions on Signal Processing 42, 10 (1994), 2795--2810. Google Scholar
Digital Library
- A. K. Jain and R. C. Dubes. 1988. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs. Google Scholar
Digital Library
- W. Jin, A. K. Tung, and J. Han. 2001. Mining top-n local outliers in large databases. In Proceedings of the 7th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). San Francisco, CA. 293--298. DOI:http://dx.doi.org/10.1145/502512.502554 Google Scholar
Digital Library
- W. Jin, A. K. H. Tung, J. Han, and W. Wang. 2006. Ranking outliers using symmetric neighborhood relationship. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Singapore. 577--593. DOI:http://dx.doi.org/10.1007/11731139_68 Google Scholar
Digital Library
- S. C. Johnson. 1967. Hierarchical clustering schemes. Psychometrika 32 (1967), 241--254.Google Scholar
Cross Ref
- F. Keller, E. Müller, and K. Böhm. 2012. HiCS: High contrast subspaces for density-based outlier ranking. In Proceedings of the 28th International Conference on Data Engineering (ICDE). Washington, DC. DOI:http://dx.doi.org/10.1109/ICDE.2012.88 Google Scholar
Digital Library
- E. M. Knorr and R. T. Ng. 1997a. A unified approach for mining outliers. In Proceedings of the conference of the Centre for Advanced Studies on Collaborative research (CASCON). Toronto, ON, Canada. 11--23. DOI:http://dx.doi.org/10.1145/782010.782021 Google Scholar
Digital Library
- E. M. Knorr and R. T. Ng. 1997b. A unified notion of outliers: Properties and computation. In Proceedings of the 3rd ACM International Conference on Knowledge Discovery and Data Mining (KDD). Newport Beach, CA. 219--222. DOI:http://dx.doi.org/10.1145/782010.782021Google Scholar
- E. M. Knorr and R. T. Ng. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB). New York City, NY. 392--403. Google Scholar
Digital Library
- E. M. Knorr, R. T. Ng, and V. Tucanov. 2000. Distance-based outliers: Algorithms and applications. The VLDB Journal 8, 3--4 (2000), 237--253. DOI:http://dx.doi.org/10.1007/s007780050006 Google Scholar
Digital Library
- G. Kollios, D. Gunopulos, N. Koudas, and S. Berchthold. 2003. Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE Transactions on Knowledge and Data Engineering 15, 5 (2003), 1170--1187. DOI:http://dx.doi.org/10.1109/TKDE.2003.1232271 Google Scholar
Digital Library
- H.-P. Kriegel, P. Kröger, J. Sander, and A. Zimek. 2011. Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1, 3 (2011), 231--240. DOI:http://dx.doi.org/10.1002/widm.30Google Scholar
Cross Ref
- H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. 2009a. LoOP: Local outlier probabilities. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM). Hong Kong, China. 1649--1652. DOI:http://dx.doi.org/10.1145/1645953.1646195 Google Scholar
Digital Library
- H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. 2009b. Outlier detection in axis-parallel subspaces of high dimensional data. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Bangkok, Thailand. 831--838. DOI:http://dx.doi.org/10.1007/978-3-642-01307-2_86 Google Scholar
Digital Library
- H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. 2011a. Interpreting and unifying outlier scores. In Proceedings of the 11th SIAM International Conference on Data Mining (SDM) Mesa, AZ. 13--24.Google Scholar
- H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. 2012. Outlier detection in arbitrarily oriented subspaces. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM). Brussels, Belgium. 379--388. DOI:http://dx.doi.org/10.1109/ICDM.2012.21 Google Scholar
Digital Library
- H.-P. Kriegel, P. Kröger, and A. Zimek. 2010. Outlier Detection Techniques. Tutorial at the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Washington, DC. (2010).Google Scholar
- H.-P. Kriegel, M. Schubert, and A. Zimek. 2008. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Las Vegas, NV. 444--452. DOI:http://dx.doi.org/10.1145/1401890.1401946 Google Scholar
Digital Library
- B. Larsen and C. Aone. 1999. Fast and effective text mining using linear-time document clustering. In Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). San Diego, CA. 16--22. Google Scholar
Digital Library
- A. Lazarevic and V. Kumar. 2005. Feature bagging for outlier detection. In Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Chicago, IL. 157--166. DOI:http://dx.doi.org/10.1145/1081870.1081891 Google Scholar
Digital Library
- L. Lelis and J. Sander. 2009. Semi-supervised density-based clustering. In Proceedings of the 9th IEEE International Conference on Data Mining (ICDM). Miami, FL. 842--847. DOI:http://dx.doi.org/10.1109/ICDM.2009.143 Google Scholar
Digital Library
- F. T. Liu, K. M. Ting, and Z.-H. Zhou. 2012. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD) 6, 1 (2012), 3:1--39. Google Scholar
Digital Library
- P. Liu, D. Zhou, and N. Wu. 2007. VDBSCAN: Varied density based spatial clustering of applications with noise. In International Conference on Service Systems and Service Management (ICSSSM). Chengdu, China. 1--4.Google Scholar
- B. Micenková, R. T. Ng, X. H. Dang, and I. Assent. 2013. Explaining outliers by subspace separability. In Proceedings of the 13th IEEE International Conference on Data Mining (ICDM). Dallas, TX. 518--527.Google Scholar
- D. W. Muller and G. Sawitzki. 1991. Excess mass estimates and tests for multimodality. J. Amer. Statist. Assoc. 86, 415 (1991), 738--746.Google Scholar
- E. Müller, I. Assent, P. Iglesias, Y. Mülle, and K. Böhm. 2012. Outlier ranking via subspace analysis in multiple views of the data. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM). Brussels, Belgium. 529--538. DOI:http://dx.doi.org/10.1109/ICDM.2012.112 Google Scholar
Digital Library
- E. Müller, I. Assent, U. Steinhausen, and T. Seidl. 2008. OutRank: Ranking outliers in high dimensional data. In Proceedings of the 24th International Conference on Data Engineering (ICDE) Workshop on Ranking in Databases (DBRank). Cancun, Mexico. 600--603. DOI:http://dx.doi.org/10.1109/ICDEW.2008.4498387 Google Scholar
Digital Library
- E. Müller, M. Schiffer, and T. Seidl. 2010. Adaptive outlierness for subspace outlier ranking. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM). Toronto, ON, Canada. 1629--1632. DOI:http://dx.doi.org/10.1145/1871437.1871690 Google Scholar
Digital Library
- E. Müller, M. Schiffer, and T. Seidl. 2011. Statistical selection of relevant subspace projections for outlier ranking. In Proceedings of the 27th International Conference on Data Engineering (ICDE). Hannover, Germany. 434--445. DOI:http://dx.doi.org/10.1109/ICDE.2011.5767916 Google Scholar
Digital Library
- M. C. Naldi, R. J. G. B. Campello, E. R. Hruschka, and A. C. P. L. F. Carvalho. 2011. Efficiency issues of evolutionary k-means. Applied Soft Computing 11, 2 (2011), 1938--1952. Google Scholar
Digital Library
- H. V. Nguyen, H. H. Ang, and V. Gopalkrishnan. 2010. Mining outliers with ensemble of heterogeneous detectors on random subspaces. In Proceedings of the 15th International Conference on Database Systems for Advanced Applications (DASFAA). Tsukuba, Japan. 368--383. DOI:http://dx.doi.org/10.1007/978-3-642-12026-8_29 Google Scholar
Digital Library
- H. V. Nguyen and V. Gopalkrishnan. 2009. Efficient pruning schemes for distance-based outlier detection. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD). Bled, Slovenia. 160--175. DOI:http://dx.doi.org/10.1007/978-3-642-04174-7_11 Google Scholar
Digital Library
- H. V. Nguyen, V. Gopalkrishnan, and I. Assent. 2011. An unbiased distance-based outlier detection approach for high-dimensional data. In Proceedings of the 16th International Conference on Database Systems for Advanced Applications (DASFAA). Hong Kong, China. 138--152. DOI:http://dx.doi.org/10.1007/978-3-642-20149-3_12 Google Scholar
Digital Library
- G. H. Orair, C. Teixeira, Y. Wang, W. Meira Jr., and S. Parthasarathy. 2010. Distance-based outlier detection: Consolidation and renewed bearing. Proceedings of the VLDB Endowment 3, 2 (2010), 1469--1480. Google Scholar
Digital Library
- S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. 2003. LOCI: Fast outlier detection using the local correlation integral. In Proceedings of the 19th International Conference on Data Engineering (ICDE). Bangalore, India. 315--326. DOI:http://dx.doi.org/10.1109/ICDE.2003.1260802Google Scholar
- D. Pascual, F. Pla, and J. S. Sánchez. 2006. Non parametric local density-based clustering for multimodal overlapping distributions. In Proceedings of the 7th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL). Burgos, Spain. 671--678. Google Scholar
Digital Library
- E. S. Pearson and C. Chandra Sekar. 1936. The efficiency of statistical tools and a criterion for the rejection of outlying observations. Biometrika 28, 3/4 (1936), 308--320.Google Scholar
Cross Ref
- T. Pei, A. Jasra, D. J. Hand, A.-X. Zhu, and C. Zhou. 2009. DECODE: A new method for discovering clusters of different densities in spatial data. Data Mining and Knowledge Discovery 18, 3 (2009), 337--369. DOI:http://dx.doi.org/10.1007/s10618-008-0120-3 Google Scholar
Digital Library
- T. Pei, A.-X. Zhu, C. Zhou, B. Li, and C. Qin. 2006. A new approach to the nearest-neighbour method to discover cluster features in overlaid spatial point processes. International Journal of Geographical Information Science 20, 2 (2006), 153--168.Google Scholar
Cross Ref
- N. Pham and R. Pagh. 2012. A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. In Proceedings of the 18th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Beijing, China. Google Scholar
Digital Library
- S. Ramaswamy, R. Rastogi, and K. Shim. 2000. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Dallas, TX. 427--438. DOI:http://dx.doi.org/10.1145/342009.335437 Google Scholar
Digital Library
- C. Ruiz, M. Spiliopoulou, and E. Menasalvas. 2007. C-DBSCAN: Density-based clustering with constraints. In Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, A. An, J. Stefanowski, S. Ramanna, C. Butz, W. Pedrycz, and G. Wang (Eds.). 216--223. Google Scholar
Digital Library
- C. Ruiz, M. Spiliopoulou, and E. Menasalvas. 2010. Density-based semi-supervised clustering. Data Mining and Knowledge Discovery 21, 3 (2010), 345--370. Google Scholar
Digital Library
- J. Sander. 2010. Density-based clustering. In Encyclopedia of Machine Learning, C. Sammut and G. I. Webb (Eds.). Springer, US, 270--273.Google Scholar
- J. Sander, X. Qin, Z. Lu, N. Niu, and A. Kovarsky. 2003. Automatic extraction of clusters from hierarchical clustering representations. In Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Seoul, Korea. 75--87. DOI:http://dx.doi.org/10.1007/3-540-36175-8_8 Google Scholar
Digital Library
- E. Schubert, R. Wojdanowski, A. Zimek, and H.-P. Kriegel. 2012. On evaluation of outlier rankings and outlier scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM). Anaheim, CA. 1047--1058.Google Scholar
Cross Ref
- E. Schubert, A. Zimek, and H.-P. Kriegel. 2014a. Generalized outlier detection with flexible kernel density estimates. In Proceedings of the 14th SIAM International Conference on Data Mining (SDM). Philadelphia, PA. 542--550. DOI:http://dx.doi.org/10.1137/1.9781611973440.63Google Scholar
Cross Ref
- E. Schubert, A. Zimek, and H.-P. Kriegel. 2014b. Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery 28, 1 (2014), 190--237. DOI:http://dx.doi.org/10.1007/s10618-012-0300-z Google Scholar
Digital Library
- B. W. Silverman. 1986. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC.Google Scholar
- P. H. A. Sneath. 1957. The application of computers to taxonomy. Journal of General Microbiology 17 (1957), 201--226.Google Scholar
Cross Ref
- T. Soler and M. Chin. 1985. On transformation of covariance matrices between local Cartesian coordinate systems and commutative diagrams. In ASP-ACSM Convention. 393--406.Google Scholar
- W. Stuetzle. 2003. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification 20, 1 (2003), 25--47. DOI:http://dx.doi.org/10.1007/s00357-003-0004-6Google Scholar
Cross Ref
- W. Stuetzle and R. Nugent. 2010. A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics 19, 2 (2010), 397--418.Google Scholar
Cross Ref
- H. Sun, J. Huang, J. Han, H. Deng, P. Zhao, and B. Feng. 2010. gSkeletonClu: Density-based network clustering via structure-connected tree division or agglomeration. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM). Sydney, Australia. 481--490. Google Scholar
Digital Library
- P.-N. Tan, M. Steinbach, and V. Kumar. 2006. Introduction to Data Mining. Addison Wesley. Google Scholar
Digital Library
- J. Tang, Z. Chen, A. W.-C. Fu, and D. W. Cheung. 2002. Enhancing effectiveness of outlier detections for low density patterns. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Taipei, Taiwan. 535--548. DOI:http://dx.doi.org/10.1007/3-540-47887-6_53 Google Scholar
Digital Library
- K. L. Wagstaff. 2002. Intelligent Clustering with Instance-Level Constraints. Ph.D. Dissertation. Department of Computer Science, Cornell University. Google Scholar
Digital Library
- D. Wishart. 1969. Mode analysis: A generalization of nearest neighbor which reduces chaining effects. In Numerical Taxonomy, A. J. Cole (Ed.). 282--311.Google Scholar
- M. A. Wong and T. Lane. 1983. A kth nearest neighbour clustering procedure. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 45, 3 (1983), 362--368.Google Scholar
- R. Xu and D. Wunsch II. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks 16 (2005), 645--678. DOI:http://dx.doi.org/10.1109/TNN.2005.845141 Google Scholar
Digital Library
- R. Xu and D. Wunsch II. 2009. Clustering. IEEE Press. Google Scholar
Digital Library
- J. Yang, N. Zhong, Y. Yao, and J. Wang. 2008. Local peculiarity factor and its application in outlier detection. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Las Vegas, NV. 776--784. DOI:http://dx.doi.org/10.1145/1401890.1401983 Google Scholar
Digital Library
- K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, and W. L. Ruzzo. 2001. Model-based clustering and data transformations for gene expression data. Bioinformatics 17, 10 (2001), 977--987.Google Scholar
Cross Ref
- K. Y. Yeung, M. Medvedovic, and R. E. Bumgarner. 2003. Clustering gene-expression data with repeated measurements. Genome Biology 4, 5 (2003), R34.1--R34.17.Google Scholar
Cross Ref
- K. Zhang, M. Hutter, and H. Jin. 2009. A new local distance-based outlier detection approach for scattered real-world data. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Bangkok, Thailand. 813--822. DOI:http://dx.doi.org/10.1007/978-3-642-01307-2_84 Google Scholar
Digital Library
- T. Zhang, R. Ramakrishnan, and M. Livny. 1996. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). Montreal, QC, Canada. 103--114. DOI:http://dx.doi.org/10.1145/233269.233324 Google Scholar
Digital Library
- A. Zimek, R. J. G. B. Campello, and J. Sander. 2013. Ensembles for unsupervised outlier detection: Challenges and research questions. ACM SIGKDD Explorations 15, 1 (2013), 11--22. Google Scholar
Digital Library
- A. Zimek, R. J. G. B. Campello, and J. Sander. 2014. Data perturbation for outlier detection ensembles. In Proceedings of the 26th International Conference on Scientific and Statistical Database Management (SSDBM). Aalborg, Denmark. 13:1--12. DOI:http://dx.doi.org/10.1145/2618243.2618257 Google Scholar
Digital Library
- A. Zimek, M. Gaudet, R. J. G. B. Campello, and J. Sander. 2013a. Subsampling for efficient and effective unsupervised outlier detection ensembles. In Proceedings of the 19th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). Chicago, IL. 428--436. DOI:http://dx.doi.org/10.1145/2487575.2487676 Google Scholar
Digital Library
- A. Zimek, E. Schubert, and H.-P. Kriegel. 2012. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining 5, 5 (2012), 363--387. DOI:http://dx.doi.org/10.1002/sam.11161 Google Scholar
Digital Library
Index Terms
Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection
Recommendations
Density-based hierarchical clustering for streaming data
For streaming data that arrive continuously such as multimedia data and financial transactions, clustering algorithms are typically allowed to scan the data set only once. Existing research in this domain mainly focuses on improving the accuracy of ...
Initialization of K-modes clustering using outlier detection techniques
We considered the initialization of K-modes clustering from the view of outlier detection.We proposed an initialization algorithm for K-modes clustering via the distance-based outlier detection technique.We presented a partition entropy-based outlier ...
Density-based semi-supervised clustering
Semi-supervised clustering methods guide the data partitioning and grouping process by exploiting background knowledge, among else in the form of constraints. In this study, we propose a semi-supervised density-based clustering method. Density-based ...





Comments