ABSTRACT
Understanding data distributions is one of the most fundamental research topics in data analysis. The literature provides many powerful statistical learning algorithms for gaining knowledge about the underlying distribution from multivariate observations. Such algorithms can reveal dependence between features, the appearance of clusters, or the presence of outliers. Before such in-depth investigations, we propose the folding test of unimodality. As a simple statistical description, it detects whether the data are gathered or not (unimodal or multimodal). To the best of our knowledge, this is the first multivariate and purely statistical unimodality test. It makes no distributional assumption and relies only on a straightforward p-value. Through experiments on real-world data, we show its relevance and how it could be useful for clustering.
Are your data gathered?



