DOI: 10.1145/3219819.3219994 · KDD Conference Proceedings · Research article

Are your data gathered?

ABSTRACT

Understanding data distributions is one of the most fundamental research topics in data analysis. The literature provides many powerful statistical learning algorithms for gaining knowledge about the underlying distribution of multivariate observations: one may discover dependences between features, the emergence of clusters, or the presence of outliers. Before such deep investigations, we propose the folding test of unimodality. As a simple statistical description, it detects whether data are gathered or not (unimodal or multimodal). To the best of our knowledge, this is the first multivariate and purely statistical unimodality test. It makes no distributional assumption and relies only on a straightforward p-value. Through experiments on real-world data, we show its relevance and how it can be useful for clustering.
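To make the idea concrete, below is a minimal univariate sketch of the folding statistic in Python. It assumes the statistic takes the form Phi(X) = (1 + d)^2 * Var(|X - s*|) / Var(X), where the pivot s* minimizes the folded variance; the function name folding_statistic, the bounded pivot search, and the Phi >= 1 decision rule are illustrative choices under that assumption, not the authors' reference implementation, and the published test further calibrates the statistic into a p-value.

import numpy as np
from scipy.optimize import minimize_scalar

def folding_statistic(x):
    # Assumed univariate folding statistic:
    #   Phi(X) = (1 + d)^2 * Var(|X - s*|) / Var(X), with d = 1 here,
    # where the pivot s* minimizes the folded variance Var(|X - s|).
    # Phi >= 1 suggests a unimodal sample, Phi < 1 a multimodal one
    # (the uniform distribution sits exactly at the Phi = 1 boundary).
    x = np.asarray(x, dtype=float)
    folded_var = lambda s: np.var(np.abs(x - s))
    # Search for the pivot within the range of the data.
    pivot = minimize_scalar(folded_var, bounds=(x.min(), x.max()),
                            method="bounded")
    return 4.0 * pivot.fun / np.var(x)

# Illustration: a single Gaussian folded at its pivot keeps most of its
# variance, while a well-separated two-component mixture collapses onto
# one mode when folded at the midpoint, shrinking the folded variance.
rng = np.random.default_rng(42)
unimodal = rng.normal(0.0, 1.0, 2000)
bimodal = np.concatenate([rng.normal(-4.0, 1.0, 1000),
                          rng.normal(4.0, 1.0, 1000)])
print(folding_statistic(unimodal))  # approx. 4 * (1 - 2/pi) ~ 1.45 -> unimodal
print(folding_statistic(bimodal))   # well below 1 -> multimodal

The intuition the ratio captures is that folding a unimodal sample around its pivot preserves spread, whereas folding a multimodal sample maps distant modes onto each other and sharply reduces the variance.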

