
Parallel Massive Clustering of Discrete Distributions

Published: 02 June 2015

Abstract

The trend of analyzing big data in artificial intelligence demands highly scalable machine learning algorithms, among which clustering is a fundamental and arguably the most widely applied method. To extend the applications of regular vector-based clustering algorithms, the Discrete Distribution (D2) clustering algorithm was developed to cluster data represented by bags of weighted vectors, which are widely adopted data signatures in many emerging information retrieval and multimedia learning applications. However, the high computational complexity of D2-clustering limits its impact in solving massive learning problems. Here we present the parallel D2-clustering (PD2-clustering) algorithm with substantially improved scalability. We developed a hierarchical multipass algorithm structure for parallel computing in order to balance the individual-node computation against the integration process of the algorithm. Experiments and extensive comparisons between PD2-clustering and other clustering algorithms are conducted on synthetic datasets. The results show that the proposed parallel algorithm achieves significant speed-up with minor accuracy loss. We apply PD2-clustering to image concept learning. In addition, by extending D2-clustering to symbolic data, we apply PD2-clustering to protein sequence clustering. For both applications, we demonstrate the high competitiveness of our new algorithm in comparison with other state-of-the-art methods.
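The hierarchical multipass idea can be illustrated with a minimal Python sketch. It is not the PD2-clustering implementation: the per-segment clusterer `toy_local_cluster` is a hypothetical stand-in (plain 1-D k-means on floats rather than D2-clustering of weighted-vector bags), the names `multipass_cluster` and `segment_size` are assumptions made for illustration, and centroids passed to the next level are left unweighted for brevity. The sketch only shows the control flow suggested by the abstract: segments are clustered independently (here in a thread pool, standing in for individual nodes), and their centroids are merged and clustered again until a single segment remains.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_local_cluster(points, k):
    """Hypothetical stand-in for D2-clustering: 1-D Lloyd k-means on floats.

    Returns a list of at most k centroids.
    """
    pts = sorted(points)
    k = min(k, len(pts))
    # Deterministic initialisation: evenly spaced order statistics.
    centroids = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(20):
        groups = [[] for _ in centroids]
        for x in pts:
            # Assign each point to its nearest centroid.
            j = min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))
            groups[j].append(x)
        updated = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def multipass_cluster(points, k, segment_size, local_cluster=toy_local_cluster):
    """Divide-and-merge sketch (assumed structure, not the authors' code).

    Each pass splits the current level into segments, clusters every segment
    independently, and forwards the centroids to the next pass.
    """
    if k >= segment_size:
        raise ValueError("k must be smaller than segment_size so each pass shrinks the data")
    level = list(points)
    while len(level) > segment_size:
        segments = [level[i:i + segment_size]
                    for i in range(0, len(level), segment_size)]
        # Individual-node computation: segments are independent of each other.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda seg: local_cluster(seg, k), segments))
        # Integration step: centroids become the data of the next pass.
        level = [c for centroids in results for c in centroids]
    return sorted(local_cluster(level, k))


data = [0, 100, 200, 1, 101, 201, 2, 102, 202, 3, 103, 203]
print(multipass_cluster(data, k=3, segment_size=6))  # [1.5, 101.5, 201.5]
```

In this toy run the twelve points are split into two segments of six, each segment yields three centroids, and the six centroids are clustered once more. The `segment_size` parameter is where the trade-off between per-node work and the cost of the integration passes would show up: smaller segments mean cheaper local jobs but more levels to merge.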


