Abstract
The trend toward analyzing big data in artificial intelligence demands highly scalable machine learning algorithms, among which clustering is a fundamental and arguably the most widely applied method. To extend the applications of regular vector-based clustering algorithms, the Discrete Distribution (D2) clustering algorithm was developed to cluster data represented by bags of weighted vectors, which are widely adopted data signatures in many emerging information retrieval and multimedia learning applications. However, the high computational complexity of D2-clustering limits its impact on massive learning problems. Here we present the parallel D2-clustering (PD2-clustering) algorithm, which has substantially improved scalability. We develop a hierarchical multipass algorithm structure for parallel computing in order to balance the individual-node computation against the integration process of the algorithm. Experiments and extensive comparisons between PD2-clustering and other clustering algorithms are conducted on synthetic datasets. The results show that the proposed parallel algorithm achieves significant speed-up with minor accuracy loss. We apply PD2-clustering to image concept learning. In addition, by extending D2-clustering to symbolic data, we apply PD2-clustering to protein sequence clustering. For both applications, we demonstrate that the new algorithm is highly competitive with other state-of-the-art methods.
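The hierarchical multipass structure mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each pass splits the data into fixed-size segments, clusters every segment independently, and forwards the resulting weighted centroids to the next pass. It also swaps D2-clustering for a plain weighted 1-D k-means so that the example stays short. The names `local_kmeans`, `multipass_cluster`, `segment_size`, and `workers` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def local_kmeans(points, weights, k, iters=20):
    """Weighted 1-D k-means: a simple stand-in for the per-segment
    clustering step (the paper uses D2-clustering here, not k-means)."""
    k = min(k, len(points))
    # Deterministic initialization: pick evenly spaced quantiles.
    order = sorted(points)
    centroids = [order[(2 * j + 1) * len(order) // (2 * k)] for j in range(k)]
    mass = [0.0] * k
    for _ in range(iters):
        sums = [0.0] * k
        mass = [0.0] * k
        for x, w in zip(points, weights):
            j = min(range(k), key=lambda c: (x - centroids[c]) ** 2)
            sums[j] += w * x
            mass[j] += w
        # Keep the old centroid if a cluster received no weight.
        centroids = [
            sums[j] / mass[j] if mass[j] > 0 else centroids[j] for j in range(k)
        ]
    # Return each non-empty centroid with the weight assigned in the last pass.
    return [(c, m) for c, m in zip(centroids, mass) if m > 0]


def multipass_cluster(points, k, segment_size=50, workers=4):
    """Assumed multipass scheme: cluster segments independently, then
    re-cluster the weighted centroids until they fit in one segment."""
    if not 0 < k < segment_size:
        raise ValueError("k must satisfy 0 < k < segment_size")
    points = list(points)
    weights = [1.0] * len(points)
    while len(points) > segment_size:
        segments = [
            (points[i:i + segment_size], weights[i:i + segment_size])
            for i in range(0, len(points), segment_size)
        ]
        # Threads only illustrate the independence of the segments; a real
        # deployment would use separate processes or cluster nodes.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(
                pool.map(lambda s: local_kmeans(s[0], s[1], k), segments)
            )
        merged = [pair for segment in results for pair in segment]
        points = [c for c, _ in merged]
        weights = [m for _, m in merged]
    return local_kmeans(points, weights, k)


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    data = [rng.gauss(mu, 0.5) for mu in (0.0, 10.0, 20.0) for _ in range(100)]
    rng.shuffle(data)
    for centroid, weight in sorted(multipass_cluster(data, k=3)):
        print(f"centroid={centroid:.2f} weight={weight:.0f}")
```

In this sketch, a smaller `segment_size` shifts work from the individual segments to the later merging passes, which is one way to picture the balance between individual-node computation and integration that the abstract refers to.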
Index Terms
Parallel Massive Clustering of Discrete Distributions