Abstract
k-means clustering is a widely used cluster analysis technique in machine learning. In this article, we study the problem of differentially private k-means clustering. Several state-of-the-art methods follow the single-workload approach, which adapts an existing machine-learning algorithm by making each step private. However, most of them do not achieve satisfactory empirical performance. In this work, we develop techniques to analyze the empirical error behavior of one state-of-the-art single-workload approach, DPLloyd, which is a differentially private version of the Lloyd algorithm for k-means clustering. Based on this analysis, we propose an improvement of DPLloyd. We also propose a new algorithm for k-means clustering that follows the noninteractive approach: it publishes a synopsis of the input dataset and then runs k-means on synthetic data generated from the synopsis. We denote this approach by EUGkM. After analyzing the empirical error behavior of EUGkM, we further propose a hybrid approach that combines our DPLloyd improvement and EUGkM. Results from extensive and systematic experiments support our analysis and demonstrate the effectiveness of the DPLloyd improvement, EUGkM, and the hybrid approach.
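To make the single-workload idea concrete, the following is a minimal sketch of one noisy Lloyd iteration. It is not the authors' implementation. The function name `dp_lloyd_step`, the per-iteration budget `eps_step`, the bound `r`, and the assumption that every coordinate lies in [-r, r] are illustrative choices. The sketch also assumes that neighboring datasets differ by adding or removing one point, so one point changes a single cluster count by 1 and a single cluster sum by at most r per coordinate.

```python
import numpy as np


def dp_lloyd_step(points, centroids, eps_step, r=1.0, rng=None):
    """One noisy Lloyd iteration (illustrative sketch, not the paper's code).

    Assumes every coordinate of `points` lies in [-r, r] and that
    `eps_step` is the privacy budget spent on this single iteration.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = points.shape
    k = centroids.shape[0]

    # Assign each point to its nearest current centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Under the stated assumptions the L1 sensitivity of all counts and
    # sums together is d * r + 1, so Laplace noise uses this scale.
    scale = (d * r + 1.0) / eps_step

    new_centroids = centroids.copy()
    for j in range(k):
        members = points[labels == j]
        noisy_count = len(members) + rng.laplace(0.0, scale)
        noisy_sum = members.sum(axis=0) + rng.laplace(0.0, scale, size=d)
        # Keep the old centroid when the noisy count is too small to divide by.
        if noisy_count >= 1.0:
            new_centroids[j] = np.clip(noisy_sum / noisy_count, -r, r)
    return new_centroids
```

Running several such steps would split a total budget across the iterations, for example `eps_step = eps / t` for `t` iterations, which is why the number of iterations directly affects how much noise each step receives.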
Differentially Private K-Means Clustering and a Hybrid Approach to Private Optimization