research-article

Differentially Private K-Means Clustering and a Hybrid Approach to Private Optimization

Published: 26 October 2017

Abstract

k-means clustering is a widely used clustering analysis technique in machine learning. In this article, we study the problem of differentially private k-means clustering. Several state-of-the-art methods follow the single-workload approach, which adapts an existing machine-learning algorithm by making each step private. However, most of them do not have satisfactory empirical performance. In this work, we develop techniques to analyze the empirical error behaviors of one of the state-of-the-art single-workload approaches, DPLloyd, which is a differentially private version of the Lloyd algorithm for k-means clustering. Based on the analysis, we propose an improvement of DPLloyd. We also propose a new algorithm for k-means clustering from the perspective of the noninteractive approach, which publishes a synopsis of the input dataset and then runs k-means on synthetic data generated from the synopsis. We denote this approach by EUGkM. After analyzing the empirical error behaviors of EUGkM, we further propose a hybrid approach that combines our DPLloyd improvement and EUGkM. Results from extensive and systematic experiments support our analysis and demonstrate the effectiveness of the DPLloyd improvement, EUGkM, and the hybrid approach.
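To make the single-workload idea concrete, here is a minimal sketch of a Lloyd-style iteration in which every released cluster count and coordinate sum is perturbed with Laplace noise. It is not the article's DPLloyd or its improvement. The function name `dp_lloyd_sketch`, the assumption that every coordinate lies in [-1, 1], the even split of the privacy budget across iterations, the random data-independent initialization, and the clipping step are all illustrative assumptions, since the abstract does not specify them.

```python
import numpy as np


def dp_lloyd_sketch(points, k, epsilon, iterations=5, seed=0):
    """Hypothetical sketch: Lloyd's algorithm with Laplace-noised counts and sums.

    Assumes every coordinate of `points` lies in [-1, 1].
    """
    rng = np.random.default_rng(seed)
    n, d = points.shape
    # Assumed calibration: adding or removing one point changes one count by 1
    # and one sum vector by at most d in L1 norm, in each of the iterations.
    scale = (d + 1) * iterations / epsilon
    # Data-independent initialization, so the starting centroids use no budget.
    centroids = rng.uniform(-1.0, 1.0, size=(k, d))
    for _ in range(iterations):
        # Assign each point to its nearest current centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            noisy_count = len(members) + rng.laplace(0.0, scale)
            noisy_sum = members.sum(axis=0) + rng.laplace(0.0, scale, size=d)
            # Post-processing only: skip the update when the noisy count is
            # too small, and clip the new centroid back into the data domain.
            if noisy_count >= 1.0:
                centroids[j] = np.clip(noisy_sum / noisy_count, -1.0, 1.0)
    return centroids


if __name__ == "__main__":
    demo_rng = np.random.default_rng(1)
    blobs = np.vstack([
        demo_rng.normal(-0.5, 0.1, size=(500, 2)),
        demo_rng.normal(0.5, 0.1, size=(500, 2)),
    ])
    data = np.clip(blobs, -1.0, 1.0)
    print(dp_lloyd_sketch(data, k=2, epsilon=1.0))
```

Because the noise scale grows with the number of iterations, running more rounds makes each round noisier under this assumed budget split. That is the kind of empirical error behavior the abstract says the article analyzes.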

      • Published in

        ACM Transactions on Privacy and Security, Volume 20, Issue 4
        November 2017
        150 pages
        ISSN:2471-2566
        EISSN:2471-2574
        DOI:10.1145/3143524

        Copyright © 2017 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 26 October 2017
        • Revised: 1 August 2017
        • Accepted: 1 August 2017
        • Received: 1 October 2016

        Qualifiers

        • research-article
        • Research
        • Refereed
