Abstract
Using smooth sensitivity techniques, we develop a method for \( k \)-nearest neighbor missing data imputation with differential privacy. This requires bounding the number of data-incomplete tuples whose data-complete “donor” can change when a single record is added to or deleted from the dataset. Because a single individual can affect many values in an imputed dataset, our mechanisms necessarily add more noise than mechanisms that ignore missing data, but we show empirically that this cost is significantly outweighed by the bias reduction from imputing missing data.
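To make the quantity being bounded concrete, the following minimal sketch counts how many incomplete tuples switch donor when one complete record is added. It is an illustration under simplifying assumptions, not the paper's mechanism: it uses a single nearest neighbor (k = 1) on one observed attribute, adds no noise, and all names (`nearest_donor`, `donors_changed`) and data values are hypothetical.

```python
# Illustrative only: non-private 1-nearest-neighbor donor imputation on one
# observed attribute. Names and data are hypothetical, not from the paper.

def nearest_donor(x, donors):
    """Index of the complete record whose observed value is closest to x.

    Ties are broken by the lower index so the result is deterministic.
    """
    return min(range(len(donors)), key=lambda i: (abs(donors[i][0] - x), i))


def assign_donors(incomplete, donors):
    """Donor index chosen for each incomplete tuple (given by its observed value)."""
    return [nearest_donor(x, donors) for x in incomplete]


def donors_changed(incomplete, donors, new_record):
    """Count incomplete tuples whose donor changes after adding one complete record."""
    before = assign_donors(incomplete, donors)
    after = assign_donors(incomplete, donors + [new_record])
    return sum(b != a for b, a in zip(before, after))


# Complete records: (observed attribute, value that is missing elsewhere).
donors = [(0.0, 10.0), (10.0, 50.0)]
# Incomplete tuples: only the observed attribute is known.
incomplete = [4.0, 4.5, 5.5, 6.0, 9.0]

# One added record that lands between the existing donors.
changed = donors_changed(incomplete, donors, (5.0, 30.0))
print(changed)
```

In this toy dataset the single added record becomes the donor for four of the five incomplete tuples, which is why the impact of one individual on an imputed dataset can be much larger than one row, and why a bound on this count is needed before calibrating noise.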