Abstract
The digitization of healthcare records has produced a large volume of functional scientific data that can help researchers understand the behaviour of many diseases. However, the privacy implications of this data, particularly genomic data, have surfaced recently, because the collection, dissemination, and analysis of human genomic data are highly sensitive. Multiple privacy attacks have exploited the uniqueness of the human genome to reveal the presence of a participant or a certain group in a dataset. Consequently, current data sharing policies rule out public dissemination and adopt precautionary measures prior to genomic data release, which hinders timely scientific innovation. In this article, we investigate an approach that releases only statistics from genomic data rather than the whole dataset, and we propose a generalized differentially private mechanism for Genome-wide Association Studies (GWAS). Our method provides a quantifiable privacy guarantee by adding noise to the intermediate outputs while retaining satisfactory accuracy of the private results. Furthermore, the proposed method offers multiple adjustable parameters that data owners can set according to their privacy requirements. These parameters act as equalizers that balance the privacy and utility of the GWAS. The method also incorporates an Online Bin Packing technique [1], which further bounds the privacy loss so that it grows linearly with the number of open bins and scales with the incoming queries. Finally, we implemented and benchmarked our approach on seven different GWAS studies to test the performance of the proposed methods. The experimental results demonstrate that, for 1,000 arbitrary online queries, our algorithms are more than 80% accurate with reasonable privacy loss and exceed the state-of-the-art approaches on multiple studies (i.e., EigenStrat, LMM, TDT).
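As a rough illustration of the two ingredients named in the abstract, the sketch below pairs the standard Laplace mechanism for a counting query with a First-Fit online bin packing of per-query privacy costs. It is not the authors' algorithm. The names `private_count`, `FirstFitBudget`, and `bin_capacity` are hypothetical, First-Fit stands in for the advice-based variant of [1], and treating each bin as a fixed epsilon budget is an assumption made only to show how a bound that grows linearly with the number of open bins could arise.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query (sensitivity 1)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


class FirstFitBudget:
    """Hypothetical accountant that packs per-query epsilons into bins.

    Each bin holds at most `bin_capacity` worth of epsilon (an assumption
    of this sketch, not a detail taken from the article).
    """

    def __init__(self, bin_capacity):
        self.bin_capacity = bin_capacity
        self.bins = []  # epsilon already consumed in each open bin

    def charge(self, epsilon):
        """Place one query's epsilon with First-Fit; return the bin index."""
        if epsilon <= 0 or epsilon > self.bin_capacity:
            raise ValueError("epsilon must be in (0, bin_capacity]")
        for i, used in enumerate(self.bins):
            # Small tolerance guards against floating-point round-off.
            if used + epsilon <= self.bin_capacity + 1e-12:
                self.bins[i] = used + epsilon
                return i
        self.bins.append(epsilon)  # no open bin fits: open a new one
        return len(self.bins) - 1

    def spent(self):
        """Total epsilon consumed so far (sequential composition)."""
        return sum(self.bins)

    def upper_bound(self):
        """Bound on total loss: linear in the number of open bins."""
        return len(self.bins) * self.bin_capacity


if __name__ == "__main__":
    rng = random.Random(7)
    budget = FirstFitBudget(bin_capacity=1.0)
    for eps in (0.5, 0.6, 0.5, 0.4):
        bin_index = budget.charge(eps)
        noisy = private_count(true_count=120, epsilon=eps, rng=rng)
        print(f"eps={eps} -> bin {bin_index}, noisy count {noisy:.2f}")
    print("spent:", budget.spent(), "bound:", budget.upper_bound())
```

In this sketch a smaller `epsilon` gives a larger noise scale, so the data owner trades accuracy for privacy per query, while `upper_bound()` caps the cumulative loss by the number of open bins times the bin capacity.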
References
- Joan Boyar, Shahin Kamali, Kim S. Larsen, and Alejandro López-Ortiz. 2016. Online bin packing with advice. Algorithmica 74, 1 (2016), 507--527.
- Robert H. Miller and Ida Sim. 2004. Physicians’ use of electronic medical records: Barriers and solutions. Health Affairs 23, 2 (2004), 116--126.
- Guy Paré, Louis Raymond, Ana Ortiz de Guinea, Placide Poba-Nzaou, Marie-Claude Trudel, Josianne Marsan, and Thomas Micheneau. 2015. Electronic health record usage behaviors in primary care medical practices: A survey of family physicians in Canada. Int. J. Med. Inform. 84, 10 (2015), 857--867.
- Muhammad Naveed, Erman Ayday, Ellen W. Clayton, Jacques Fellay, Carl A. Gunter, Jean-Pierre Hubaux, Bradley A. Malin, and XiaoFeng Wang. 2015. Privacy in the genomic era. ACM Comput. Surveys 48, 1 (2015), 6.
- Md Momin Al Aziz, Md Nazmus Sadat, Dima Alhadidi, Shuang Wang, Xiaoqian Jiang, Cheryl L. Brown, and Noman Mohammed. 2019. Privacy-preserving techniques of genomic data—A survey. Brief. Bioinform. 20, 3 (2019), 887--895. https://doi.org/10.1093/bib/bbx139
- Alexandros Mittos, Bradley Malin, and Emiliano De Cristofaro. 2019. Systematizing genome privacy research: A privacy-enhancing technologies perspective. Proc. Privacy Enhanc. Technol. 2019, 1 (2019), 87--107.
- Bradley Malin, Kenneth Goodman et al. 2018. Between access and privacy: Challenges in sharing health data. Yearbook Med. Info. 27, 1 (2018), 055--059.
- The Personal Information Protection and Electronic Documents Act (PIPEDA). [n.d.]. Retrieved from https://goo.gl/TScuoW.
- Peter Kilbridge. 2003. The cost of HIPAA compliance. New England J. Med. 348, 15 (2003), 1423.
- Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography. Springer, 265--284.
- Cynthia Dwork. 2006. Differential privacy. In Proceedings of the 33rd International Conference on Automata, Languages and Programming—Volume Part II (ICALP’06). 1--12.
- J. Hsu, M. Gaboardi, A. Haeberlen, S. Khanna, A. Narayan, B. C. Pierce, and A. Roth. 2014. Differential privacy: An economic method for choosing epsilon. In Proceedings of the IEEE 27th Computer Security Foundations Symposium. 398--410.
- Andreas Haeberlen, Benjamin C. Pierce, and Arjun Narayan. 2011. Differential privacy under fire. In Proceedings of the USENIX Security Symposium.
- Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart. 2014. Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. In Proceedings of the 23rd USENIX Security Symposium (USENIXSecurity’14). 17--32.
- Md Momin Al Aziz, Reza Ghasemi, Md Waliullah, and Noman Mohammed. 2017. Aftermath of bustamante attack on genomic beacon service. BMC Med. Genom. 10, 2 (2017), 43.
- Moritz Hardt and Guy N. Rothblum. 2010. A multiplicative weights mechanism for privacy-preserving data analysis. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS’10). IEEE, 61--70.
- Fei Yu, Michal Rybar, Caroline Uhler, and Stephen E. Fienberg. 2014. Differentially-private logistic regression for detecting multiple-SNP association in GWAS databases. In Proceedings of the International Conference on Privacy in Statistical Databases. Springer, 170--184.
- Shuang Wang, Noman Mohammed, and Rui Chen. 2014. Differentially private genome data dissemination through top-down specialization. BMC Med. Info. Decision Making 14, 1 (2014), S2.
- Caroline Uhler, Aleksandra Slavković, and Stephen E. Fienberg. 2013. Privacy-preserving data sharing for genome-wide association studies. J. Privacy Confidential. 5, 1 (2013), 137.
- Aaron Johnson and Vitaly Shmatikov. 2013. Privacy-preserving data exploration in genome-wide association studies. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1079--1087.
- Yuichi Sei and Akihiko Ohsuga. 2017. Privacy-preserving Chi-squared testing for genome SNP databases. In Proceedings of the 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC’17). IEEE, 3884--3889.
- Florian Tramèr, Zhicong Huang, Jean-Pierre Hubaux, and Erman Ayday. 2015. Differential privacy with bounded priors: Reconciling utility and privacy in genome-wide association studies. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 1286--1297.
- Sean Simmons and Bonnie Berger. 2016. Realizing privacy preserving genome-wide association studies. Bioinformatics 32, 9 (2016), 1293--1300.
- Fei Yu, Stephen E. Fienberg, Aleksandra B. Slavković, and Caroline Uhler. 2014. Scalable privacy-preserving data sharing methodology for genome-wide association studies. J. Biomed. Inform. 50 (2014), 133--141.
- Sean Simmons, Cenk Sahinalp, and Bonnie Berger. 2016. Enabling privacy-preserving GWASs in heterogeneous human populations. Cell Syst. 3, 1 (2016), 54--61.
- Meng Wang, Zhanglong Ji, Shuang Wang, Jihoon Kim, Hai Yang, Xiaoqian Jiang, and Lucila Ohno-Machado. 2017. Mechanisms to protect the privacy of families when using the transmission disequilibrium test in genome-wide association studies. Bioinformatics 33, 23 (2017), 3716--3725.
- Md Nazmus Sadat, Md Momin Al Aziz, Noman Mohammed, Feng Chen, Xiaoqian Jiang, and Shuang Wang. 2019. SAFETY: Secure GWAS in federated environment through a hYbrid solution. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 1 (2019), 93--102. DOI:10.1109/TCBB.2018.2829760
- Junfeng Fan and Frederik Vercauteren. 2012. Somewhat practical fully homomorphic encryption. IACR Cryptol. ePrint Arch. 2012 (2012), 144.
- Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9, 3–4 (2014), 211--407.
- Frank D. McSherry. 2009. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 19--30.
- Indrajit Roy, Srinath T. V. Setty, Ann Kilzer, Vitaly Shmatikov, and Emmett Witchel. 2010. Airavat: Security and privacy for MapReduce. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI’10), Vol. 10. 297--312.
- Jean Louis Raisaro, Juan Ramón Troncoso-Pastoriza, Mickaël Misbach, João Sá Sousa, Sylvain Pradervand, Edoardo Missiaglia, Olivier Michielin, Bryan Ford, and Jean-Pierre Hubaux. 2018. MedCo: Enabling secure and privacy-preserving exploration of distributed clinical and genomic data. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 4 (2018), 1328--1341.
- Jean Louis Raisaro, Gwangbae Choi, Sylvain Pradervand, Raphael Colsenet, Nathalie Jacquemont, Nicolas Rosat, Vincent Mooser, and Jean-Pierre Hubaux. 2018. Protecting privacy and security of genomic data in I2B2 with homomorphic encryption and differential privacy. IEEE/ACM Trans. Comput. Biol. Bioinform. 15, 5 (2018), 1413--1426.
- Greg Gibson. 2018. Population genetics and GWAS: A primer. PLoS Biol. 16, 3 (2018), e2005485.
- A. J. Paverd, Andrew Martin, and Ian Brown. 2014. Modelling and automatically analysing privacy properties for honest-but-curious adversaries. Technical Report.
- Harmonic Series. [n.d.]. Retrieved from https://en.wikipedia.org/wiki/Harmonic_series_(mathematics).
- Eric W. Weisstein. [n.d.]. Block-Stacking problem. https://mathworld.wolfram.com/BookStackingProblem.html.
- Peter Kairouz, Sewoong Oh, and Pramod Viswanath. 2017. The composition theorem for differential privacy. IEEE Trans. Info. Theory 63, 6 (2017), 4037--4049.
- Stanley L. Warner. 1965. Randomized response: A survey technique for eliminating evasive answer bias. J. Amer. Statist. Assoc. 60, 309 (1965), 63--69.
- Laura Clarke, Xiangqun Zheng-Bradley, Richard Smith, Eugene Kulesha, Chunlin Xiao, Iliana Toneva, Brendan Vaughan, Don Preuss, Rasko Leinonen, Martin Shumway, et al. 2012. The 1,000 genomes project: Data management and community access. Nature Methods 9, 5 (2012), 459.
- Differential Privacy GWAS-implementation. [n.d.]. Retrieved from https://github.com/mominbuet/DifferentialPrivacyGWAS.
- Lon R. Cardon and Lyle J. Palmer. 2003. Population stratification and spurious allelic association. Lancet 361, 9357 (2003), 598--604.
- Nour Almadhoun, Erman Ayday, and Özgür Ulusoy. 2020. Inference attacks against differentially private query results from genomic datasets including dependent tuples. Bioinformatics 36, Supplement 1 (2020), i136--i145.
- William S. Bush and Jason H. Moore. 2012. Genome-wide association studies. PLoS Comput. Biol. 8, 12 (2012), e1002822.
- Steven S. Seiden. 2002. On the online bin packing problem. J. ACM 49, 5 (2002), 640--671.
- M. R. Garey and D. S. Johnson. 1981. Approximation algorithms for Bin packing problems: A survey. In Analysis and Design of Algorithms in Combinatorial Optimization. International Centre for Mechanical Sciences (Courses and Lectures), vol 266, G. Ausiello and M. Lucertini (Eds.). Springer. DOI:https://doi.org/10.1007/978-3-7091-2748-3_8
Index Terms
Online Algorithm for Differentially Private Genome-wide Association Studies