Abstract
Encrypted deduplication combines encryption and deduplication to achieve both data security and storage efficiency. State-of-the-art encrypted deduplication systems mainly build on deterministic encryption to preserve deduplication effectiveness. However, deterministic encryption reveals the underlying frequency distribution of the original plaintext chunks, which allows an adversary to launch frequency analysis against the ciphertext chunks and infer the content of the original plaintext chunks. In this article, we study how frequency analysis affects information leakage in encrypted deduplication, from both attack and defense perspectives. Specifically, we target backup workloads and propose a new inference attack that exploits chunk locality to increase the coverage of inferred chunks. We further combine the new inference attack with knowledge of chunk sizes and show that it is effective against variable-size chunks. We conduct trace-driven evaluation on both real-world and synthetic datasets and show that our proposed attacks infer a significant fraction of plaintext chunks under backup workloads. To defend against frequency analysis, we present two defense approaches, namely MinHash encryption and scrambling. Our trace-driven evaluation shows that the combined MinHash encryption and scrambling scheme effectively mitigates the severity of the inference attacks, while maintaining high storage efficiency and incurring limited metadata access overhead.
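To make the leakage concrete, the following is a minimal toy sketch of classical rank-based frequency analysis against deterministically encrypted chunks. It is not the locality-based attack proposed in the article. The function names, the toy key, and the sample chunks are hypothetical, and a keyed hash stands in for deterministic encryption purely so that identical plaintext chunks map to identical ciphertext values.

```python
import hashlib
from collections import Counter


def deterministic_encrypt(chunk: bytes) -> str:
    # Assumption: a keyed hash is used as a toy stand-in for deterministic
    # encryption. It is not real encryption; it only preserves the property
    # that equal plaintext chunks yield equal ciphertext values.
    return hashlib.sha256(b"toy-key" + chunk).hexdigest()


def rank_by_frequency(items):
    # Order distinct items from most to least frequent; ties are broken by
    # the item value so the ranking is deterministic.
    counts = Counter(items)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ordered]


def frequency_analysis(ciphertexts, auxiliary_plaintexts):
    # Pair the i-th most frequent ciphertext chunk with the i-th most
    # frequent plaintext chunk from the adversary's auxiliary data.
    cipher_ranked = rank_by_frequency(ciphertexts)
    plain_ranked = rank_by_frequency(auxiliary_plaintexts)
    return dict(zip(cipher_ranked, plain_ranked))


# Hypothetical data: the adversary knows an older backup (auxiliary) and
# observes the ciphertext chunks of a newer backup with similar frequencies.
auxiliary = [b"A"] * 5 + [b"B"] * 3 + [b"C"] * 1
target_plain = [b"A"] * 6 + [b"B"] * 2 + [b"C"] * 1
target_cipher = [deterministic_encrypt(c) for c in target_plain]

guess = frequency_analysis(target_cipher, auxiliary)
correct = sum(
    1 for c in set(target_plain) if guess[deterministic_encrypt(c)] == c
)
print(f"correctly inferred {correct} of {len(set(target_plain))} unique chunks")
```

In this toy case the frequency ranks of the two backups coincide, so every unique chunk is inferred. Real workloads contain many chunks with equal or near-equal frequencies, which is where plain rank matching becomes unreliable.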
Information Leakage in Encrypted Deduplication via Frequency Analysis: Attacks and Defenses