Information Leakage in Encrypted Deduplication via Frequency Analysis: Attacks and Defenses

Published: 29 March 2020

Abstract

Encrypted deduplication combines encryption and deduplication to simultaneously achieve both data security and storage efficiency. State-of-the-art encrypted deduplication systems mainly build on deterministic encryption to preserve deduplication effectiveness. However, such deterministic encryption reveals the underlying frequency distribution of the original plaintext chunks. This allows an adversary to launch frequency analysis against the ciphertext chunks and infer the content of the original plaintext chunks. In this article, we study how frequency analysis affects information leakage in encrypted deduplication, from both attack and defense perspectives. Specifically, we target backup workloads and propose a new inference attack that exploits chunk locality to increase the coverage of inferred chunks. We further combine the new inference attack with the knowledge of chunk sizes and show its attack effectiveness against variable-size chunks. We conduct trace-driven evaluation on both real-world and synthetic datasets and show that our proposed attacks infer a significant fraction of plaintext chunks under backup workloads. To defend against frequency analysis, we present two defense approaches, namely MinHash encryption and scrambling. Our trace-driven evaluation shows that our combined MinHash encryption and scrambling scheme effectively mitigates the severity of the inference attacks, while maintaining high storage efficiency and incurring limited metadata access overhead.
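To make the leakage concrete, the following is a minimal sketch of classical rank-based frequency analysis against deterministically encrypted chunks. It is not the article's locality-based attack, which is described only at a high level here. The function names, the key, and the use of a keyed SHA-256 hash as a stand-in for deterministic encryption are all assumptions made for illustration. The adversary is assumed to hold an older backup as auxiliary plaintext data.

```python
import hashlib
from collections import Counter


def deterministic_encrypt(chunk: bytes, key: bytes = b"hypothetical-key") -> str:
    """Toy stand-in for deterministic encryption (an assumption, not a real cipher):
    identical chunks always map to identical outputs."""
    return hashlib.sha256(key + chunk).hexdigest()


def rank_by_frequency(items):
    """Return the distinct items ordered from most to least frequent."""
    return [item for item, _ in Counter(items).most_common()]


def frequency_analysis(ciphertext_chunks, auxiliary_plaintext_chunks):
    """Pair the i-th most frequent ciphertext chunk with the
    i-th most frequent chunk of the auxiliary plaintext data."""
    cipher_ranked = rank_by_frequency(ciphertext_chunks)
    plain_ranked = rank_by_frequency(auxiliary_plaintext_chunks)
    return dict(zip(cipher_ranked, plain_ranked))


# Hypothetical workload: an older backup known to the adversary and a
# newer backup of which only the ciphertext chunks are observed.
prior_backup = [b"A", b"A", b"A", b"B", b"B", b"C"]
latest_backup = [b"A", b"A", b"A", b"A", b"B", b"B", b"D"]

ciphertexts = [deterministic_encrypt(c) for c in latest_backup]
guesses = frequency_analysis(ciphertexts, prior_backup)

# Count how many distinct chunks of the latest backup were inferred correctly.
distinct = set(latest_backup)
correct = sum(1 for c in distinct if guesses.get(deterministic_encrypt(c)) == c)
print(f"correctly inferred {correct} of {len(distinct)} distinct chunks")
```

In this toy workload the frequent chunks keep their rank across backups and are recovered, while the new chunk is mis-inferred. This illustrates why preserved frequency distributions leak content and why rank matching alone covers only part of the chunks.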

Published in

ACM Transactions on Storage, Volume 16, Issue 1
ATC 2019 Special Section and Regular Papers
February 2020
155 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3386184
Editor: Sam H. Noh

Copyright © 2020 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 29 March 2020
• Accepted: 1 October 2019
• Revised: 1 August 2019
• Received: 1 March 2019


Qualifiers

• research-article
• Research
• Refereed
