Abstract
Deduplication decreases the physical occupancy of files in a storage volume by removing duplicate copies of data chunks, but creates data-sharing dependencies that complicate standard storage management tasks. Specifically, data migration plans must consider the dependencies between files that are remapped to new volumes and files that are not. Thus far, only greedy approaches have been suggested for constructing such plans, and it is unclear how they compare to one another and how much they can be improved.
We set out to bridge this gap for seeding, a form of migration in which the target volume is initially empty. We prove that even this basic instance of data migration is NP-hard in the presence of deduplication. We then present GoSeed, a formulation of seeding as an integer linear programming (ILP) problem, and three acceleration methods for applying it to real-sized storage volumes. Our experimental evaluation shows that, while the greedy approaches perform well on “easy” problem instances, the cost of their solution can be significantly higher than that of GoSeed’s solution on “hard” instances, for which they are sometimes unable to find a solution at all.
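To make the data-sharing dependencies concrete, the following toy sketch may help. It is not GoSeed's ILP formulation, which the abstract does not spell out. It is an exhaustive search over a hypothetical three-file volume, under assumptions introduced here for illustration only: every chunk occupies one unit of space, a plan must free at least `min_migrated` chunks from the source, and its cost is the number of chunks that end up stored on both volumes. The names `files`, `seeding_outcome`, and `best_plan` are likewise hypothetical.

```python
from itertools import combinations

# Hypothetical toy volume: file name -> set of chunk IDs it references.
# Assumption: every chunk occupies one unit of physical space.
files = {
    "a": {1, 2, 3},
    "b": {3, 4},   # shares chunk 3 with file "a"
    "c": {5, 6},
}

def seeding_outcome(files, remapped):
    """Classify chunks when the files in `remapped` move to an empty target.

    Returns (migrated, replicated): chunks freed from the source, and
    chunks that must exist on both volumes because a file that stays
    behind still references them.
    """
    moved = set().union(*(files[f] for f in remapped))
    kept = set().union(*(c for f, c in files.items() if f not in remapped))
    return moved - kept, moved & kept

def best_plan(files, min_migrated):
    """Try every non-empty file subset (exponential; toy instances only).

    Returns (plan, replicated_count) for the plan that frees at least
    `min_migrated` chunks with the fewest replicated chunks, or None
    if no subset frees enough space.
    """
    best = None
    names = sorted(files)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            migrated, replicated = seeding_outcome(files, set(subset))
            if len(migrated) < min_migrated:
                continue  # plan does not free enough space
            if best is None or len(replicated) < best[1]:
                best = (frozenset(subset), len(replicated))
    return best

print(seeding_outcome(files, {"a"}))  # chunk 3 is replicated, 1 and 2 are freed
print(best_plan(files, 3))            # moving "a" and "b" together avoids replication
```

In this toy instance, moving file "a" alone forces chunk 3 onto both volumes, whereas moving "a" and "b" together keeps the shared chunk in one place. The search visits every subset, so it only scales to a handful of files; a solver-backed formulation is what makes real-sized volumes tractable.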
Index Terms
GoSeed: Optimal Seeding Plan for Deduplicated Storage