research-article

GoSeed: Optimal Seeding Plan for Deduplicated Storage

Published: 16 August 2021

Abstract

Deduplication decreases the physical occupancy of files in a storage volume by removing duplicate copies of data chunks, but creates data-sharing dependencies that complicate standard storage management tasks. Specifically, data migration plans must consider the dependencies between files that are remapped to new volumes and files that are not. Thus far, only greedy approaches have been suggested for constructing such plans, and it is unclear how they compare to one another and how much they can be improved.
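The dependency described above can be illustrated with a minimal sketch. The file names, chunk IDs, and the fixed chunk size are hypothetical and only serve the illustration; they are not taken from the article.

```python
# Hypothetical toy volume: file name -> set of chunk IDs it references.
FILES = {
    "a.img": {"c1", "c2", "c3"},
    "b.img": {"c2", "c3", "c4"},
    "c.img": {"c5"},
}
CHUNK_SIZE = 4096  # assumed fixed chunk size, in bytes


def chunks_of(names):
    """Union of the chunks referenced by the given files."""
    return set().union(*(FILES[n] for n in names))


def logical_size(names):
    """Size of the files if every chunk reference were stored separately."""
    return sum(len(FILES[n]) for n in names) * CHUNK_SIZE


def physical_size(names):
    """Size after deduplication: each distinct chunk is stored once."""
    return len(chunks_of(names)) * CHUNK_SIZE


def shared_with_rest(remapped):
    """Chunks referenced both by remapped files and by files that stay."""
    staying = set(FILES) - set(remapped)
    return chunks_of(remapped) & chunks_of(staying)


print(logical_size(FILES), physical_size(FILES))
print(sorted(shared_with_rest({"a.img"})))
```

In this toy volume, remapping `a.img` alone leaves chunks `c2` and `c3` shared with `b.img`, so they cannot simply be moved; remapping `c.img` creates no such dependency.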

We set out to bridge this gap for seeding, that is, migration in which the target volume is initially empty. We prove that even this basic instance of data migration is NP-hard in the presence of deduplication. We then present GoSeed, a formulation of seeding as an integer linear programming (ILP) problem, and three acceleration methods for applying it to real-sized storage volumes. Our experimental evaluation shows that, while the greedy approaches perform well on “easy” problem instances, the cost of their solution can be significantly higher than that of GoSeed’s solution on “hard” instances, for which they are sometimes unable to find a solution at all.
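To make the optimization problem concrete, the following is a toy exhaustive search over a tiny volume. It is not GoSeed's ILP formulation, which the abstract does not spell out. The cost model is an assumption made for illustration: a chunk counts as migrated if only remapped files reference it, as replicated if both remapped and remaining files reference it, and a plan is feasible if the number of migrated chunks is within `slack` of `target`. All names and data are hypothetical.

```python
from itertools import combinations

# Hypothetical toy volume: file name -> set of chunk IDs (unit-size chunks).
FILES = {
    "f1": {1, 2, 3},
    "f2": {3, 4},
    "f3": {4, 5, 6},
    "f4": {7},
}


def seeding_plans(files, target, slack):
    """Yield (replicated, migrated, remapped) for every feasible file subset."""
    names = sorted(files)
    for k in range(len(names) + 1):
        for remapped in combinations(names, k):
            moved = set().union(*(files[f] for f in remapped))
            kept = set().union(*(files[f] for f in names if f not in remapped))
            migrated = len(moved - kept)    # chunks that leave the source
            replicated = len(moved & kept)  # chunks needed on both volumes
            if abs(migrated - target) <= slack:
                yield replicated, migrated, remapped


def best_plan(files, target, slack=0):
    """Feasible plan with the fewest replicated chunks, or None if none exists."""
    return min(seeding_plans(files, target, slack), default=None)


print(best_plan(FILES, target=1))
print(best_plan(FILES, target=6))
print(best_plan(FILES, target=8))
```

The search enumerates all 2^n subsets of files, which is only workable for a handful of files; this is consistent with the NP-hardness result stated above and suggests why a solver-based formulation with acceleration methods is of interest for realistic volumes.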


Published in

ACM Transactions on Storage, Volume 17, Issue 3
August 2021
227 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3477268
Editor: Sam H. Noh

Copyright © 2021 Association for Computing Machinery.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 16 August 2021
• Accepted: 1 March 2021
• Revised: 1 December 2020
• Received: 1 September 2020

Qualifiers

• research-article
• Refereed