research-article

Cluster and Single-Node Analysis of Long-Term Deduplication Patterns

Published: 11 May 2018

Abstract

Deduplication has become essential in disk-based backup systems, but there have been few long-term studies of backup workloads. Most past studies either were of a small static snapshot or covered only a short period that was not representative of how a backup system evolves over time. For this article, we first collected 21 months of data from a shared user file system; 33 users and over 4,000 snapshots are covered. We then analyzed the dataset, examining a variety of essential characteristics across two dimensions: single-node deduplication and cluster deduplication. For single-node deduplication analysis, our primary focus was individual-user data. Despite apparently similar roles and behavior among all of our users, we found significant differences in their deduplication ratios. Moreover, the data that some users share with others had a much higher deduplication ratio than average. For cluster deduplication analysis, we implemented seven published data-routing algorithms and created a detailed comparison of their performance with respect to deduplication ratio, load distribution, and communication overhead. We found that per-file routing achieves a higher deduplication ratio than routing by super-chunk (multiple consecutive chunks), but it also leads to high data skew (imbalance of space usage across nodes). We also found that large chunking sizes are better for cluster deduplication, as they significantly reduce data-routing overhead, while their negative impact on deduplication ratios is small and acceptable. We draw interesting conclusions from both single-node and cluster deduplication analysis and make recommendations for future deduplication systems design.
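The metrics named in the abstract can be made concrete with a small simulation. The following is a minimal sketch, not the authors' implementation and not one of the seven published algorithms they compare. It assumes a fixed chunk size, defines the deduplication ratio as logical chunks divided by stored chunks, defines data skew as the fullest node's usage divided by the mean node usage, and routes each unit (a whole file or a super-chunk) by its smallest chunk fingerprint. The names `fingerprint`, `route`, `simulate`, and `CHUNK_SIZE` are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional, Set

CHUNK_SIZE = 8 * 1024  # assumed fixed chunk size in bytes (illustrative only)


def fingerprint(chunk: str) -> int:
    """Stable integer fingerprint of a chunk's content (first 8 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(chunk.encode()).digest()[:8], "big")


def route(unit: List[int], num_nodes: int) -> int:
    """Assumed routing rule: pick a node from the smallest fingerprint in the unit."""
    return min(unit) % num_nodes


def simulate(files: List[List[str]], num_nodes: int,
             super_chunk_len: Optional[int] = None) -> Dict[str, object]:
    """Route files to nodes and report deduplication ratio and data skew.

    super_chunk_len=None routes each file as one unit (per-file routing);
    otherwise each file is split into runs of that many consecutive chunks.
    Each node deduplicates only against the chunks it already stores.
    """
    nodes: List[Set[int]] = [set() for _ in range(num_nodes)]
    logical_chunks = 0
    for chunks in files:
        fps = [fingerprint(c) for c in chunks]
        logical_chunks += len(fps)
        if super_chunk_len is None:
            units = [fps]
        else:
            units = [fps[i:i + super_chunk_len]
                     for i in range(0, len(fps), super_chunk_len)]
        for unit in units:
            if unit:  # skip empty files
                nodes[route(unit, num_nodes)].update(unit)

    stored = [len(n) for n in nodes]
    total_stored = sum(stored)
    if total_stored == 0:
        raise ValueError("no chunks were stored")
    return {
        "dedup_ratio": logical_chunks / total_stored,
        "skew": max(stored) / (total_stored / num_nodes),
        "stored_bytes": [s * CHUNK_SIZE for s in stored],
    }
```

Calling `simulate([["a", "b", "c"], ["a", "b", "d"]], 1)` stores four unique chunks out of six logical ones, so the single-node deduplication ratio is 1.5. Calling it with more nodes, with and without `super_chunk_len`, shows how splitting data across nodes can lower the ratio and change the skew. A toy input like this says nothing about the article's measured results.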

Published in

ACM Transactions on Storage, Volume 14, Issue 2
May 2018
210 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3208078
Editor: Sam H. Noh

Copyright © 2018 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 11 May 2018
• Accepted: 1 January 2018
• Revised: 1 December 2017
• Received: 1 September 2017

Qualifiers

• research-article
• Research
• Refereed