Abstract
Deduplication has become essential in disk-based backup systems, but there have been few long-term studies of backup workloads. Most past studies either examined a small static snapshot or covered only a short period that was not representative of how a backup system evolves over time. For this article, we first collected 21 months of data from a shared user file system, covering 33 users and over 4,000 snapshots. We then analyzed the dataset, examining a variety of essential characteristics across two dimensions: single-node deduplication and cluster deduplication. For the single-node analysis, our primary focus was individual-user data. Despite apparently similar roles and behavior among all of our users, we found significant differences in their deduplication ratios. Moreover, the data that some users share with others had a much higher deduplication ratio than average. For the cluster analysis, we implemented seven published data-routing algorithms and compared their performance in detail with respect to deduplication ratio, load distribution, and communication overhead. We found that per-file routing achieves a higher deduplication ratio than routing by super-chunk (multiple consecutive chunks), but it also leads to high data skew (imbalance of space usage across nodes). We also found that large chunk sizes are better for cluster deduplication, as they significantly reduce data-routing overhead, while their negative impact on deduplication ratios is small and acceptable. We draw conclusions from both the single-node and cluster analyses and make recommendations for the design of future deduplication systems.
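To make the routing trade-off concrete, the following is a minimal sketch and not one of the seven algorithms evaluated in the article. It assumes equal-sized chunks (so sizes are counted in chunks), picks the target node from the smallest chunk fingerprint in each routing unit, and measures data skew as the largest node's usage divided by the mean usage. The function names, the routing rule, and the skew formula are all illustrative assumptions.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of one chunk (SHA-1 hex digest)."""
    return hashlib.sha1(chunk).hexdigest()


def node_for(fingerprints, num_nodes):
    """Assumed routing rule: smallest fingerprint in the unit, modulo node count."""
    return int(min(fingerprints), 16) % num_nodes


def simulate(files, num_nodes, super_chunk_len=None):
    """Route files to nodes and return (deduplication ratio, data skew).

    files: list of files, each a list of chunk byte strings.
    super_chunk_len: None routes each whole file as one unit (per-file routing);
    an integer routes groups of that many consecutive chunks (super-chunks).
    Each node deduplicates only against chunks it already stores.
    """
    stored = [set() for _ in range(num_nodes)]
    logical = 0  # chunks written before deduplication
    for chunks in files:
        fps = [fingerprint(c) for c in chunks]
        if not fps:
            continue
        logical += len(fps)
        step = super_chunk_len or len(fps)
        for i in range(0, len(fps), step):
            unit = fps[i:i + step]
            stored[node_for(unit, num_nodes)].update(unit)
    per_node = [len(s) for s in stored]
    physical = sum(per_node)  # chunks kept after per-node deduplication
    if physical == 0:
        return 1.0, 1.0
    dedup_ratio = logical / physical
    skew = max(per_node) / (physical / num_nodes)  # assumed skew metric
    return dedup_ratio, skew


if __name__ == "__main__":
    backup = [[b"a", b"b", b"c", b"d"], [b"a", b"b", b"c", b"d"]]
    print("per-file:   ", simulate(backup, num_nodes=4))
    print("super-chunk:", simulate(backup, num_nodes=4, super_chunk_len=2))
```

In this sketch, per-file routing places every chunk of a file on a single node, which keeps duplicates together but concentrates space usage, while super-chunk routing spreads a file across nodes at the risk of storing the same chunk on more than one node.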
Index Terms
Cluster and Single-Node Analysis of Long-Term Deduplication Patterns