Abstract
Inline deduplication removes redundant data in real time as the data is sent to the storage system. However, it causes data fragmentation: after deduplication, logically consecutive chunks are physically scattered across many containers. Many rewrite algorithms aim to alleviate the resulting performance degradation by rewriting fragmented duplicate chunks as unique chunks into new containers. Unfortunately, these algorithms decide whether a chunk is fragmented based on a simple, preset fixed value, ignoring how data characteristics vary between data segments. As a result, they often fail to select an appropriate set of old containers for rewriting, so the containers retrieved when backups are restored hold a substantial number of invalid chunks.
To address this issue, we propose an inline deduplication approach for storage systems, called InDe, which uses a greedy algorithm to detect valid container utilization and dynamically adjusts the number of old container references in each segment. InDe fully leverages the distribution of duplicate chunks to improve restore performance while maintaining high backup performance. We define an effectiveness metric, valid container referenced counts (VCRC), to identify appropriate containers for rewriting. We design a rewrite algorithm, F-greedy, that detects valid container utilization in order to rewrite low-VCRC containers. Based on the VCRC distribution of containers, F-greedy dynamically adjusts the number of old container references so that each segment shares duplicate chunks only with high-utilization containers, thereby improving restore speed. To take full advantage of these features, we further propose a second rewrite algorithm, F-greedy+, based on adaptive interval detection of valid container utilization. F-greedy+ estimates the valid utilization of old containers more accurately by detecting trends in VCRC change in two directions and selecting referenced containers at global scope. We quantitatively evaluate InDe using three real-world backup workloads. The experimental results show that, compared with two state-of-the-art algorithms (Capping and SMR), our scheme improves restore speed by 1.3×–2.4× while achieving almost the same backup performance.
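The idea of letting the number of referenced old containers vary per segment, rather than fixing it in advance, can be illustrated with a minimal sketch. It is not F-greedy or F-greedy+: the abstract does not give the VCRC formula or the stopping rule, so the sketch assumes a simple per-segment reference count as a stand-in for VCRC and a hypothetical `coverage` parameter as the stopping rule. The function names `select_containers` and `plan_rewrites` are likewise illustrative.

```python
from collections import Counter


def select_containers(segment, coverage=0.8):
    """Greedily pick the old containers a segment keeps referencing.

    segment:  list of container IDs, one per duplicate chunk in the segment.
    coverage: hypothetical target fraction of duplicate chunks that the kept
              containers must account for (an assumption, not from the paper).
    """
    counts = Counter(segment)  # per-container reference count (VCRC stand-in)
    total = len(segment)
    kept, covered = set(), 0
    # Visit containers from most to least referenced; ties broken by ID.
    for cid, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered >= coverage * total:
            break  # remaining containers are sparsely referenced
        kept.add(cid)
        covered += n
    return kept


def plan_rewrites(segment, coverage=0.8):
    """Return one flag per chunk: True if the chunk would be rewritten."""
    kept = select_containers(segment, coverage)
    return [cid not in kept for cid in segment]


# Two segments with different reference distributions.
spread = [1, 1, 1, 1, 2, 2, 2, 3, 4, 1]
skewed = [7] * 9 + [8]
print(select_containers(spread))  # keeps two containers
print(select_containers(skewed))  # keeps one container
```

With the same `coverage` value, the first segment keeps two old containers and the second keeps one, whereas a fixed cap would apply the same limit to both regardless of how their duplicate chunks are distributed.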
InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container Utilization