Abstract
Data deduplication techniques construct an index of fingerprint entries to identify and eliminate duplicate copies of repeating data. Two challenging issues in data deduplication are the bottleneck of disk-based index lookups and the data fragmentation caused by eliminating duplicate chunks. Deduplication-based backup systems generally employ containers that store contiguous chunks together with their fingerprints to preserve data locality and alleviate these two issues, but this measure remains inadequate. To address the two issues, we propose a container-utilization-based hot fingerprint entry distilling strategy to improve the performance of deduplication-based backup systems. We divide the index into three parts: hot fingerprint entries, fragmented fingerprint entries, and useless fingerprint entries. A container whose utilization falls below a given threshold is called a sparse container. Fingerprint entries that point to non-sparse containers are hot fingerprint entries. Among the remaining entries, an entry that matches the fingerprint of a forthcoming backup chunk is classified as a fragmented fingerprint entry; otherwise, it is classified as a useless fingerprint entry. We observe that hot fingerprint entries account for only a small fraction of the index, whereas the remaining entries account for the majority. This observation inspires us to develop a hot fingerprint entry distilling approach named HID. HID segregates useless fingerprint entries from the index to improve memory utilization and bypass disk accesses. In addition, HID separates fragmented fingerprint entries so that a deduplication-based backup system directly rewrites fragmented chunks, thereby alleviating the adverse effects of fragmentation. Moreover, HID introduces a feature that treats fragmented chunks as unique chunks.
This feature compensates for the fact that a Bloom filter cannot directly identify certain duplicate chunks (i.e., the fragmented chunks). To take full advantage of this feature, we propose an evolved HID strategy called EHID. EHID incorporates a Bloom filter to which only hot fingerprints are mapped. Consequently, EHID exhibits two salient features: (i) it avoids disk accesses when identifying unique chunks and fragmented chunks, and (ii) it slashes the false positive rate of the integrated Bloom filter. These features place EHID in a high-efficiency mode of operation. Our experimental results show that our approach reduces the average memory overhead of the index by 34.11% and 25.13% on the Linux dataset and the FSL dataset, respectively. Furthermore, compared with the state-of-the-art method HAR, EHID boosts the average backup throughput by up to a factor of 2.25 on the Linux dataset and reduces the average disk I/O traffic by up to 66.21% on the FSL dataset. EHID also marginally improves the system's restore performance.
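The three-way split of the index described above can be sketched as a simple classification rule. The following is a minimal, hypothetical illustration (the `Entry` type, the 0.5 threshold, and all names are illustrative assumptions, not the paper's implementation):

```python
from collections import namedtuple

# Illustrative fingerprint-index entry: a chunk fingerprint plus the ID
# of the container holding the chunk.
Entry = namedtuple("Entry", ["fingerprint", "container_id"])

# Assumed threshold: containers with utilization below it are "sparse".
UTILIZATION_THRESHOLD = 0.5

def classify_entry(entry, container_utilization, upcoming_fingerprints):
    """Return 'hot', 'fragmented', or 'useless' for one fingerprint entry.

    container_utilization: dict mapping container_id -> utilization in [0, 1].
    upcoming_fingerprints: set of fingerprints expected in forthcoming backups.
    """
    if container_utilization[entry.container_id] >= UTILIZATION_THRESHOLD:
        return "hot"          # points to a non-sparse container
    if entry.fingerprint in upcoming_fingerprints:
        return "fragmented"   # sparse container, but still referenced soon
    return "useless"          # sparse container, never referenced again
```

For example, with `util = {"c1": 0.9, "c2": 0.1}`, `classify_entry(Entry("fp-a", "c1"), util, set())` yields `"hot"`, `classify_entry(Entry("fp-b", "c2"), util, {"fp-b"})` yields `"fragmented"`, and `classify_entry(Entry("fp-c", "c2"), util, set())` yields `"useless"`. Only the hot entries are kept in the distilled in-memory index; fragmented entries trigger direct rewrites of their chunks, and useless entries are segregated entirely.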
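EHID's second feature follows from standard Bloom filter behavior: the false positive rate is approximately (1 − e^(−kn/m))^k for n inserted items, m bits, and k hash functions, so mapping only the (few) hot fingerprints shrinks n and thus the false positive rate. A generic textbook-style Bloom filter, not the paper's implementation, can sketch the lookup path (all parameters and names here are assumptions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter using k salted SHA-256 hashes over m bits."""

    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# Only hot fingerprints are inserted; a negative lookup then proves the
# incoming chunk is either unique or fragmented, with no disk access.
hot_filter = BloomFilter(m_bits=8192, k_hashes=4)
hot_filter.add("hot-fingerprint")
```

Under this scheme, `might_contain(fp) == False` lets the system write the chunk directly (covering both unique and fragmented chunks), while a positive hit warrants an index lookup only for the small hot-entry portion of the index.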
Improving the Performance of Deduplication-Based Backup Systems via Container Utilization Based Hot Fingerprint Entry Distilling