
Improving the Performance of Deduplication-Based Backup Systems via Container Utilization Based Hot Fingerprint Entry Distilling

Published: 15 October 2021

Abstract

Data deduplication techniques construct an index of fingerprint entries to identify and eliminate duplicate copies of repeating data. Two challenging issues in data deduplication are the bottleneck of disk-based index lookup and the data fragmentation caused by eliminating duplicate chunks. To alleviate both, deduplication-based backup systems generally employ containers that store contiguous chunks together with their fingerprints to preserve data locality, but this remains inadequate. To address these two issues, we propose a container utilization based hot fingerprint entry distilling strategy to improve the performance of deduplication-based backup systems.

We divide the index into three parts: hot fingerprint entries, fragmented fingerprint entries, and useless fingerprint entries. A container with utilization smaller than a given threshold is called a sparse container. Fingerprint entries that point to non-sparse containers are hot fingerprint entries. Of the remaining fingerprint entries, one that matches any fingerprint of forthcoming backup chunks is classified as a fragmented fingerprint entry; otherwise, it is classified as a useless fingerprint entry. We observe that hot fingerprint entries account for a small part of the index, whereas the remaining fingerprint entries account for the majority of the index.

This observation inspires us to develop a hot fingerprint entry distilling approach named HID. HID segregates useless fingerprint entries from the index to improve memory utilization and bypass disk accesses. In addition, HID separates fragmented fingerprint entries so that a deduplication-based backup system directly rewrites fragmented chunks, thereby alleviating adverse fragmentation. Moreover, HID introduces a feature that treats fragmented chunks as unique chunks. This feature compensates for the shortcoming that a Bloom filter cannot directly identify certain duplicated chunks (i.e., the fragmented chunks).

To take full advantage of this feature, we propose an evolved HID strategy called EHID. EHID incorporates a Bloom filter to which only hot fingerprints are mapped. In doing so, EHID exhibits two salient features: (i) EHID avoids disk accesses when identifying unique chunks and fragmented chunks; (ii) EHID reduces the false positive rate of the integrated Bloom filter. Together, these features make EHID highly efficient.

Our experimental results show that our approach reduces the average memory overhead of the index by 34.11% and 25.13% on the Linux dataset and the FSL dataset, respectively. Furthermore, compared with the state-of-the-art method HAR, EHID boosts the average backup throughput by up to a factor of 2.25 on the Linux dataset and reduces the average disk I/O traffic by up to 66.21% on the FSL dataset. EHID also marginally improves the system's restore performance.
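The three-way classification and the hot-only Bloom filter described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `FingerprintEntry`, `classify_entry`, `BloomFilter`, and `build_hot_filter` are hypothetical, the utilization values and the 0.5 threshold are made up, and the set of forthcoming fingerprints is simply passed in by the caller, since the abstract does not say how it is obtained.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum


class EntryKind(Enum):
    HOT = "hot"
    FRAGMENTED = "fragmented"
    USELESS = "useless"


@dataclass(frozen=True)
class FingerprintEntry:
    fingerprint: str   # chunk fingerprint (hypothetical string form)
    container_id: int  # container the entry points to


def classify_entry(entry, utilization, threshold, upcoming_fingerprints):
    """Classify one index entry following the abstract's definitions."""
    # A container is sparse when its utilization is below the threshold,
    # so entries pointing to non-sparse containers are hot.
    if utilization[entry.container_id] >= threshold:
        return EntryKind.HOT
    # Remaining entries: fragmented if a forthcoming chunk matches them.
    if entry.fingerprint in upcoming_fingerprints:
        return EntryKind.FRAGMENTED
    return EntryKind.USELESS


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative assumptions, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit for simplicity

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


def build_hot_filter(entries, utilization, threshold, upcoming_fingerprints):
    """Map only hot fingerprints into the filter, as EHID is described."""
    bloom = BloomFilter()
    for entry in entries:
        kind = classify_entry(entry, utilization, threshold, upcoming_fingerprints)
        if kind is EntryKind.HOT:
            bloom.add(entry.fingerprint)
    return bloom


# Example usage with made-up values.
entries = [
    FingerprintEntry("fp-a", container_id=1),
    FingerprintEntry("fp-b", container_id=2),
    FingerprintEntry("fp-c", container_id=2),
]
utilization = {1: 0.9, 2: 0.2}  # container 2 is sparse at threshold 0.5
upcoming = {"fp-b"}             # fingerprints expected in the next backup
kinds = {
    e.fingerprint: classify_entry(e, utilization, 0.5, upcoming) for e in entries
}
hot_filter = build_hot_filter(entries, utilization, 0.5, upcoming)
```

Because fragmented and useless fingerprints are never inserted, a negative answer from the filter covers both unique chunks and fragmented chunks without a disk access, and fewer inserted items mean fewer set bits, which is consistent with the lower false positive rate the abstract reports.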



          • Published in

          ACM Transactions on Storage, Volume 17, Issue 4
          November 2021
          201 pages
          ISSN: 1553-3077
          EISSN: 1553-3093
          DOI: 10.1145/3487989
          • Editor: Sam H. Noh

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 15 October 2021
          • Accepted: 1 March 2021
          • Revised: 1 January 2021
          • Received: 1 October 2020


          Qualifiers

          • research-article
          • Refereed
