research-article

InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container Utilization

Published: 11 January 2023

Abstract

Inline deduplication removes redundant data in real time as data is sent to the storage system. However, it causes data fragmentation: after deduplication, logically consecutive chunks are physically scattered across many containers. Many rewrite algorithms try to alleviate the resulting performance degradation by rewriting fragmented duplicate chunks as unique chunks into new containers. Unfortunately, these algorithms decide whether a chunk is fragmented using a simple preset fixed value, ignoring how data characteristics vary between data segments. As a result, they often fail to select an appropriate set of old containers for rewrite, so restoring a backup retrieves containers that hold a substantial number of invalid chunks.
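To make the fixed-value approach concrete, the following is a minimal sketch of one such rule: a fixed per-segment cap on the number of old containers a segment may reference, in the spirit of capping-style rewriting. The function name `fixed_cap_rewrite`, the `(chunk_id, container_id)` segment layout, and the tie-breaking rule are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def fixed_cap_rewrite(segment, cap):
    """Split a segment's duplicate chunks into kept references and rewrites.

    segment: list of (chunk_id, container_id) pairs; container_id is None
             for a unique chunk that has no stored copy yet (assumed layout).
    cap:     fixed maximum number of old containers the segment may reference.
    """
    # Count how many chunks of this segment each old container supplies.
    refs = Counter(cid for _, cid in segment if cid is not None)
    # Keep the `cap` most-referenced containers; ties go to the lower id.
    ranked = sorted(refs, key=lambda cid: (-refs[cid], cid))
    kept = set(ranked[:cap])
    # Duplicates stored in any other container are rewritten as unique chunks.
    rewritten = [chunk for chunk, cid in segment
                 if cid is not None and cid not in kept]
    return kept, rewritten
```

Because `cap` is the same for every segment, a segment whose duplicates are concentrated in a few containers and a segment whose duplicates are spread thinly are treated identically, which is the limitation described above.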

To address this issue, we propose an inline deduplication approach for storage systems, called InDe, which uses a greedy algorithm to detect valid container utilization and dynamically adjusts the number of old container references in each segment. InDe fully leverages the distribution of duplicate chunks to improve restore performance while maintaining high backup performance. We define an effectiveness metric, valid container referenced counts (VCRC), to identify appropriate containers for rewrite. We design a rewrite algorithm, F-greedy, that detects valid container utilization in order to rewrite low-VCRC containers. Based on the VCRC distribution of containers, F-greedy dynamically adjusts the number of old container references so that each segment shares duplicate chunks only with high-utilization containers, thereby improving restore speed. To take full advantage of these features, we further propose a second rewrite algorithm, F-greedy+, based on adaptive interval detection of valid container utilization. F-greedy+ estimates the valid utilization of old containers more accurately by detecting the trend of VCRC changes in two directions and selecting referenced containers in the global scope. We quantitatively evaluate InDe using three real-world backup workloads. The experimental results show that, compared with two state-of-the-art algorithms (Capping and SMR), our scheme improves restore speed by 1.3×–2.4× while achieving almost the same backup performance.
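The idea of adjusting the number of old container references per segment can be sketched as follows. This is not the F-greedy algorithm itself, whose details the abstract does not give. It is a simplified greedy selection under stated assumptions: VCRC is approximated by the number of chunks in the segment that reference a container, and containers are kept in descending order of that count until they cover a hypothetical `coverage` fraction of the segment's duplicate chunks. The names `greedy_select` and `coverage` are illustrative.

```python
from collections import Counter


def greedy_select(segment, coverage=0.8):
    """Pick old containers per segment from the segment's own distribution.

    segment:  list of (chunk_id, container_id) pairs; container_id is None
              for a unique chunk (assumed layout).
    coverage: assumed target fraction of duplicate chunks that the kept
              containers must supply; NOT a parameter from the paper.
    """
    # Per-segment reference count per container (a stand-in for VCRC).
    refs = Counter(cid for _, cid in segment if cid is not None)
    total = sum(refs.values())
    kept, covered = set(), 0
    # Greedy step: take the most-referenced container first, stop once the
    # kept containers supply the target share of duplicate chunks.
    for cid, count in sorted(refs.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered >= coverage * total:
            break
        kept.add(cid)
        covered += count
    # Duplicates held in the remaining containers are rewritten.
    rewritten = [chunk for chunk, cid in segment
                 if cid is not None and cid not in kept]
    return kept, rewritten
```

Unlike a fixed cap, the number of containers kept here depends on how the segment's duplicate chunks are distributed: a segment dominated by one container keeps one reference, while a segment split between two well-used containers keeps both.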

Published in

ACM Transactions on Storage, Volume 19, Issue 1
February 2023
259 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3578369


Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 11 January 2023
• Online AM: 19 November 2022
• Accepted: 25 July 2022
• Revised: 8 July 2022
• Received: 11 January 2022

Qualifiers

• research-article
• Refereed
