research-article
Open Access

Boosting the Restoring Performance of Deduplication Data by Classifying Backup Metadata

Published: 21 April 2021

Abstract

Restoring data is the main purpose of data backup in storage systems. Fragmentation, which arises when logically contiguous data is physically scattered across many disk locations, degrades the restoring performance of a deduplication system. Rewriting algorithms alleviate fragmentation and thereby improve the restoring speed of a deduplication system. However, rewriting methods substantially sacrifice the deduplication ratio, which wastes a large amount of storage space. Furthermore, traditional backup approaches treat file metadata and chunk metadata identically, which causes frequent on-disk metadata accesses.

In this article, we start by analyzing the storage characteristics of backup metadata. We find that, for 10 million files, the file metadata occupies only about 340 MB. Motivated by this finding, we propose a Classified-Metadata based Restoring method (CMR) that classifies backup metadata into file metadata and chunk metadata. Because the file metadata occupies so little space, CMR maintains all file metadata in memory, whereas chunk metadata are aggressively prefetched into memory in a greedy manner. A deduplication system with CMR in place exhibits three salient features: (i) it avoids the additional overhead of rewriting algorithms by reducing the number of disk reads in a restoring process, (ii) it increases the restoring throughput without sacrificing the deduplication ratio, and (iii) it fully leverages the hardware resources to boost the restoring performance. To quantitatively evaluate the performance of CMR, we compare CMR against two state-of-the-art approaches, namely, a history-aware rewriting method (HAR) and a context-based rewriting scheme (CAP). The experimental results show that, compared to HAR and CAP, CMR reduces the restoring time by 27.2% and 29.3%, respectively. Moreover, the deduplication ratio is improved by 1.91% and 4.36%, respectively.
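The metadata split described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `ChunkMeta`, `MetadataStore`, `prefetch_budget`, and `bytes_per_file`, the dictionary-based "disk", and the simple stop-at-budget prefetch policy are all hypothetical, since the abstract does not specify CMR's data structures or prefetching strategy. The sketch keeps file metadata entirely in memory, greedily prefetches chunk metadata into a cache, and counts how many lookups still miss the cache during a restore. It also works out the average file-metadata footprint implied by the reported figures, assuming 1 MB = 10^6 bytes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChunkMeta:
    """Hypothetical chunk metadata: where one chunk lives on disk."""
    fingerprint: str
    container_id: int
    offset: int
    length: int


@dataclass
class MetadataStore:
    """Illustrative split of backup metadata into two classes (assumed layout)."""
    # File metadata: path -> ordered chunk fingerprints. Held fully in memory.
    file_meta: dict = field(default_factory=dict)
    # Chunk metadata as it would sit on disk (simulated here by a dict).
    chunk_meta_on_disk: dict = field(default_factory=dict)
    # In-memory cache of prefetched chunk metadata.
    chunk_cache: dict = field(default_factory=dict)
    # Lookups during restore that were not served from the cache.
    cache_misses: int = 0

    def prefetch(self, path, prefetch_budget):
        """Greedily cache chunk metadata for `path` until the budget is reached."""
        for fp in self.file_meta[path]:
            if len(self.chunk_cache) >= prefetch_budget:
                break
            self.chunk_cache.setdefault(fp, self.chunk_meta_on_disk[fp])

    def lookup(self, fp):
        """Return chunk metadata; a miss falls back to the simulated disk."""
        if fp not in self.chunk_cache:
            self.cache_misses += 1
            return self.chunk_meta_on_disk[fp]
        return self.chunk_cache[fp]

    def restore_plan(self, path):
        """List the (container_id, offset, length) reads needed to rebuild a file."""
        metas = (self.lookup(fp) for fp in self.file_meta[path])
        return [(m.container_id, m.offset, m.length) for m in metas]


def bytes_per_file(total_bytes, num_files):
    """Average file-metadata footprint per file."""
    return total_bytes / num_files


# Usage with made-up data: one file made of four chunk references.
store = MetadataStore()
store.chunk_meta_on_disk = {
    "a": ChunkMeta("a", 0, 0, 4096),
    "b": ChunkMeta("b", 0, 4096, 4096),
    "c": ChunkMeta("c", 1, 0, 8192),
}
store.file_meta["/docs/report.txt"] = ["a", "b", "c", "a"]
store.prefetch("/docs/report.txt", prefetch_budget=2)
plan = store.restore_plan("/docs/report.txt")
print(plan, store.cache_misses)

# About 340 MB for 10 million files, taking 1 MB as 10**6 bytes (an assumption).
print(bytes_per_file(340 * 10**6, 10_000_000))
```

With a budget of two entries, only the lookup for chunk `c` misses the cache in this toy run; a larger budget would remove that miss as well. The per-file figure of roughly 34 bytes is simply the abstract's two numbers divided, and it shifts slightly if MB is read as 2^20 bytes.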



Published in

ACM/IMS Transactions on Data Science, Volume 2, Issue 2
May 2021
149 pages
ISSN: 2691-1922
DOI: 10.1145/3454114

Copyright © 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 21 April 2021
• Accepted: 1 November 2020
• Revised: 1 September 2020
• Received: 1 April 2020

Qualifiers

• research-article
• Refereed
