Abstract
Restoring data is the ultimate purpose of backing up data in storage systems. Fragmentation, which arises when logically continuous data are physically scattered across many disk locations, degrades the restore performance of a deduplication system. Rewriting algorithms alleviate the fragmentation problem and thereby improve the restore speed of a deduplication system. However, rewriting methods incur a significant sacrifice in deduplication ratio, wasting a large amount of storage space. Furthermore, traditional backup approaches treat file metadata and chunk metadata identically, which causes frequent on-disk metadata accesses.
In this article, we start by analyzing the storage characteristics of backup metadata. An intriguing finding shows that with 10 million files, the file metadata take up only approximately 340 MB. Motivated by this finding, we propose a Classified-Metadata based Restoring method (CMR) that classifies backup metadata into file metadata and chunk metadata. Because file metadata occupy such a meager amount of space, CMR maintains all file metadata in memory, whereas chunk metadata are aggressively prefetched into memory in a greedy manner. A deduplication system with CMR in place exhibits three salient features: (i) it avoids the additional overhead of rewriting algorithms by reducing the number of disk reads during a restore, (ii) it increases the restore throughput without sacrificing the deduplication ratio, and (iii) it fully leverages the hardware resources to boost the restore performance. To quantitatively evaluate the performance of CMR, we compare it against two state-of-the-art approaches, namely, a history-aware rewriting method (HAR) and a context-based rewriting scheme (CAP). The experimental results show that compared to HAR and CAP, CMR reduces the restore time by 27.2% and 29.3%, respectively. Moreover, the deduplication ratio is improved by 1.91% and 4.36%, respectively.
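The classification idea described above can be illustrated with a toy sketch. This is not the authors' CMR implementation; the names `file_recipes`, `chunk_index`, `containers`, and `restore` are assumptions made for illustration. File metadata (file recipes) are kept entirely in memory, while chunk data live in simulated on-disk containers that are prefetched greedily, a whole container per disk read:

```python
# Toy sketch of classified-metadata restore (illustrative only, not the
# authors' CMR implementation).

# Sanity check on the abstract's figure: ~340 MB of file metadata for
# 10 million files implies only a few tens of bytes per file.
per_file_bytes = 340 * 1024 * 1024 / 10_000_000
assert 30 < per_file_bytes < 40  # roughly 36 bytes per file

# File metadata: path -> ordered list of chunk fingerprints (in memory).
file_recipes = {
    "/docs/a.txt": ["fp1", "fp2"],
    "/docs/b.txt": ["fp2", "fp3"],
}
# Chunk metadata: fingerprint -> id of the container holding the chunk.
chunk_index = {"fp1": 0, "fp2": 0, "fp3": 1}
# Simulated on-disk containers of chunk data.
containers = {0: {"fp1": b"A", "fp2": b"B"}, 1: {"fp3": b"C"}}

disk_reads = 0
cache = {}  # containers prefetched into memory

def restore(path):
    """Rebuild a file from its in-memory recipe; each container is read
    from 'disk' at most once, because it is prefetched whole into cache."""
    global disk_reads
    data = b""
    for fp in file_recipes[path]:
        cid = chunk_index[fp]
        if cid not in cache:
            cache[cid] = containers[cid]  # greedy whole-container prefetch
            disk_reads += 1
        data += cache[cid][fp]
    return data

print(restore("/docs/a.txt"))  # b'AB' after one disk read (container 0)
print(restore("/docs/b.txt"))  # b'BC' after one more read (container 1)
```

Because chunk `fp2` is shared by both files, the second restore reuses the cached container instead of issuing another disk read, which is the effect the abstract attributes to reduced on-disk metadata and chunk accesses.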
- Deepavali Bhagwat, Kave Eshghi, Darrell D. E. Long, and Mark Lillibridge. 2009. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proceedings of the IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS’09). IEEE, 1–9.
- Pramod Bhatotia, Rodrigo Rodrigues, and Akshat Verma. 2012. Shredder: GPU-accelerated incremental storage and computation. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’12).
- Zhichao Cao, Shiyong Liu, Fenggang Wu, Guohua Wang, Bingzhe Li, and David H. C. Du. 2019. Sliding look-back window assisted data chunk rewriting for improving deduplication restore performance. In Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST’19). 129–142.
- Biplob Debnath, Sudipta Sengupta, and Jin Li. 2010. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference (ATC’10).
- Yuhui Deng. 2011. What is the future of disk drives, death or rebirth? Comput. Surv. 43, 3 (2011), 1–28.
- Yuhui Deng, Xinyu Huang, Liangshan Song, Yongtao Zhou, and Frank Wang. 2017. Memory deduplication: An effective approach to improve the memory system. J. Inf. Sci. Eng. 33, 5 (2017), 1103–1120.
- Yuhui Deng and Frank Wang. 2008. Exploring the performance impact of stripe size on network attached storage systems. J. Syst. Arch. 54, 8 (2008), 787–796.
- Kave Eshghi, Mark Lillibridge, Lawrence Wilcock, Guillaume Belrose, and Rycharde Hawkes. 2007. Jumbo store: Providing efficient incremental upload and versioning for a utility rendering service. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’07). 123–138.
- Min Fu, Dan Feng, Yu Hua, Xubin He, Zuoning Chen, Wen Xia, Fangting Huang, and Qing Liu. 2014. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference (ATC’14). USENIX Association, 181–192.
- Fanglu Guo and Petros Efstathopoulos. 2011. Building a high-performance deduplication system. In Proceedings of the USENIX Conference on USENIX Annual Technical Conference (ATC’11).
- James Hamilton. 2009. The cost of latency. Perspectives Blog. https://perspectives.mvdirona.com/2009/10/the-cost-of-latency/.
- Michal Kaczmarczyk, Marcin Barczynski, Wojciech Kilian, and Cezary Dubnicki. 2012. Reducing impact of data fragmentation caused by in-line deduplication. In Proceedings of the 5th Annual International Systems and Storage Conference. ACM, 15.
- Ron Kohavi, Randal M. Henne, and Dan Sommerfield. 2007. Practical guide to controlled experiments on the web: Listen to your customers not to the hippo. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 959–967.
- Erik Kruus, Cristian Ungureanu, and Cezary Dubnicki. 2010. Bimodal content defined chunking for backup streams. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’10). 239–252.
- Mark Lillibridge, Kave Eshghi, and Deepavali Bhagwat. 2013. Improving restore speed for backup systems that use inline chunk-based deduplication. In Proceedings of the 2013 USENIX Conference on USENIX Annual Technical Conference (ATC’13). 183–198.
- Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezis, and Peter Camble. 2009. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’09). 111–123.
- Chuanyi Liu, Yibo Xue, Dapeng Ju, and Dongsheng Wang. 2009. A novel optimization method to improve de-duplication storage system performance. In Proceedings of the 15th International Conference on Parallel and Distributed Systems (ICPADS’09). IEEE, 228–235.
- Dirk Meister, André Brinkmann, and Tim Süß. 2013. File recipe compression in data deduplication systems. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’13). 175–182.
- Hao Meng, Weizhe Zhang, Yiming Wang, Dong Li, Wen Xia, Hao Wang, and Chen Lou. 2019. Multi-parameter performance modeling based on machine learning with basic block features. In Proceedings of the 17th IEEE International Symposium on Parallel and Distributed Processing with Applications. IEEE, 316–323.
- Athicha Muthitacharoen, Benjie Chen, and David Mazières. 2001. A low-bandwidth network file system. ACM SIGOPS Operating Systems Review 35, 5 (2001), 174–187.
- Young Jin Nam, Dongchul Park, and David H. C. Du. 2012. Assuring demanded read performance of data deduplication storage with backup datasets. In Proceedings of the 20th International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS’12). IEEE, 201–208.
- Sean Quinlan and Sean Dorward. 2002. Venti: A new approach to archival storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’02). 89–101.
- R. Schulman. 2004. Disaster recovery issues and solutions. Hitachi Data Systems White Paper (2004).
- Kiran Srinivasan, Timothy Bisson, Garth R. Goodson, and Kaladhar Voruganti. 2012. iDedup: Latency-aware, inline data deduplication for primary storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’12). 1–14.
- W. C. Preston. 2010. Restoring deduped data in deduplication systems. http://searchdatabackup.techtarget.com/feature/Restoringdeduped-data-in-deduplication-systems.
- Zhuowei Wang, Lianglun Cheng, Hao Wang, Wuqing Zhao, and Xiaoyu Song. 2019. Energy optimization by software prefetching for task granularity in GPU-based embedded systems. IEEE Transactions on Industrial Electronics 67, 6 (2019), 5120–5131.
- Chunlin Wu, Xingqin Lin, Daren Yu, Wei Xu, and Luoqing Li. 2014. End-to-end delay minimization for scientific workflows in clouds under budget constraint. IEEE Transactions on Cloud Computing 3, 2 (2014), 169–181.
- Wen Xia, Hong Jiang, Dan Feng, and Yu Hua. 2011. SiLo: A similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput. In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference (ATC’11). 26–28.
- Wen Xia, Hong Jiang, Dan Feng, Lei Tian, Min Fu, and Zhongtao Wang. 2012. P-dedupe: Exploiting parallelism in data deduplication system. In Proceedings of the 7th International Conference on Networking, Architecture and Storage (NAS’12). IEEE, 338–347.
- Yongtao Zhou, Yuhui Deng, Junjie Xie, and Laurence Yang. 2018. EPAS: A sampling based similarity identification algorithm for the cloud. IEEE Transactions on Cloud Computing 6 (2018), 720–733.
- Benjamin Zhu, Kai Li, and Hugo Patterson. 2008. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST’08).
Boosting the Restoring Performance of Deduplication Data by Classifying Backup Metadata