Abstract
As the volume of data produced grows rapidly and new types of storage devices emerge, storage tiering remains a popular way to balance performance against cost. In a basic two-tier storage system, a higher-performance and typically more expensive tier (the fast tier) stores frequently accessed (active) data, while the bulk of the less-active data are stored in a lower-performance, lower-cost tier (the slow tier). Data are migrated between the two tiers according to their activity. In this article, we propose a Tier-aware Data Deduplication-based File System, called TDDFS, which can operate efficiently on top of a two-tier storage environment.
Specifically, to achieve better performance, nearly all file operations are performed in the fast tier. To achieve higher cost-effectiveness, files that are no longer active are migrated from the fast tier to the slow tier, and data deduplication is applied during this migration. What distinguishes our design is that, where possible, it keeps the non-redundant (unique) chunks produced by data deduplication in both tiers. When a file is reloaded from the slow tier to the fast tier (a reloaded file), any of its data chunks that already exist in the fast tier need not be migrated from the slow tier. Our evaluation shows that TDDFS achieves close to the best overall performance among various file-tiering designs for two-tier storage systems.
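The migrate-with-deduplication and reload behavior described above can be illustrated with a small in-memory sketch. It is not the TDDFS implementation: the class `TwoTierStore`, its method names, the 4-byte fixed-size chunking, and the use of SHA-256 fingerprints are all assumptions made for illustration, and both tiers are modeled as Python dictionaries.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical and deliberately tiny, so the example is easy to trace


def chunk(data: bytes):
    """Split data into fixed-size chunks (an assumption; other chunking schemes exist)."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


def fingerprint(piece: bytes) -> str:
    """Identify a chunk by its SHA-256 digest (an assumed choice of hash)."""
    return hashlib.sha256(piece).hexdigest()


class TwoTierStore:
    """Toy model of a fast tier and a slow tier with chunk-level deduplication."""

    def __init__(self):
        self.fast_files = {}   # name -> bytes, active files held in the fast tier
        self.fast_chunks = {}  # fingerprint -> chunk, unique chunks kept in the fast tier
        self.slow_chunks = {}  # fingerprint -> chunk, unique chunks kept in the slow tier
        self.recipes = {}      # name -> ordered fingerprints of a migrated file

    def write(self, name: str, data: bytes) -> None:
        """File operations happen in the fast tier."""
        self.fast_files[name] = data

    def migrate_to_slow(self, name: str) -> int:
        """Deduplicate an inactive file into the slow tier; return chunks written."""
        data = self.fast_files.pop(name)
        recipe, written = [], 0
        for piece in chunk(data):
            fp = fingerprint(piece)
            recipe.append(fp)
            if fp not in self.slow_chunks:  # store each unique chunk only once
                self.slow_chunks[fp] = piece
                written += 1
        self.recipes[name] = recipe
        return written

    def reload_to_fast(self, name: str):
        """Rebuild a file in the fast tier; return (data, chunks moved from slow tier)."""
        moved = 0
        for fp in self.recipes[name]:
            if fp not in self.fast_chunks:  # skip chunks already in the fast tier
                self.fast_chunks[fp] = self.slow_chunks[fp]
                moved += 1
        data = b"".join(self.fast_chunks[fp] for fp in self.recipes[name])
        return data, moved


store = TwoTierStore()
store.write("a", b"AAAABBBBCCCC")
store.write("b", b"AAAABBBBDDDD")
print(store.migrate_to_slow("a"))     # 3: every chunk is new to the slow tier
print(store.migrate_to_slow("b"))     # 1: only DDDD is new
print(store.reload_to_fast("a")[1])   # 3: the fast tier holds no chunks yet
print(store.reload_to_fast("b")[1])   # 1: AAAA and BBBB are already in the fast tier
```

In this sketch, reloading file "b" moves a single chunk because the two chunks it shares with file "a" were left in the fast tier by the earlier reload, which is the saving the design aims for.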