TDDFS: A Tier-Aware Data Deduplication-Based File System

Published: 05 February 2019

Abstract

With the rapid increase in the amount of data produced and the development of new types of storage devices, storage tiering continues to be a popular way to achieve a good tradeoff between performance and cost-effectiveness. In a basic two-tier storage system, a higher-performance and typically higher-cost tier (the fast tier) stores frequently accessed (active) data, while large amounts of less-active data are stored in the lower-performance, lower-cost tier (the slow tier). Data are migrated between these two tiers according to their activity. In this article, we propose a Tier-aware Data Deduplication-based File System, called TDDFS, which can operate efficiently on top of a two-tier storage environment.

Specifically, to achieve better performance, nearly all file operations are performed in the fast tier. To achieve higher cost-effectiveness, files that are no longer active are migrated from the fast tier to the slow tier, and this migration is performed with data deduplication. The distinctive feature of our design is that, where possible, it keeps the non-redundant (unique) chunks produced by data deduplication in both tiers. When a file is reloaded from the slow tier to the fast tier (a reloaded file), any of its data chunks that already exist in the fast tier need not be migrated from the slow tier. Our evaluation shows that TDDFS achieves close to the best overall performance among various file-tiering designs for two-tier storage systems.
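
The migrate-and-reload behavior described in the abstract can be illustrated with a small in-memory sketch. This is not the TDDFS implementation: the class `TwoTierStore`, its method names, the fixed 4-byte chunk size, and the use of SHA-256 fingerprints are all assumptions made for illustration, since the abstract does not specify the chunking or fingerprinting scheme.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical, tiny fixed-size chunks for readability


def split_chunks(data, size=CHUNK_SIZE):
    """Split bytes into fixed-size chunks (assumed chunking scheme)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk):
    """Identify a chunk by its SHA-256 digest (assumed fingerprint)."""
    return hashlib.sha256(chunk).hexdigest()


class TwoTierStore:
    """Toy model of a fast tier and a deduplicated slow tier."""

    def __init__(self):
        self.fast_files = {}   # name -> whole file bytes (active files)
        self.fast_chunks = {}  # fingerprint -> chunk present in fast tier
        self.slow_chunks = {}  # fingerprint -> unique chunk in slow tier
        self.recipes = {}      # name -> ordered fingerprints of the file

    def write(self, name, data):
        """New and active files live whole in the fast tier."""
        self.fast_files[name] = data

    def migrate_to_slow(self, name):
        """Deduplicate an inactive file into the slow tier.

        Returns the number of chunks actually written to the slow tier.
        """
        data = self.fast_files.pop(name)
        recipe, written = [], 0
        for chunk in split_chunks(data):
            fp = fingerprint(chunk)
            recipe.append(fp)
            if fp not in self.slow_chunks:  # store each unique chunk once
                self.slow_chunks[fp] = chunk
                written += 1
        self.recipes[name] = recipe
        return written

    def reload_to_fast(self, name):
        """Reload a file, skipping chunks already in the fast tier.

        Returns the number of chunks fetched from the slow tier.
        """
        fetched = 0
        for fp in self.recipes[name]:
            if fp not in self.fast_chunks:
                # The chunk stays in the slow tier too, so both tiers hold it.
                self.fast_chunks[fp] = self.slow_chunks[fp]
                fetched += 1
        return fetched

    def read(self, name):
        """Read a file from the fast tier (whole, or rebuilt from chunks)."""
        if name in self.fast_files:
            return self.fast_files[name]
        return b"".join(self.fast_chunks[fp] for fp in self.recipes[name])


store = TwoTierStore()
store.write("a", b"AAAABBBBCCCC")
store.write("b", b"AAAABBBBDDDD")
print(store.migrate_to_slow("a"))  # 3 unique chunks written
print(store.migrate_to_slow("b"))  # 1: only DDDD is new to the slow tier
print(store.reload_to_fast("a"))   # 3 chunks fetched from the slow tier
print(store.reload_to_fast("b"))   # 1: AAAA and BBBB are already in the fast tier
```

In this toy model, the second reload moves only one chunk across tiers because the other two were brought into the fast tier by the first reload, which is the saving the abstract attributes to keeping unique chunks in both tiers.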

Published in

ACM Transactions on Storage, Volume 15, Issue 1
Special Issue on ACM International Systems and Storage Conference (SYSTOR) 2018
February 2019
194 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3311821
Editor: Sam H. Noh

Copyright © 2019 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 5 February 2019
• Accepted: 1 November 2018
• Revised: 1 September 2018
• Received: 1 December 2017

Qualifiers

• research-article
• Research
• Refereed
