Abstract
We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of redundancy. We found that whole-file deduplication achieves about three quarters of the space savings of the most aggressive block-level deduplication for storage of live file systems, and 87% of the savings for backup images. We also studied file fragmentation, finding that it is not prevalent, and updated prior file system metadata studies, finding that the distribution of file sizes continues to skew toward very large unstructured files.
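The difference between the two granularities can be illustrated with a minimal sketch. It is not the measurement method used in the study: it assumes fixed-size blocks, SHA-256 as the content hash, and in-memory byte strings standing in for files, and the function names are hypothetical.

```python
import hashlib


def whole_file_unique_bytes(files):
    """Bytes stored when byte-identical files are kept only once."""
    seen = {}
    for data in files:
        # Key each file by the hash of its full contents.
        seen.setdefault(hashlib.sha256(data).digest(), len(data))
    return sum(seen.values())


def block_unique_bytes(files, block_size=4096):
    """Bytes stored when identical fixed-size blocks are kept only once."""
    seen = {}
    for data in files:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            seen.setdefault(hashlib.sha256(block).digest(), len(block))
    return sum(seen.values())


def savings(total_bytes, stored_bytes):
    """Fraction of the original bytes eliminated by deduplication."""
    return 1 - stored_bytes / total_bytes if total_bytes else 0.0


if __name__ == "__main__":
    # Three toy "files" built from distinct 16-byte blocks (illustrative data).
    a, b, c = bytes(range(16)), bytes(range(16, 32)), bytes(range(32, 48))
    files = [a + b, a + b, a + c]
    total = sum(len(f) for f in files)

    whole = savings(total, whole_file_unique_bytes(files))
    block = savings(total, block_unique_bytes(files, block_size=16))
    print(f"whole-file savings: {whole:.2f}")
    print(f"block-level savings: {block:.2f}")
    print(f"whole-file share of block-level savings: {whole / block:.2f}")
```

In this toy input the two identical files are caught at either granularity, while the block shared by the third file is caught only at the block level. The last printed ratio is the same kind of quantity the abstract reports as about three quarters for live file systems and 87% for backup images.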
A study of practical deduplication