Abstract
For five years, we collected annual snapshots of file-system metadata from over 60,000 Windows PC file systems in a large corporation. In this article, we use these snapshots to study temporal changes in file size, file age, file-type frequency, directory size, namespace structure, file-system population, storage capacity and consumption, and degree of file modification. We present a generative model that explains the namespace structure and the distribution of directory sizes. We find significant temporal trends relating to the popularity of certain file types, the origin of file content, the way the namespace is used, and the degree of variation among file systems, as well as more pedestrian changes in sizes and capacities. We give examples of consequent lessons for designers of file systems and related software.