research-article

Understanding and Improving Computational Science Storage Access through Continuous Characterization

Published: 01 October 2011

Abstract

Computational science applications are driving a demand for increasingly powerful storage systems. While many techniques are available for capturing the I/O behavior of individual application trial runs and specific components of the storage system, continuous characterization of a production system remains a daunting challenge for systems with hundreds of thousands of compute cores and multiple petabytes of storage. As a result, these storage systems are often designed without a clear understanding of the diverse computational science workloads they will support.

In this study, we outline a methodology for scalable, continuous, system-wide I/O characterization that combines storage device instrumentation, static file system analysis, and a new mechanism for capturing detailed application-level behavior. This methodology allows us to identify both system-wide trends and application-specific I/O strategies. We demonstrate the effectiveness of our methodology by performing a multilevel, two-month study of Intrepid, a 557-teraflop IBM Blue Gene/P system. During that time, we captured application-level I/O characterizations from 6,481 unique jobs spanning 38 science and engineering projects. We used the results of our study to tune example applications, highlight trends that impact the design of future storage systems, and identify opportunities for improvement in I/O characterization methodology.
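The abstract does not say how application-level behavior is captured cheaply enough to run on every job. One common way to make such capture scalable, assumed here for illustration, is to keep a small, fixed set of counters per file per process instead of a full event trace, and to sum those counters across processes when the job ends. The Python sketch below shows that idea only; `FileCounters`, `reduce_job`, and the access-size bucket edges are hypothetical names and values, not the paper's mechanism or the API of any real tool.

```python
import bisect
from dataclasses import dataclass, field

# Hypothetical access-size bucket edges in bytes (illustrative only).
# Bucket 0 holds sizes < 100; bucket i holds [EDGES[i-1], EDGES[i]); the last bucket is >= 4 MiB.
BUCKET_EDGES = [100, 1024, 10 * 1024, 100 * 1024, 1024 * 1024, 4 * 1024 * 1024]


@dataclass
class FileCounters:
    """Fixed-size summary of one process's I/O to one file (hypothetical layout)."""
    opens: int = 0
    reads: int = 0
    writes: int = 0
    bytes_read: int = 0
    bytes_written: int = 0
    size_hist: list = field(default_factory=lambda: [0] * (len(BUCKET_EDGES) + 1))

    def record(self, op: str, nbytes: int = 0) -> None:
        """Update counters in place; no per-operation record is kept."""
        if op == "open":
            self.opens += 1
            return
        if op == "read":
            self.reads += 1
            self.bytes_read += nbytes
        elif op == "write":
            self.writes += 1
            self.bytes_written += nbytes
        else:
            raise ValueError(f"unknown op: {op}")
        self.size_hist[bisect.bisect_right(BUCKET_EDGES, nbytes)] += 1

    def merge(self, other: "FileCounters") -> None:
        """Add another process's counters for the same file into this one."""
        self.opens += other.opens
        self.reads += other.reads
        self.writes += other.writes
        self.bytes_read += other.bytes_read
        self.bytes_written += other.bytes_written
        self.size_hist = [a + b for a, b in zip(self.size_hist, other.size_hist)]


def reduce_job(per_rank: list) -> dict:
    """Combine per-rank {path: FileCounters} maps into one job-level summary."""
    job = {}
    for rank_files in per_rank:
        for path, counters in rank_files.items():
            job.setdefault(path, FileCounters()).merge(counters)
    return job


# Usage: two ranks each write 1 MiB to a shared file; rank 0 also reads a small config file.
ranks = []
for rank in range(2):
    files = {"out.dat": FileCounters()}
    files["out.dat"].record("open")
    files["out.dat"].record("write", 1024 * 1024)
    if rank == 0:
        files["in.cfg"] = FileCounters()
        files["in.cfg"].record("open")
        files["in.cfg"].record("read", 64)
    ranks.append(files)

job = reduce_job(ranks)
print(job["out.dat"].writes, job["out.dat"].bytes_written)  # 2 2097152
print(job["out.dat"].size_hist)  # [0, 0, 0, 0, 0, 2, 0]
```

The point of this design is that each summary stays the same size no matter how many operations a process issues, and because merging is plain addition, the per-rank summaries could be combined in any order (for example, in a tree across ranks) without changing the result.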

Published in

ACM Transactions on Storage, Volume 7, Issue 3
October 2011
120 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/2027066

Copyright © 2011 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 October 2011
• Received: 1 August 2011
• Accepted: 1 August 2011

Published in ACM Transactions on Storage (TOS), Volume 7, Issue 3

Qualifiers

• research-article
• Research
• Refereed

                      PDF Format

                      View or Download as a PDF file.

                      PDF

                      eReader

                      View online with eReader.

                      eReader
                      About Cookies On This Site

                      We use cookies to ensure that we give you the best experience on our website.

                      Learn more

                      Got it!