Abstract
Performance modeling for scientific applications is important for assessing potential application performance and for systems procurement in high-performance computing (HPC). Recent progress in communication tracing, with its lossless yet scalable trace collection, opens up novel opportunities for communication modeling. Estimating the impact of scaling on communication efficiency nevertheless remains nontrivial due to execution-time variations and exposure to hardware and software artifacts.
This work contributes a fundamentally novel modeling scheme: we synthetically generate an application's trace for large numbers of nodes by extrapolating from a set of smaller traces. We devise an innovative approach for topology extrapolation of single program, multiple data (SPMD) codes with stencil or mesh communication. Experimental results show that the extrapolated traces precisely reflect the communication behavior and the performance characteristics at the target scale for both strong- and weak-scaling applications. The extrapolated trace can subsequently be (a) replayed to assess communication requirements before porting an application, (b) transformed to autogenerate communication benchmarks for various target platforms, and (c) analyzed to detect communication inefficiencies and scalability limitations.
To the best of our knowledge, rapidly obtaining the communication behavior of a parallel application at arbitrary scale, with support for timed replay yet without actually executing the application at that scale, is unprecedented and has the potential to enable otherwise infeasible system simulation at the exascale level.
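The core idea of topology extrapolation can be illustrated with a small sketch. The code below is a hypothetical, simplified illustration (not ScalaExtrap's actual trace format or algorithm): given per-rank records from a 2D stencil code traced at a few small node counts, it recognizes that the non-unit neighbor offsets grow as the square root of the node count (a 2D process grid) and that the per-message volume is constant across the observed scales (weak scaling), and projects both to a larger, untraced node count. The names `small_traces` and `extrapolate` are invented for this example.

```python
import math

# Hypothetical per-rank records from traces of a 2D stencil code at small
# scales: node count -> (neighbor rank offsets, bytes per message).
small_traces = {
    16:  {"offsets": [-4, -1, 1, 4],   "msg_bytes": 8192},
    64:  {"offsets": [-8, -1, 1, 8],   "msg_bytes": 8192},
    256: {"offsets": [-16, -1, 1, 16], "msg_bytes": 8192},
}

def extrapolate(target_nodes):
    """Predict the communication topology and message volume at target_nodes.

    The non-unit offsets in the observed traces equal sqrt(P), so the code
    infers a 2D process grid and scales the stencil accordingly.
    """
    side = math.isqrt(target_nodes)          # inferred grid side length
    offsets = [-side, -1, 1, side]           # extrapolated stencil neighbors
    # Weak scaling: the per-message volume is constant across observed scales.
    volumes = {t["msg_bytes"] for t in small_traces.values()}
    assert len(volumes) == 1, "volume varies with scale; fit a model instead"
    return offsets, volumes.pop()

offsets, nbytes = extrapolate(4096)
print(offsets, nbytes)   # [-64, -1, 1, 64] 8192
```

A strong-scaling code would instead show per-message volumes shrinking with the node count, in which case the constant-volume check above would fail and a fitted model (e.g., volume proportional to 1/P) would be used in its place.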
ScalaExtrap: Trace-based communication extrapolation for SPMD programs. In PPoPP '11: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming.