ABSTRACT
Modern programs frequently employ sophisticated modular designs. As a result, performance problems cannot be identified from costs attributed to routines in isolation; understanding code performance requires information about a routine's calling context. Existing performance tools fall short in this respect. Prior strategies for attributing context-sensitive performance at the source level either compromise measurement accuracy, remain too close to the binary, or require custom compilers. To understand the performance of fully optimized modular code, we developed two novel binary analysis techniques: 1) on-the-fly analysis of optimized machine code to enable minimally intrusive and accurate attribution of costs to dynamic calling contexts; and 2) post-mortem analysis of optimized machine code and its debugging sections to recover its program structure and reconstruct a mapping back to its source code. By combining the recovered static program structure with dynamic calling context information, we can accurately attribute performance metrics to calling contexts, procedures, loops, and inlined instances of procedures. We demonstrate that the fusion of this information provides unique insight into the performance of complex modular codes. This work is implemented in the HPCToolkit performance tools (http://hpctoolkit.org).
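To illustrate why costs attributed to routines in isolation can mislead, the following minimal sketch builds a calling context tree from sampled call paths and compares it with a flat per-routine profile. It is not HPCToolkit's implementation: the `CCTNode`, `build_cct`, and `flat_profile` names and the sample data are hypothetical. It also assumes call paths are already available as tuples of routine names, whereas the tool recovers them by unwinding optimized machine code.

```python
from dataclasses import dataclass, field


@dataclass
class CCTNode:
    """One frame in a calling context tree (hypothetical structure)."""
    name: str
    exclusive: int = 0  # samples taken while this frame was innermost
    children: dict = field(default_factory=dict)

    def child(self, name):
        # Reuse the existing child so identical call paths share nodes.
        if name not in self.children:
            self.children[name] = CCTNode(name)
        return self.children[name]

    def inclusive(self):
        # Cost of this frame plus everything called beneath it.
        return self.exclusive + sum(c.inclusive() for c in self.children.values())


def build_cct(samples):
    """Insert each sampled call path (outermost frame first) into the tree."""
    root = CCTNode("<root>")
    for path in samples:
        node = root
        for frame in path:
            node = node.child(frame)
        node.exclusive += 1
    return root


def flat_profile(node, totals=None):
    """Collapse the tree into per-routine exclusive costs, losing context."""
    totals = {} if totals is None else totals
    if node.exclusive:
        totals[node.name] = totals.get(node.name, 0) + node.exclusive
    for c in node.children.values():
        flat_profile(c, totals)
    return totals


# Hypothetical samples, assumed for illustration only.
samples = [
    ("main", "solve", "memcpy"),
    ("main", "solve", "memcpy"),
    ("main", "solve", "memcpy"),
    ("main", "io", "memcpy"),
    ("main", "solve"),
]

cct = build_cct(samples)
flat = flat_profile(cct)
main = cct.children["main"]
print(flat)
print(main.children["solve"].inclusive(), main.children["io"].inclusive())
```

In this sketch the flat profile reports only that `memcpy` is expensive, while the tree shows that most of that cost is incurred when it is called from `solve` rather than from `io`.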
Binary analysis for measurement and attribution of program performance
PLDI '09