Abstract
As improvements in CPU speed continue to outpace improvements in memory access latency, detecting and removing memory access bottlenecks becomes increasingly important. In this work we present METRIC, a software framework for isolating and understanding such bottlenecks using partial access traces. METRIC extracts access traces from executing programs without special compiler or linker support. We make four primary contributions. First, we present a framework for extracting partial access traces based on dynamic binary rewriting of the executing application. Second, we introduce a novel algorithm for compressing these traces. The algorithm generates constant-space representations for regular accesses occurring in nested loop structures. Third, we use these traces for offline incremental memory hierarchy simulation. We extract symbolic information from the application executable and use it to generate detailed statistics correlated with the source code, including per-reference metrics, cache evictor information, and stream metrics. Finally, we demonstrate how this information can be used to isolate and understand memory access inefficiencies. This illustrates a potential advantage of METRIC over compile-time analysis for sample codes, particularly when interprocedural analysis is required.
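To make the idea of a constant-space representation for regular accesses concrete, the following is a minimal sketch and not METRIC's actual algorithm. It assumes a trace is simply a sequence of integer addresses and collapses each constant-stride run into one descriptor. The names `StrideRun` and `compress` are hypothetical, and the sketch handles only a single loop level, whereas the abstract describes handling nested loop structures.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class StrideRun:
    """Hypothetical descriptor: `count` addresses from `start`, `stride` bytes apart."""
    start: int
    stride: int
    count: int

    def expand(self) -> List[int]:
        # Regenerate the original addresses covered by this descriptor.
        return [self.start + i * self.stride for i in range(self.count)]


def compress(addresses: Iterable[int]) -> List[StrideRun]:
    """Greedily fold consecutive addresses with a constant stride into runs."""
    runs: List[StrideRun] = []
    for addr in addresses:
        if runs:
            last = runs[-1]
            if last.count == 1:
                # The second address of a run fixes its stride.
                last.stride = addr - last.start
                last.count = 2
                continue
            if addr == last.start + last.count * last.stride:
                # The address continues the current regular pattern.
                last.count += 1
                continue
        # Pattern broken (or first address): open a new run.
        runs.append(StrideRun(start=addr, stride=0, count=1))
    return runs


def decompress(runs: List[StrideRun]) -> List[int]:
    """Expand the descriptors back into the full address sequence."""
    out: List[int] = []
    for run in runs:
        out.extend(run.expand())
    return out


if __name__ == "__main__":
    # A loop reading 1000 consecutive 8-byte elements collapses to one descriptor.
    trace = [0x1000 + 8 * i for i in range(1000)]
    print(compress(trace))
```

In this sketch a loop of any length over a regularly strided array costs one descriptor, which is the sense in which the representation is constant-space. Irregular accesses degrade to short runs, so the scheme remains lossless but compresses less.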
Index Terms
METRIC: Memory tracing via dynamic binary rewriting to identify cache inefficiencies