METRIC: Memory tracing via dynamic binary rewriting to identify cache inefficiencies

Published: 01 April 2007

Abstract

With the diverging improvements in CPU speeds and memory access latencies, detecting and removing memory access bottlenecks becomes increasingly important. In this work we present METRIC, a software framework for isolating and understanding such bottlenecks using partial access traces. METRIC extracts access traces from executing programs without special compiler or linker support. We make four primary contributions. First, we present a framework for extracting partial access traces based on dynamic binary rewriting of the executing application. Second, we introduce a novel algorithm for compressing these traces. The algorithm generates constant space representations for regular accesses occurring in nested loop structures. Third, we use these traces for offline incremental memory hierarchy simulation. We extract symbolic information from the application executable and use this to generate detailed source-code correlated statistics including per-reference metrics, cache evictor information, and stream metrics. Finally, we demonstrate how this information can be used to isolate and understand memory access inefficiencies. This illustrates a potential advantage of METRIC over compile-time analysis for sample codes, particularly when interprocedural analysis is required.
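
To make the idea of constant-space representations for regular accesses concrete, here is a minimal sketch of stride-based trace compression. It is not METRIC's algorithm: the abstract does not give the details, and the names `StrideRun` and `compress` are hypothetical, introduced only for illustration. The sketch folds a flat address stream into (start, stride, count) descriptors, so a single loop with a regular access pattern takes one descriptor regardless of its trip count. Handling nested loops in constant space, as the abstract describes, would additionally require folding repeated descriptors into higher-level ones, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class StrideRun:
    """Hypothetical descriptor: `count` addresses, `stride` bytes apart, from `start`."""
    start: int
    stride: int
    count: int

    def expand(self) -> List[int]:
        # Regenerate the original addresses covered by this descriptor.
        return [self.start + i * self.stride for i in range(self.count)]


def compress(addresses: Iterable[int]) -> List[StrideRun]:
    """Greedily fold an address stream into constant-stride runs (illustrative only)."""
    runs: List[StrideRun] = []
    for addr in addresses:
        if runs:
            last = runs[-1]
            if last.count == 1:
                # The second address of a run fixes its stride.
                last.stride = addr - last.start
                last.count = 2
                continue
            if addr == last.start + last.count * last.stride:
                # The address continues the current run.
                last.count += 1
                continue
        # Otherwise start a new run with a single address.
        runs.append(StrideRun(addr, 0, 1))
    return runs


# A loop touching 1000 consecutive 8-byte elements collapses to one descriptor.
trace = [0x1000 + 8 * i for i in range(1000)]
runs = compress(trace)
```

Here `runs` holds one descriptor with start `0x1000`, stride 8, and count 1000, and calling `expand()` on it reproduces the original trace. An irregular stream simply produces more descriptors, so the representation degrades gracefully rather than failing.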

Reviews

Olivier Louis Marie Lecarme

A long time ago, computer hardware was designed to efficiently execute the code generated by compilers for higher-level programming languages. Now, computer hardware is designed to claim extraordinary performance, but the burden of reaching that performance falls on compiler writers, whose task is increasingly complicated, even impossible. One of the main difficulties they encounter is using the memory cache efficiently. Suboptimal cache use is a major performance bottleneck: when a cache miss occurs, data must be transferred between cache and main memory, and performance degrades dramatically. Optimizing compilers try to organize data and instructions to minimize cache misses. However, normal optimizations operate on the source text of the program and do not include library routines. Thus, the data and control analysis is very incomplete and cannot suggest really good optimizations.

The idea presented in this paper is to analyze the running program in its entirety, not just the source text of a part of it. This is done by dynamic binary rewriting of the object program to generate access traces. The program is instrumented and run for some time on actual data, and then it is restored to its normal state and run to completion. The instrumentation generates a huge amount of data, which needs to be highly compressed. The generated statistics expose the cache inefficiencies. The paper shows interesting examples of the information that can be collected in this way and of the transformations needed to improve the program. The problem, in my opinion, is that these ad hoc transformations must be done by hand, and no part of the process seems suitable for automation.

Online Computing Reviews Service
