Abstract
Cache miss stalls are a major source of performance bottlenecks on multicore processors. A Hardware Performance Monitor (HPM) in the processor is useful for locating cache misses, but it is rarely used in practice for various reasons. A simple approach that locates the sources of cache misses and applies runtime optimizations without relying on an HPM is therefore desirable. This paper shows that pointer dereferencing in hot loops is a major source of cache misses in Java programs. Based on this observation, we devised a new approach to identify the instructions and objects that cause frequent cache misses. Our heuristic technique effectively identifies the majority of the cache misses in typical Java programs by matching hot loops against simple idiomatic code patterns. On average, our technique selected only 2.8% of the load and store instructions generated by the JIT compiler, and these instructions accounted for 47% of the L1D cache misses and 49% of the L2 cache misses caused by the JIT-compiled code. To demonstrate the effectiveness of our technique in compiler optimizations, we prototyped object placement optimizations, which either align objects to cache lines or collocate paired objects in the same cache line to reduce cache misses. For comparison, we also implemented the same optimizations using the accurate information obtained from the HPM. Our results show that the heuristic approach was as effective as the HPM-based approach and achieved comparable performance improvements in the SPECjbb2005 and SPECpower_ssj2008 benchmark programs.
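To make the two ideas in the abstract concrete, the following is a minimal sketch and not the implementation described in the paper. The loop in `sumList` is a hypothetical example of pointer dereferencing in a hot loop; the abstract does not list the actual code patterns that are matched. The helper methods illustrate the address arithmetic behind aligning an object to a cache line and checking whether two objects share a line. The class name, the method names, and the 128-byte line size are all assumptions made for illustration.

```java
// Hypothetical sketch: names and the line size are assumptions, not taken from the paper.
final class CacheLineSketch {
    // Assumed cache line size in bytes; must be a power of two for alignUp.
    static final long LINE_SIZE = 128;

    // A simple linked node: each loop iteration loads n.value and n.next through a pointer.
    static final class Node {
        final int value;
        final Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    // Illustrative hot loop: every iteration dereferences a pointer loaded in the
    // previous iteration, so each node may sit in a different cache line.
    static int sumList(Node head) {
        int sum = 0;
        for (Node n = head; n != null; n = n.next) {
            sum += n.value;
        }
        return sum;
    }

    // Index of the cache line that contains the given byte address.
    static long lineOf(long addr, long lineSize) {
        return addr / lineSize;
    }

    // True if an object of 'size' bytes starting at 'addr' spans two or more lines.
    static boolean straddlesLine(long addr, long size, long lineSize) {
        return lineOf(addr, lineSize) != lineOf(addr + size - 1, lineSize);
    }

    // Round 'addr' up to the next line boundary (lineSize must be a power of two).
    static long alignUp(long addr, long lineSize) {
        return (addr + lineSize - 1) & ~(lineSize - 1);
    }

    // True if both objects lie entirely within one and the same cache line.
    static boolean sameLine(long addrA, long sizeA, long addrB, long sizeB, long lineSize) {
        return !straddlesLine(addrA, sizeA, lineSize)
            && !straddlesLine(addrB, sizeB, lineSize)
            && lineOf(addrA, lineSize) == lineOf(addrB, lineSize);
    }

    public static void main(String[] args) {
        Node list = new Node(1, new Node(2, new Node(3, null)));
        System.out.println("sum = " + sumList(list));
        System.out.println("straddles = " + straddlesLine(120, 16, LINE_SIZE));
        System.out.println("aligned = " + alignUp(120, LINE_SIZE));
        System.out.println("collocated = " + sameLine(128, 16, 144, 16, LINE_SIZE));
    }
}
```

In this sketch, a 16-byte object at address 120 straddles two 128-byte lines, while moving it to the aligned address 128 places it in one line; a second 16-byte object at address 144 would then share that line. Java code cannot normally observe or control object addresses, so a real placement optimization of this kind would live inside the JVM's allocator or garbage collector.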
Identifying the sources of cache misses in Java programs without relying on hardware counters
ISMM '12: Proceedings of the 2012 International Symposium on Memory Management