Abstract
The instruction cache is a popular optimization target because of its high impact on system performance and power and because of its predictable temporal and spatial locality. This article is an in-depth study of the interaction between code reordering (a long-known technique) and cache configuration (a relatively new technique). Experimental results show that coupling code reordering with cache configuration yields additional energy savings as high as 10--15% for several benchmarks, along with cache area reductions as high as 48%. To exploit these additional benefits, we architect and evaluate several design exploration heuristics for combining the two methods.
Index Terms
Combining code reordering and cache configuration