Abstract
With the explosive proliferation of embedded systems, especially through countless portable devices and wireless equipment, embedded systems have become indispensable to modern society and daily life. These devices are often battery driven, so low energy consumption in embedded processors is important and becomes more critical as system complexity grows. The on-chip instruction cache (I-cache) is usually the most energy-consuming component on the processor chip because of its large size and frequent accesses. To reduce this energy consumption, existing loop cache approaches use a tiny decoded cache to filter I-cache accesses and instruction decode activity for repeated loop iterations. However, such designs are effective only for small and simple loops, and are suitable only for DSP kernel-like applications. They are not effective for many embedded applications where complex loops are common. In this article, we propose a decoded loop instruction cache (DLIC) that is small, hence energy efficient, yet can capture most loops, including large nested ones with branch executions, so that a significant share of I-cache accesses and instruction decoding can be eliminated. Experiments on a set of embedded benchmarks show that our proposed DLIC scheme can reduce energy consumption by up to 87% compared to a normal cache-only design. On average, 66% of the energy spent on instruction fetching and decoding can be saved, at a performance overhead of only 1.4%.
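The filtering idea behind a conventional loop cache can be illustrated with a toy trace simulation. This is a minimal sketch under stated assumptions, not the DLIC design from the article: the function name `simulate_loop_cache`, the trigger rule (a backward jump whose span fits the capacity), and the use of instruction indices as addresses are all hypothetical choices made for illustration.

```python
def simulate_loop_cache(trace, capacity):
    """Toy model of a simple loop cache (an assumption, not the article's DLIC).

    trace    -- sequence of instruction addresses (instruction indices)
    capacity -- number of decoded instructions the loop cache can hold
    Returns (icache_fetches, loop_cache_hits).
    """
    cached = set()   # addresses whose decoded form is held in the loop cache
    region = None    # (start, end) of the loop currently being tracked
    icache_fetches = 0
    loop_cache_hits = 0
    prev = None

    for pc in trace:
        # A backward jump whose span fits the capacity starts (or continues)
        # tracking that loop; a different loop resets the cached contents.
        if prev is not None and pc < prev and prev - pc + 1 <= capacity:
            if region != (pc, prev):
                region = (pc, prev)
                cached = set()

        # Falling out of the tracked region ends tracking.
        if region is not None and not (region[0] <= pc <= region[1]):
            region = None
            cached = set()

        if region is not None and pc in cached:
            # Served from the loop cache: no I-cache access, no decode.
            loop_cache_hits += 1
        else:
            # Normal path: fetch from the I-cache and decode.
            icache_fetches += 1
            if region is not None:
                cached.add(pc)  # fill the loop cache during this iteration
        prev = pc

    return icache_fetches, loop_cache_hits


# A 4-instruction loop executed 10 times, followed by one exit instruction.
simple_loop = [0, 1, 2, 3] * 10 + [4]
print(simulate_loop_cache(simple_loop, capacity=8))  # (9, 32)
print(simulate_loop_cache(simple_loop, capacity=2))  # (41, 0)
```

In this toy model, the first iteration runs normally, the second fills the loop cache, and later iterations bypass the I-cache entirely. An inner backward jump re-targets the tracked region, so a nested loop keeps evicting its enclosing loop, which is the kind of limitation with complex loops that the abstract attributes to existing loop cache approaches.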
Index Terms
DLIC: Decoded loop instructions caching for energy-aware embedded processors