ABSTRACT
While hardware instruction caches are present in virtually all general-purpose and high-performance microprocessors today, many embedded processors use SRAM or scratchpad memories instead. These are simple array memory structures that are directly addressed and explicitly managed by software. Compared to hardware caches of the same data capacity, they are smaller, have shorter access times and consume less energy per access. Access times are also easier to predict with simple memories since there is no possibility of a "miss." On the other hand, they are more difficult for the programmer to use since they are not automatically managed.

In this paper, we present a software system that allows all or part of an SRAM or scratchpad memory to be automatically managed as a cache. This system provides the programming convenience of a cache for processors that lack dedicated caching hardware. It has been implemented for an actual processor and runs on real hardware. Our results show that a software-based instruction cache can be built that provides performance within 10% of a traditional hardware cache on many benchmarks while using a cheaper, simpler SRAM memory. On these same benchmarks, energy consumption is up to 3% lower than it would be using a hardware cache.
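The abstract does not describe how the system works internally, so the following is only a toy model of the general idea: a directly addressed memory whose contents are managed as a cache by software instead of by hardware. It assumes a direct-mapped organization with software-kept tags, which may differ from the design in the paper. The class name `SoftwareICache`, its parameters, and the access trace are all hypothetical and chosen for illustration.

```python
class SoftwareICache:
    """Toy model (assumption, not the paper's design): a scratchpad
    managed by software as a direct-mapped instruction cache."""

    def __init__(self, backing, num_lines=4, line_words=4):
        # backing models slow external memory; its length is assumed
        # to be a multiple of line_words.
        self.backing = backing
        self.num_lines = num_lines
        self.line_words = line_words
        # The scratchpad is a plain array with no hardware tags.
        self.scratchpad = [[0] * line_words for _ in range(num_lines)]
        # Tags are ordinary software state, one per scratchpad line.
        self.tags = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def fetch(self, addr):
        """Return the word at addr, copying its block in on a miss."""
        block = addr // self.line_words
        index = block % self.num_lines
        offset = addr % self.line_words
        if self.tags[index] != block:
            # Software miss handler: copy the block into the scratchpad
            # and record which block now occupies this line.
            self.misses += 1
            base = block * self.line_words
            self.scratchpad[index] = self.backing[base:base + self.line_words]
            self.tags[index] = block
        else:
            self.hits += 1
        return self.scratchpad[index][offset]


program = list(range(100, 132))          # 32 words of "instructions"
cache = SoftwareICache(program)
trace = [0, 1, 2, 3, 0, 1, 2, 3, 16, 0]  # addresses 0 and 16 share a line
values = [cache.fetch(a) for a in trace]
print(values, cache.hits, cache.misses)
```

In this trace, addresses 0 and 16 map to the same scratchpad line, so the last two fetches each evict the other's block. The tag check on every fetch is the software overhead that a hardware cache would hide.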
Software-based instruction caching for embedded processors
Proceedings of the 2006 ASPLOS Conference