Article in ASPLOS conference proceedings, DOI: 10.1145/1168857.1168894

Software-based instruction caching for embedded processors

Published: 20 October 2006

ABSTRACT

While hardware instruction caches are present in virtually all general-purpose and high-performance microprocessors today, many embedded processors use SRAM or scratchpad memories instead. These are simple array memory structures that are directly addressed and explicitly managed by software. Compared to hardware caches of the same data capacity, they are smaller, have shorter access times, and consume less energy per access. Their access times are also easier to predict, since there is no possibility of a "miss." On the other hand, they are harder for the programmer to use because they are not managed automatically.

In this paper, we present a software system that allows all or part of an SRAM or scratchpad memory to be managed automatically as a cache. This system provides the programming convenience of a cache for processors that lack dedicated caching hardware. It has been implemented for an actual processor and runs on real hardware. Our results show that a software-based instruction cache can be built that comes within 10% of the performance of a traditional hardware cache on many benchmarks while using a cheaper, simpler SRAM memory. On these same benchmarks, energy consumption is up to 3% lower than with a hardware cache.
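The abstract does not spell out the mechanism, so the following C sketch only illustrates the general idea of managing SRAM as a cache in software; it is not the system the paper presents. It assumes a direct-mapped organization: a software tag array records which code block each SRAM line holds, and on a miss the block is copied in from a simulated external memory. The block size, line count, and all names (sw_icache_fetch, run_two_block_loop, and so on) are hypothetical. The sketch only simulates the copying; a real instruction cache must also make the processor execute from the SRAM copy (for example by rewriting branch targets), which is left out here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes, chosen only for illustration. */
#define BLOCK_BYTES   64u    /* bytes per cached code block */
#define NUM_LINES     8u     /* SRAM lines given to the software cache */
#define BACKING_BYTES 4096u  /* simulated external code memory */

/* Simulated slow external memory that holds the whole program. */
static uint8_t backing_store[BACKING_BYTES];

/* Simulated on-chip SRAM (scratchpad), split into fixed-size lines. */
static uint8_t sram[NUM_LINES][BLOCK_BYTES];

/* Tag array kept in software: which block each SRAM line holds. */
static uint32_t tags[NUM_LINES];
static int valid[NUM_LINES];

static unsigned hits, misses;

/* Invalidate every line and clear the statistics. */
static void sw_icache_reset(void)
{
    memset(valid, 0, sizeof valid);
    hits = misses = 0;
}

/* Return a pointer to the SRAM copy of the instruction at external
 * address `addr`. On a miss, software copies the enclosing block in,
 * evicting whatever that line held (direct-mapped placement). */
static const uint8_t *sw_icache_fetch(uint32_t addr)
{
    assert(addr < BACKING_BYTES);
    uint32_t block  = addr / BLOCK_BYTES;
    uint32_t line   = block % NUM_LINES;
    uint32_t offset = addr % BLOCK_BYTES;

    if (valid[line] && tags[line] == block) {
        hits++;
    } else {
        memcpy(sram[line], &backing_store[block * BLOCK_BYTES], BLOCK_BYTES);
        tags[line]  = block;
        valid[line] = 1;
        misses++;
    }
    return &sram[line][offset];
}

/* Fetch every 4-byte instruction of a loop body that spans the first
 * two blocks, `iterations` times over. */
static void run_two_block_loop(int iterations)
{
    for (int it = 0; it < iterations; it++)
        for (uint32_t pc = 0; pc < 2 * BLOCK_BYTES; pc += 4)
            (void)sw_icache_fetch(pc);
}
```

Starting from a cold cache, run_two_block_loop(10) performs 320 fetches; only the first touch of each of the two blocks misses, giving 2 misses and 318 hits. The tag check that a hardware cache performs in parallel with the access is done here by extra instructions on every lookup, which is one plausible source of the performance gap between a software-managed and a hardware cache.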
