Abstract
The disparity in performance between processors and main memories has led computer architects to incorporate large cache hierarchies in modern computers. Because these cache hierarchies are designed to be general-purpose, they may not provide the best possible performance for a given application. In this paper, we discover a memory subsystem well suited to a given application and main memory: a combination of caches, scratchpads, and other components that together provide better performance. We draw motivation from the superoptimization of instruction sequences, which successfully finds unusually clever instruction sequences for programs. Targeting both ASIC and FPGA devices, we show that it is possible to discover unusual memory subsystems that provide performance improvements over a typical memory subsystem.
Superoptimization of memory subsystems
LCTES '14: Proceedings of the 2014 SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems