Abstract
Application-specific system-on-chip platforms create the opportunity to customize the cache configuration for optimal performance with minimal chip area. Simulation, in particular trace-driven simulation, is widely used to estimate cache hit rates. However, simulation is too slow for design space exploration, especially when there are hundreds of design points and the traces are huge. In this article, we propose a novel analytical approach for design space exploration of instruction caches. Given the program control flow graph (CFG) annotated only with basic block and control flow edge execution counts, we first model the cache states at each point of the CFG in a probabilistic manner. Then, we exploit the structural similarities among related cache configurations to estimate the cache hit rates for multiple cache configurations in one pass. Experimental results indicate that our analysis is 28--2,500 times faster than the fastest known cache simulator while maintaining high accuracy (0.2% average error) in estimating cache hit rates for a large set of popular benchmarks. Moreover, compared to a state-of-the-art cache design space exploration technique, our approach achieves a speedup of 304--8,086 times and saves up to 62% (average 7%) energy for the evaluated benchmarks.
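The two inputs the abstract mentions, profile counts on the CFG and the relationship between related cache configurations, can be illustrated with a minimal sketch. It is not the article's model. It assumes that an edge probability can be approximated as the edge count divided by the execution count of the source block, and that sets are selected by plain modulo indexing with a power-of-two number of sets. The CFG, the block numbers, and the names `edge_probabilities`, `set_index`, and `same_set` are hypothetical.

```python
from fractions import Fraction


def edge_probabilities(block_counts, edge_counts):
    """Approximate P(u -> v) as count(u -> v) / count(u).

    Assumption for illustration only: the article's probabilistic
    cache-state model is not reproduced here.
    """
    probs = {}
    for (src, dst), taken in edge_counts.items():
        if block_counts[src] == 0:
            continue  # block never executed, no probability to assign
        probs[(src, dst)] = Fraction(taken, block_counts[src])
    return probs


def set_index(mem_block, num_sets):
    """Cache set of a memory block under modulo indexing (assumed)."""
    return mem_block % num_sets


def same_set(a, b, num_sets):
    """True if memory blocks a and b compete for the same cache set."""
    return set_index(a, num_sets) == set_index(b, num_sets)


# Hypothetical profile of a loop: entry -> head, head -> body | exit, body -> head.
block_counts = {"entry": 1, "head": 11, "body": 10, "exit": 1}
edge_counts = {
    ("entry", "head"): 1,
    ("head", "body"): 10,
    ("head", "exit"): 1,
    ("body", "head"): 10,
}
probs = edge_probabilities(block_counts, edge_counts)

# Hypothetical memory block numbers, checked against two related configurations.
conflicts = {sets: same_set(64, 72, sets) for sets in (8, 16)}
```

With power-of-two set counts, two blocks that share a set in the larger configuration also share a set in the smaller one. Regularities of this kind are one reason several configurations can be evaluated together rather than one at a time.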
An analytical approach for fast and accurate design space exploration of instruction caches