Abstract
Poor placement of data blocks in memory may negatively impact application performance because of an increase in the cache conflict miss rate [18]. For dynamically allocated structures this placement is typically determined by the memory allocator. Cache index-oblivious allocators may inadvertently place blocks on a restricted fraction of the available cache indexes, artificially and needlessly increasing the conflict miss rate. While some allocators are less vulnerable to this phenomena, no general-purpose malloc allocator is index-aware and methodologically addresses this concern. We demonstrate that many existing state-of-the-art allocators are index-oblivious, admitting performance pathologies for certain block sizes. We show that a simple adjustment within the allocator to control the spacing of blocks can provide better index coverage, which in turn reduces the superfluous conflict miss rate in various applications, improving performance with no observed negative consequences. The result is an index-aware allocator. Our technique is general and can easily be applied to most memory allocators and to various processor architectures.
Furthermore, we can reduce inter-thread and inter-process conflict misses for processors where threads concurrently share the level-1 cache such as the Sun UltraSPARC-T2™ and Intel "Nehalem" by coloring the placement of blocks so that allocations for different threads and processes start on different cache indexes.
- The Solaris Schedctl Facility. US Patent #5937187.Google Scholar
- hoard.org, 2010 (accessed 2010-1-14). hoard.org.Google Scholar
- Amino-CBBS, 2010 (accessed 2010-1-5). amino-cbbs.sourceforge.net.Google Scholar
- BioPerf, 2010 (accessed 2010-5-19). http://www.bioperf.org/.Google Scholar
- dlmalloc, 2010 (accessed 2010-7-2). http://gee.cs.oswego.edu/dl/html/malloc.html.Google Scholar
- tcmalloc: version 1.6, 2010 (accessed November 9, 2010). http://code.google.com/p/google-perftools/.Google Scholar
- D. Bader, Y. Li, T. Li, and V. Sachdeva. Bioperf: a benchmark suite to evaluate high-performance computer architecture on bioinformatics applications. IEEE Workload Characterization Symposium, 0:163--173, 2005.Google Scholar
Cross Ref
- E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson. Hoard: a scalable memory allocator for multithreaded applications. In Proc. ninth international conference on Architectural Support for Programming Languages and Operating Systems, pages 117--128, New York, NY, USA, 2000. ACM. Google Scholar
Digital Library
- J. Bonwick. The slab allocator: an object-caching kernel memory allocator. In USTC'94: Proceedings of the USENIX Summer 1994 Technical Conference on USENIX Summer 1994 Technical Conference, pages 6--6, Berkeley, CA, USA, 1994. USENIX Association. Google Scholar
Digital Library
- B. Calder, C. Krintz, S. John, and T. Austin. Cache-conscious data placement. SIGPLAN Not., 33(11):139--149, 1998. Google Scholar
Digital Library
- T. M. Chilimbi, M. D. Hill, and J. R. Larus. Making pointer-based data structures cache conscious. Computer, 33(12):67--74, 2000. Google Scholar
Digital Library
- Y. Chou, B. Fahs, and S. Abraham. Microarchitecture optimizations for exploiting memory-level parallelism. SIGARCH Comput. Archit. News, 32(2):76, 2004. Google Scholar
Digital Library
- D. Dice. Dave Dice's blog, 2010 (accessed 2010-7-21). thttp://blogs.sun.com/dave/entry/solaris_scheduling_and_cpuids.Google Scholar
- D. Dice and A. Garthwaite. Mostly lock-free malloc. In Proc. 3rd International Symposium on Memory Management, pages 163--174, New York, NY, USA, 2002. ACM. Google Scholar
Digital Library
- D. Dice, Y. Lev, V. J. Marathe, M. Moir, D. Nussbaum, and M. Oleszewski. Simplifying concurrent algorithms by exploiting hardware transactional memory. In SPAA '10: Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures, pages 325--334, New York, NY, USA, 2010. ACM. Google Scholar
Digital Library
- J. Evans. A scalable concurrent malloc(3) implementation for freebsd, 2006.Google Scholar
- A. González, M. Valero, N. Topham, and J. M. Parcerisa. Eliminating cache conflict misses through xor-based placement functions. In ICS '97: Proceedings of the 11th international conference on Supercomputing, pages 76--83, New York, NY, USA, 1997. ACM. Google Scholar
Digital Library
- M. D. Hill and A. J. Smith. Evaluating associativity in cpu caches. IEEE Trans. Comput., 38(12):1612--1630, 1989. Google Scholar
Digital Library
- M. S. Johnstone and P. R. Wilson. The memory fragmentation problem: Solved? In ISMM, pages 26--36, 1998. Google Scholar
Digital Library
- N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. SIGARCH Comput. Archit. News, 18(3a):364--373, 1990. Google Scholar
Digital Library
- R. E. Kessler and M. D. Hill. Page placement algorithms for large real-indexed caches. ACM Trans. Comput. Syst., 10(4):338--359, 1992. Google Scholar
Digital Library
- T. Kistler and M. Franz. Automated data-member layout of heap objects to improve memory-hierarchy performance. ACM Trans. Program. Lang. Syst., 22(3):490--505, 2000. Google Scholar
Digital Library
- K. C. Knowlton. A fast storage allocator. Commun. ACM, 8(10):623--624, 1965. Google Scholar
Digital Library
- C. Lattner and V. Adve. Automatic pool allocation: improving performance by controlling data structure layout in the heap. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, pages 129--142, New York, NY, USA, 2005. ACM. Google Scholar
Digital Library
- A. R. Lebeck and D. A. Wood. Cache profiling and the spec benchmarks: A case study. Computer, 27(10):15--26, 1994. Google Scholar
Digital Library
- V. B. Lvin, G. Novark, E. D. Berger, and B. G. Zorn. Archipelago: trading address space for reliability and security. SIGOPS Oper. Syst. Rev., 42(2):115--124, 2008. Google Scholar
Digital Library
- M. M. Michael. Scalable Lock-free Dynamic Memory Allocation. In Proc. ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 35--46, 2004. Google Scholar
Digital Library
- R. Min and Y. Hu. Improving performance of large physically indexed caches by decoupling memory addresses from cache addresses. IEEE Trans. Comput., 50(11):1191--1201, 2001. Google Scholar
Digital Library
- E. Petrank and D. Rawitz. The hardness of cache conscious data placement. In POPL '02: Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 101--112, New York, NY, USA, 2002. ACM. Google Scholar
Digital Library
- T. Romer, D. Lee, B. N. Bershad, and J. B. Chen. Dynamic page mapping policies for cache conflict resolution on standard hardware. In In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 255--266, 1994.Google Scholar
- D. Sanchez and K. Christos. The zcache: Decoupling ways and associativity. In MICRO 43: Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2010. Google Scholar
Digital Library
- A. Seznec. A case for two-way skewed-associative caches. In ISCA '93: Proceedings of the 20th annual international symposium on Computer architecture, pages 169--178, New York, NY, USA, 1993. ACM. Google Scholar
Digital Library
- D. D. Sleator and R. E. Tarjan. Self-adjusting binary search trees. J. ACM, 32(3):652--686, 1985. Google Scholar
Digital Library
- V. Čakarević, P. Radojković, J. Verdú, A. Pajuelo, F. J. Cazorla, M. Nemirovsky, and M. Valero. Characterizing the resource-sharing levels in the ultrasparc t2 processor. In MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 481--492, New York, NY, USA, 2009. ACM. Google Scholar
Digital Library
- Z. Wang and R. B. Lee. A novel cache architecture with enhanced performance and security. In MICRO 41: Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture, pages 83--93, Washington, DC, USA, 2008. IEEE Computer Society. Google Scholar
Digital Library
- C. B. Zilles. Benchmark health considered harmful. SIGARCH Comput. Archit. News, 29(3):4--5, 2001. Google Scholar
Digital Library
- M. Zukowski, S. Héman, and P. Boncz. Architecture-conscious hashing. In DaMoN '06: Proceedings of the 2nd international workshop on Data management on new hardware, page 6, New York, NY, USA, 2006. ACM. Google Scholar
Digital Library
Index Terms
Cache index-aware memory allocation
Recommendations
Cache index-aware memory allocation
ISMM '11: Proceedings of the international symposium on Memory managementPoor placement of data blocks in memory may negatively impact application performance because of an increase in the cache conflict miss rate [18]. For dynamically allocated structures this placement is typically determined by the memory allocator. Cache ...
Reducing traffic generated by conflict misses in caches
CF '04: Proceedings of the 1st conference on Computing frontiersOff-chip memory accesses are a major source of power consumption in embedded processors. In order to reduce the amount of traffic between the processor and the off-chip memory as well as to hide the memory latency, nearly all embedded processors have a ...
Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache
MICRO-47: Proceedings of the 47th Annual IEEE/ACM International Symposium on MicroarchitectureRecent research advocates large die-stacked DRAM caches in many core servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates and the efficient use ...







Comments