Abstract
We propose a novel kernel-level memory allocator called M3 (M-cube, Multi-core Multi-bank Memory allocator) that has two features. First, it introduces the notion of a memory container, defined as a unit of memory comprising the minimum number of page frames that cover all the banks of the memory organization. M3 assigns each container exclusively to a core so that each core achieves as much bank parallelism as possible. Second, it orchestrates page frame allocation so that the pages a thread accesses are dispersed randomly across multiple banks, which randomizes each thread's access pattern. The development of M3 is based on a tool that we developed to understand the architectural characteristics of the underlying memory organization. Using an extension of this tool, we observe that an application that accesses pages in a random manner outperforms the same application accessing pages in a regular pattern, such as sequential or same-ordered accesses. This is because randomized accesses reduce inter-thread access interference on the row buffers in memory banks. We implement M3 in Linux kernel version 2.6.32 on an Intel Xeon system with 16 cores and 32GB of DRAM. Performance evaluation with various workloads shows that M3 improves the overall performance of memory-intensive benchmarks by up to 85%, with an average of about 40%.
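The two features described above can be illustrated with a toy model. The following is a minimal sketch, not M3's kernel implementation: the bank count, the frame-to-bank mapping in `bank_of`, the `Container` class, and `build_containers` are all hypothetical names and values chosen for illustration. It assumes that a fixed number of consecutive page frames map to the same bank, so the smallest aligned run of frames that covers every bank serves as one container, and it hands frames out of each container in a shuffled order.

```python
import random

NUM_BANKS = 8          # assumed bank count (illustrative, not from the paper)
FRAMES_PER_STRIPE = 4  # assumed: this many consecutive frames share one bank
# Smallest aligned run of frames that touches every bank under bank_of().
CONTAINER_FRAMES = NUM_BANKS * FRAMES_PER_STRIPE


def bank_of(frame):
    """Hypothetical frame-to-bank mapping; real hardware mappings differ."""
    return (frame // FRAMES_PER_STRIPE) % NUM_BANKS


class Container:
    """Toy memory container owned by exactly one core."""

    def __init__(self, index, owner, rng):
        first = index * CONTAINER_FRAMES
        self.owner = owner
        self.frames = list(range(first, first + CONTAINER_FRAMES))
        # Shuffle a copy once so frames are handed out in random bank order.
        self.free = self.frames[:]
        rng.shuffle(self.free)

    def alloc(self):
        """Return a free frame, or None when the container is exhausted."""
        return self.free.pop() if self.free else None


def build_containers(num_cores, seed=0):
    """Give each core its own container (one per core in this sketch)."""
    rng = random.Random(seed)
    return [Container(i, owner=i, rng=rng) for i in range(num_cores)]


if __name__ == "__main__":
    containers = build_containers(num_cores=4)
    frames = [containers[0].alloc() for _ in range(8)]
    # Banks touched by core 0's first eight allocations, in allocation order.
    print([bank_of(f) for f in frames])
```

In this sketch the exclusive ownership keeps one core's frames out of another core's container, while the shuffle stands in for the randomized dispersal of pages across banks; how M3 actually sizes containers and randomizes allocation inside the kernel is not specified in the abstract.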
Regularities considered harmful: forcing randomness to memory accesses to reduce row buffer conflicts for multi-core, multi-bank systems
ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems