Abstract
Multiprocessors based on processors with multiple cores usually include a non-uniform memory architecture (NUMA); even current 2-processor systems with 8 cores exhibit non-uniform memory access times. As the cores of a processor share a common cache, the issues of memory management and process mapping must be revisited. We find that optimizing only for data locality can counteract the benefits of cache contention avoidance and vice versa. Therefore, system software must take both data locality and cache contention into account to achieve good performance, and memory management cannot be decoupled from process scheduling. We present a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem. We describe two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and its extension, N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention. N-MASS is fine-tuned to support memory management on NUMA-multicores and improves performance up to 32%, and 7% on average, over the default setup in current Linux implementations.
- M. Awasthi, D. W. Nellans, K. Sudan, R. Balasubramonian, and A. Davis. Handling the problems and opportunities posed by multiple on-chip memory controllers. In PACT'10. Google Scholar
Digital Library
- M. Banikazemi, D. Poff, and B. Abali. PAM: a novel performance/power aware meta-scheduler for multi-core systems. In SC'08. Google Scholar
Digital Library
- S. Blagodurov, S. Zhuravlev, and A. Fedorova. Contention-aware scheduling on multicore systems. ACM Trans. Comput. Syst., 2010. Google Scholar
Digital Library
- D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In HPCA'05. Google Scholar
Digital Library
- S. Eyerman and L. Eeckhout. System-level performance metrics for multiprogram workloads. IEEE Micro, 2008. Google Scholar
Digital Library
- A. Fedorova, M. Seltzer, C. Small, and D. Nussbaum. Performance of multithreaded chip multiprocessors and implications for operating system design. In ATEC'05. Google Scholar
Digital Library
- A. Fedorova, M. Seltzer, and M. D. Smith. Improving performance isolation on chip multiprocessors via an operating system scheduler. In PACT'07. Google Scholar
Digital Library
- D. Hackenberg, D. Molka, and W. E. Nagel. Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems. In MICRO 42, 2009. Google Scholar
Digital Library
- A. Herdrich, R. Illikkal, R. Iyer, D. Newell, V. Chadha, and J. Moses. Rate-based QoS techniques for cache/memory in CMP platforms. In ICS'09. Google Scholar
Digital Library
- Intel Corporation. Intel® 64 and IA-32 Architectures Optimization Reference Manual, January 2011.Google Scholar
- Y. Jiang, X. Shen, J. Chen, and R. Tripathi. Analysis and approximation of optimal co-scheduling on chip multiprocessors. In PACT'08. Google Scholar
Digital Library
- R. Knauerhase, P. Brett, B. Hohlt, T. Li, and S. Hahn. Using OS observations to improve performance in multicore systems. IEEE Micro, 2008. Google Scholar
Digital Library
- D. Koufaty, D. Reddy, and S. Hahn. Bias scheduling in heterogeneous multi-core architectures. In EuroSys'10. Google Scholar
Digital Library
- H. Li, H. L. Sudarsan, M. Stumm, and K. C. Sevcik. Locality and loop scheduling on NUMA multiprocessors. In ICPP'93. Google Scholar
Digital Library
- T. Li, D. Baumberger, D. A. Koufaty, and S. Hahn. Efficient operating system scheduling for performance-asymmetric multi-core architectures. In SC'07. Google Scholar
Digital Library
- Z. Majo and T. R. Gross. Memory system performance in a NUMA multicore multiprocessor. In SYSTOR'11. Google Scholar
Digital Library
- J. Marathe and F. Mueller. Hardware profile-guided automatic page placement for ccNUMA systems. In PPoPP'06. Google Scholar
Digital Library
- J. Mars, L. Tang, and M. L. Soffa. Directly characterizing cross core interference through contention synthesis. In HiPEAC'11. Google Scholar
Digital Library
- J. Mars, N. Vachharajani, M. L. Soffa, and R. Hundt. Contention aware execution: Online contention detection and response. In CGO'10. Google Scholar
Digital Library
- D. Molka, D. Hackenberg, R. Schöne, and M. S. Müller. Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system. In PACT'09. Google Scholar
Digital Library
- T. Mytkowicz, A. Diwan, M. Hauswirth, and P. F. Sweeney. Producing wrong data without doing anything obviously wrong! In ASPLOS'09. Google Scholar
Digital Library
- T. Ogasawara. NUMA-aware memory manager with dominant-thread-based copying GC. In OOPSLA'09. Google Scholar
Digital Library
- M. K. Qureshi and Y. N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO 39, 2006. Google Scholar
Digital Library
- J. C. Saez, M. Prieto, A. Fedorova, and S. Blagodurov. A comprehensive scheduler for asymmetric multicore processors. In EuroSys'10. Google Scholar
Digital Library
- A. Sandberg, D. Eklöv, and E. Hagersten. Reducing cache pollution through detection and elimination of non-temporal memory accesses. In SC'10. Google Scholar
Digital Library
- D. K. Tam, R. Azimi, L. B. Soares, and M. Stumm. RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations. In ASPLOS '09. Google Scholar
Digital Library
- M. M. Tikir and J. K. Hollingsworth. Hardware monitors for dynamic page migration. Journal of Parallel and Distributed Computing, 2008. Google Scholar
Digital Library
- B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In ASPLOS'96. Google Scholar
Digital Library
- S. Zhuralev, S. Blagodurov, and A. Fedorova. Addressing shared resource contention in multicore processors via scheduling. In ASPLOS'10. Google Scholar
Digital Library
Index Terms
Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead
Recommendations
Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead
ISMM '11: Proceedings of the international symposium on Memory managementMultiprocessors based on processors with multiple cores usually include a non-uniform memory architecture (NUMA); even current 2-processor systems with 8 cores exhibit non-uniform memory access times. As the cores of a processor share a common cache, ...
Reducing last level cache pollution in NUMA multicore systems for improving cache performance
ICCSA'12: Proceedings of the 12th international conference on Computational Science and Its Applications - Volume Part IIINon-uniform memory architecture (NUMA) system has numerous nodes with shared last level cache (LLC). Their shared LLC has brought many benefits in the cache utilization. However, LLC can be seriously polluted by tasks that cause huge I/O traffic for a ...
Memory system performance of UNIX on CC-NUMA multiprocessors
This study characterizes the performance of a variant of UNIX SVR4 on a large shared-memory multiprocessor and analyzes the effects of possible OS and architectural changes. We use a nonintrusive cache miss monitor to trace the execution of an OS-...







Comments