Abstract
NUMA systems are characterized by Non-Uniform Memory Access times, where accessing data in a remote node takes longer than a local access. NUMA hardware has been built since the late 1980s, and the operating systems designed for it were optimized for access locality: they co-located memory pages with the threads that accessed them, so as to avoid the cost of remote accesses. In contrast to older systems, modern NUMA hardware has much smaller remote wire delays, so remote access costs per se are not the main performance concern, as we discovered in this work. Instead, congestion on memory controllers and interconnects, caused by memory traffic from data-intensive applications, hurts performance far more. Memory placement algorithms must therefore be redesigned to target traffic congestion, which requires an arsenal of techniques that go beyond optimizing locality. In this paper we describe Carrefour, an algorithm that addresses this goal. We implemented Carrefour in Linux and obtained performance improvements of up to 3.6× relative to the default kernel, as well as significant improvements compared to NUMA-aware patchsets available for Linux. Carrefour never hurts performance by more than 4% when memory placement cannot be improved. We present the design of Carrefour, describe the challenges of implementing it on modern hardware, and draw insights about hardware support that would help optimize system software on future NUMA systems.
Traffic management: a holistic approach to memory placement on NUMA systems
ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems