research-article

Traffic management: a holistic approach to memory placement on NUMA systems

Published: 16 March 2013

Abstract

NUMA systems are characterized by Non-Uniform Memory Access times, where accessing data in a remote node takes longer than a local access. NUMA hardware has been built since the late 1980s, and the operating systems designed for it were optimized for access locality. They co-located memory pages with the threads that accessed them, so as to avoid the cost of remote accesses. Contrary to older systems, modern NUMA hardware has much smaller remote wire delays, and so remote access costs per se are not the main concern for performance, as we discovered in this work. Instead, congestion on memory controllers and interconnects, caused by memory traffic from data-intensive applications, hurts performance a lot more. Because of that, memory placement algorithms must be redesigned to target traffic congestion. This requires an arsenal of techniques that go beyond optimizing locality. In this paper we describe Carrefour, an algorithm that addresses this goal. We implemented Carrefour in Linux and obtained performance improvements of up to 3.6× relative to the default kernel, as well as significant improvements compared to NUMA-aware patchsets available for Linux. Carrefour never hurts performance by more than 4% when memory placement cannot be improved. We present the design of Carrefour, the challenges of implementing it on modern hardware, and draw insights about hardware support that would help optimize system software on future NUMA systems.

  • Published in

    ACM SIGPLAN Notices, Volume 48, Issue 4
    ASPLOS '13
    April 2013
    540 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/2499368

    ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
    March 2013
    574 pages
    ISBN: 9781450318709
    DOI: 10.1145/2451116

    Copyright © 2013 ACM

    Publisher

    Association for Computing Machinery
    New York, NY, United States
