Abstract
Many recent multiprocessor systems are realized with a non-uniform memory architecture (NUMA), in which accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today's programming languages and libraries have no explicit support for NUMA systems, (2) NUMA optimizations are not portable, and (3) optimizations are not composable (i.e., they can become ineffective or even worsen performance in environments that support composable parallel software). This paper presents TBB-NUMA, a parallel programming library based on Intel Threading Building Blocks (TBB) that supports portable and composable NUMA-aware programming. TBB-NUMA provides a model of task affinity that captures a programmer's insights on mapping tasks to resources. NUMA-awareness affects all layers of the library (i.e., resource management, task scheduling, and high-level parallel algorithm templates) and requires close coupling among these layers. Optimizations implemented with TBB-NUMA (for a set of standard benchmark programs) result in up to 44% performance improvement over standard TBB. More importantly, optimized programs are portable across different NUMA architectures and preserve data locality even when composed with other parallel computations.
A library for portable and composable data locality optimizations for NUMA systems. In PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.