Abstract
Dynamic task parallelism is a popular programming model on shared-memory systems. Compared to data-parallel, loop-based concurrency, it promises enhanced scalability, load balancing and locality. These promises, however, are undermined by non-uniform memory access (NUMA) systems. We show that it is possible to preserve the uniform hardware abstraction of contemporary task-parallel programming models, for both computing and memory resources, while achieving near-optimal data locality. Our run-time algorithms for NUMA-aware task and data placement are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences and reuse. This information is readily available in the run-time systems of modern task-parallel programming frameworks, and from the operating system regarding the placement of previously allocated memory. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability through the elimination of false dependences and enables fine-grained dynamic control over the placement of application data. We demonstrate that the benefits of dynamically managing data placement outweigh the privatization cost, even when comparing with target-specific optimizations through static, NUMA-aware data interleaving. Our implementation and the experimental evaluation on a set of high-performance benchmarks executing on a 192-core system with 24 NUMA nodes show that the fraction of local memory accesses can be increased to more than 99%, resulting in a speedup of up to 5× compared to a NUMA-aware hierarchical work-stealing baseline.
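The abstract does not spell out the placement algorithm, but the core idea it describes, using inter-task data dependences to place a task near its inputs, can be sketched as follows. This is a minimal, hypothetical model (function and parameter names are illustrative, not from the paper): given the NUMA node holding each input dependence of a task, schedule the task on the node that owns the largest share of its input bytes, so that most of its memory accesses become node-local.

```python
# Hypothetical sketch of dependence-aware task placement on a NUMA system.
# A task's inputs are described as (numa_node, size_in_bytes) pairs; the
# heuristic picks the node holding the most input data. This models the
# abstract's idea only; the paper's actual runtime algorithms are richer
# (they also track reuse and adapt placement dynamically).

from collections import defaultdict

def choose_node(task_inputs):
    """task_inputs: iterable of (numa_node, size_in_bytes) pairs, one per
    input dependence. Returns the node id owning the most input bytes."""
    bytes_per_node = defaultdict(int)
    for node, size in task_inputs:
        bytes_per_node[node] += size
    # Sort keys first so ties break deterministically toward the lowest id.
    return max(sorted(bytes_per_node), key=lambda n: bytes_per_node[n])

# A task reading 64 KiB produced on node 2 and 16 KiB produced on node 0
# is placed on node 2, making 80% of its input accesses local.
print(choose_node([(2, 65536), (0, 16384)]))  # -> 2
```

In a real runtime this decision would feed a NUMA-aware work-stealing scheduler (e.g. pinning worker threads per node and pushing the task to the chosen node's queue), and newly privatized output buffers would be allocated on that same node.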
NUMA-aware scheduling and memory allocation for data-flow task-parallel applications. In PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.