Abstract
Multicore multiprocessors use a Non-Uniform Memory Architecture (NUMA) to improve their scalability. However, NUMA introduces performance penalties due to remote memory accesses. Unless data layout and thread-to-core mapping are managed efficiently, scientific applications may lose performance even if they are optimized for NUMA. In this paper, we present algorithms and a runtime system that optimize the execution of OpenMP applications on NUMA architectures. Using information collected from hardware counters, the runtime system directs thread placement and reduces performance penalties by minimizing the critical path of OpenMP parallel regions. The runtime system uses a scalable algorithm that derives placement decisions with negligible overhead. We evaluate our algorithms and the runtime system with four NPB applications implemented in OpenMP. On average, the algorithms achieve between 8.13% and 25.68% performance improvement compared to the default Linux thread placement scheme. The algorithms miss the optimal thread placement in only 8.9% of the cases.
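To make the objective concrete, the following is a minimal sketch of critical-path-based placement: the critical path of a parallel region is taken to be the time of its slowest thread, and a placement is chosen to minimize it. This is an illustration only, not the algorithm from the paper. The names `critical_path` and `best_placement`, the cost table, and the exhaustive search are all assumptions; the search is exponential in the number of threads, whereas the abstract describes a scalable algorithm, and real cost estimates would presumably be derived from hardware counters.

```python
from itertools import product


def critical_path(placement, cost):
    """Time of the slowest thread when thread t runs on node placement[t]."""
    return max(cost[t][node] for t, node in enumerate(placement))


def best_placement(cost, cores_per_node):
    """Exhaustively search thread-to-node placements (illustrative only).

    cost[t][n] is a hypothetical estimate of thread t's execution time
    when it runs on NUMA node n. Returns (critical path length, placement),
    or None if no placement fits within the per-node core limit.
    """
    num_threads = len(cost)
    num_nodes = len(cost[0])
    best = None
    for placement in product(range(num_nodes), repeat=num_threads):
        # Skip placements that oversubscribe a node's cores.
        if any(placement.count(n) > cores_per_node for n in range(num_nodes)):
            continue
        length = critical_path(placement, cost)
        if best is None or length < best[0]:
            best = (length, placement)
    return best


# Made-up estimates: threads 0 and 1 mostly touch memory on node 0,
# threads 2 and 3 mostly touch memory on node 1.
cost = [
    [1.0, 1.6],
    [1.2, 1.9],
    [2.0, 1.1],
    [1.8, 1.0],
]
length, placement = best_placement(cost, cores_per_node=2)
print(length, placement)
```

With these made-up numbers, the search keeps each thread on the node that holds most of its data, because any other feasible placement makes some thread slower than 1.2 time units and so lengthens the critical path.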
Index Terms
Critical path-based thread placement for NUMA systems