Critical path-based thread placement for NUMA systems

Published: 08 October 2012

Abstract

Multicore multiprocessors use a Non-Uniform Memory Architecture (NUMA) to improve their scalability. However, NUMA introduces performance penalties due to remote memory accesses. Without efficient management of data layout and thread-to-core mapping, scientific applications may suffer performance loss, even if they are optimized for NUMA. In this paper, we present algorithms and a runtime system that optimize the execution of OpenMP applications on NUMA architectures. By collecting information from hardware counters, the runtime system directs thread placement and reduces performance penalties by minimizing the critical path of OpenMP parallel regions. The runtime system uses a scalable algorithm that derives placement decisions with negligible overhead. We evaluate our algorithms and the runtime system with four NPB applications implemented in OpenMP. On average, the algorithms achieve between 8.13% and 25.68% performance improvement compared to the default Linux thread placement scheme, and miss the optimal thread placement in only 8.9% of the cases.

