Research article

A library for portable and composable data locality optimizations for NUMA systems

Published: 24 January 2015

Abstract

Many recent multiprocessor systems are realized with a non-uniform memory architecture (NUMA), and accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today's programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations are not portable, and (3) optimizations are not composable (i.e., they can become ineffective or worsen performance in environments that support composable parallel software). This paper presents TBB-NUMA, a parallel programming library based on Intel Threading Building Blocks (TBB) that supports portable and composable NUMA-aware programming. TBB-NUMA provides a model of task affinity that captures a programmer's insights on mapping tasks to resources. NUMA-awareness affects all layers of the library (i.e., resource management, task scheduling, and high-level parallel algorithm templates) and requires close coupling between all these layers. Optimizations implemented with TBB-NUMA (for a set of standard benchmark programs) result in up to 44% performance improvement over standard TBB but, more importantly, optimized programs are portable across different NUMA architectures and preserve data locality even when composed with other parallel computations.



Published in

ACM SIGPLAN Notices, Volume 50, Issue 8 (PPoPP '15), August 2015, 290 pages.
ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/2858788. Editor: Andy Gill.

PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2015, 290 pages.
ISBN: 978-1-4503-3205-7. DOI: 10.1145/2688500.

Copyright © 2015 ACM.

Publisher: Association for Computing Machinery, New York, NY, United States.
