The Myrmics Memory Allocator: Hierarchical, Message-Passing Allocation for Global Address Spaces

Abstract
Constantly increasing hardware parallelism poses more and more challenges to programmers and language designers. One approach to harnessing this massive parallelism is to move to task-based programming models that rely on runtime systems for dependency analysis and scheduling. Such models generally benefit from the existence of a global address space. This paper presents the parallel memory allocator of the Myrmics runtime system, in which multiple allocator instances organized in a tree hierarchy cooperate to implement a global address space with dynamic region support on distributed memory machines. The Myrmics hierarchical memory allocator is a step towards improved productivity and performance in parallel programming. Productivity is improved through the use of dynamic regions in a global address space, which provide a convenient shared-memory abstraction for dynamic and irregular data structures. Performance is improved through scaling on manycore systems without system-wide cache coherency. We evaluate the stand-alone allocator on an MPI-based x86 cluster and find that it scales well for up to 512 worker cores, while it can outperform Unified Parallel C by a factor of 3.7-10.7x.
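To make the region abstraction concrete, the sketch below shows a minimal single-node region allocator in C. This is not the Myrmics API or its distributed, message-passing implementation; the names (`region_create`, `region_alloc`, `region_destroy`) and the parent pointer linking regions into a tree are illustrative assumptions. The key property it demonstrates is the one the abstract relies on: every object allocated in a region can be released in one bulk operation, which suits dynamic and irregular data structures.

```c
#include <stdlib.h>

/* Hypothetical region allocator sketch -- names and layout are
 * illustrative, not the Myrmics implementation. */

typedef struct block {
    struct block *next;   /* chain of chunks owned by one region */
    char data[];          /* payload handed out to the caller */
} block_t;

typedef struct region {
    struct region *parent;  /* enclosing region in the tree hierarchy */
    block_t *blocks;        /* all chunks allocated in this region */
} region_t;

/* Create a region, optionally nested under a parent region. */
region_t *region_create(region_t *parent) {
    region_t *r = calloc(1, sizeof *r);
    if (r) r->parent = parent;
    return r;
}

/* Allocate size bytes owned by region r. */
void *region_alloc(region_t *r, size_t size) {
    block_t *b = malloc(sizeof *b + size);
    if (!b) return NULL;
    b->next = r->blocks;
    r->blocks = b;
    return b->data;
}

/* Destroy a region, freeing every object allocated in it at once.
 * (A full implementation would also destroy child regions.) */
void region_destroy(region_t *r) {
    block_t *b = r->blocks;
    while (b) {
        block_t *next = b->next;
        free(b);
        b = next;
    }
    free(r);
}
```

In a distributed setting such as the one the paper targets, each allocator instance in the tree would own a slice of the global address space and serve requests for its subtree, forwarding to its parent when its slice is exhausted; the single-node sketch above captures only the region lifetime semantics.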