Abstract
This paper introduces a programming interface called PARRAY (Parallelizing ARRAYs) that supports succinct system-level programming for heterogeneous parallel systems such as GPU clusters. Current software-development practice requires combining several low-level libraries such as Pthreads, OpenMP, CUDA and MPI, and achieving productivity and portability across different numbers and models of GPUs is hard. PARRAY extends mainstream C programming with novel array types with distinct features: 1) the dimensions of an array type are nested in a tree, conceptually reflecting the memory hierarchy; 2) the definition of an array type may contain references to other array types, allowing sophisticated array types to be created for parallelization; 3) threads also form arrays, enabling a Single-Program-Multiple-Codeblock (SPMC) programming style that unifies various sophisticated communication patterns. This leads to shorter, more portable and maintainable parallel code, while the programmer retains control over the performance-related features needed for deep manual optimization. Although the source-to-source code generator only faithfully generates low-level library calls from the type information, higher-level programming and automatic performance optimization remain possible by building libraries of sub-programs on top of PARRAY. A case study on cluster FFT illustrates a simple 30-line code that outperforms Intel Cluster MKL by 2x on the Tianhe-1A system with 7168 Fermi GPUs and 14336 CPUs.
PARRAY: a unifying array representation for heterogeneous parallelism