Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code
ICFP 2015: Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming

Abstract
Computers have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort, resulting in a tension between performance and code portability. Typically, code is either tuned in a low-level imperative language using hardware-specific optimizations to achieve maximum performance, or is written in a high-level, possibly functional, language to achieve portability at the expense of performance. We propose a novel approach aiming to combine high-level programming, code portability, and high performance. Starting from a high-level functional expression, we apply a simple set of rewrite rules to transform it into a low-level functional representation, close to the OpenCL programming model, from which OpenCL code is generated. Our rewrite rules define a space of possible implementations which we automatically explore to generate hardware-specific OpenCL implementations. We formalize our system with a core dependently-typed lambda-calculus along with a denotational semantics, which we use to prove the correctness of the rewrite rules. We test our design in practice by implementing a compiler which generates high-performance imperative OpenCL code. Our experiments show that we can automatically derive hardware-specific implementations from simple functional high-level algorithmic expressions, offering performance on a par with highly tuned code for multicore CPUs and GPUs written by experts.
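The rewrite-based approach described in the abstract can be illustrated with a minimal sketch. The pipeline representation, the `fuse` (map fusion) rule, and the `split_join` rule below are illustrative assumptions in the spirit of the paper, not its actual implementation: each rule rewrites a high-level expression into an equivalent one that exposes more low-level structure (e.g. a two-level split that could map onto OpenCL work-groups and work-items), and the rewrites preserve the result.

```python
# Hypothetical sketch of semantics-preserving rewrite rules over a tiny
# functional IR. A program is a pipeline of stages applied left to right:
#   ('map', f)    -- apply f to every element
#   ('split', n)  -- partition the array into chunks of size n
#   ('join',)     -- flatten chunks back into a single array

def run(pipe, xs):
    """Reference interpreter: the denotational meaning of a pipeline."""
    for tag, *args in pipe:
        if tag == 'map':
            f, = args
            xs = [f(x) for x in xs]
        elif tag == 'split':
            n, = args
            xs = [xs[i:i + n] for i in range(0, len(xs), n)]
        elif tag == 'join':
            xs = [x for chunk in xs for x in chunk]
    return xs

def fuse(pipe):
    """Rewrite rule (map fusion):  map f ; map g  ->  map (g . f)."""
    out = []
    for stage in pipe:
        if out and stage[0] == 'map' and out[-1][0] == 'map':
            f, g = out[-1][1], stage[1]
            out[-1] = ('map', lambda x, f=f, g=g: g(f(x)))
        else:
            out.append(stage)
    return out

def split_join(pipe, i, n):
    """Rewrite rule (split-join):  map f  ->  split n ; map (map f) ; join.
    Exposes a two-level structure, e.g. work-groups over work-items."""
    tag, f = pipe[i]
    assert tag == 'map'
    inner = ('map', lambda chunk, f=f: [f(x) for x in chunk])
    return pipe[:i] + [('split', n), inner, ('join',)] + pipe[i + 1:]

# High-level expression: map (+1) ; map (*2)
high = [('map', lambda x: x + 1), ('map', lambda x: x * 2)]
fused = fuse(high)              # one fused map
low = split_join(fused, 0, 4)   # chunked, hardware-mappable form
```

Applying either rule leaves the observable result unchanged, mirroring the correctness property the paper proves via its denotational semantics; an automatic search over such rule applications spans the space of hardware-specific implementations.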