Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code

Published: 29 August 2015

Abstract

Computers have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort resulting in a tension between performance and code portability. Typically, code is either tuned in a low-level imperative language using hardware-specific optimizations to achieve maximum performance or is written in a high-level, possibly functional, language to achieve portability at the expense of performance. We propose a novel approach aiming to combine high-level programming, code portability, and high-performance. Starting from a high-level functional expression we apply a simple set of rewrite rules to transform it into a low-level functional representation, close to the OpenCL programming model, from which OpenCL code is generated. Our rewrite rules define a space of possible implementations which we automatically explore to generate hardware-specific OpenCL implementations. We formalize our system with a core dependently-typed lambda-calculus along with a denotational semantics which we use to prove the correctness of the rewrite rules. We test our design in practice by implementing a compiler which generates high performance imperative OpenCL code. Our experiments show that we can automatically derive hardware-specific implementations from simple functional high-level algorithmic expressions offering performance on a par with highly tuned code for multicore CPUs and GPUs written by experts.
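The abstract does not reproduce the rewrite rules themselves. As a hedged illustration of the kind of semantics-preserving rewrite it describes, the sketch below encodes a representative split-join rule, which turns a flat map into a map over chunks and thereby exposes nesting that can later be mapped onto OpenCL work-groups and work-items. The names `split`, `join`, and the chunk size are assumptions for illustration, not the paper's syntax:

```python
# A representative rewrite rule of the kind the abstract describes
# (hedged sketch, not the paper's exact formulation):
#
#   map f  ==  join . map (map f) . split n
#
# Both sides compute the same result; the right-hand side exposes a
# two-level structure suitable for work-group/work-item parallelism.

def split(n, xs):
    """Partition xs into consecutive chunks of length n (assumes len(xs) % n == 0)."""
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    """Flatten one level of nesting; the inverse of split."""
    return [x for chunk in xss for x in chunk]

def apply_split_join(f, n, xs):
    """Right-hand side of the rule: join(map(map f)(split(n, xs)))."""
    return join([[f(x) for x in chunk] for chunk in split(n, xs)])

xs = list(range(8))
f = lambda x: x * x
# The rule is semantics-preserving: both sides agree on every input.
assert [f(x) for x in xs] == apply_split_join(f, 4, xs)
```

Because both sides are extensionally equal, a search procedure is free to pick whichever form maps best onto the target hardware, which is the essence of the implementation space the paper explores.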


Published in

ACM SIGPLAN Notices, Volume 50, Issue 9 (ICFP '15), September 2015, 436 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2858949
Editor: Andy Gill

Also published in: ICFP 2015: Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming, August 2015, 436 pages, ISBN: 9781450336697, DOI: 10.1145/2784731

Copyright © 2015 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: research-article
