Futhark: purely functional GPU-programming with nested parallelism and in-place array updates

Published: 14 June 2017

Abstract

Futhark is a purely functional data-parallel array language that offers a machine-neutral programming model and an optimising compiler that generates OpenCL code for GPUs.

This paper presents the design and implementation of three key features of Futhark that seek a suitable middle ground with imperative approaches.

First, in order to express efficient code inside the parallel constructs, we introduce a simple type system for in-place updates that ensures referential transparency and supports equational reasoning.
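The idea behind such a type system can be illustrated with a minimal sketch. It assumes a hypothetical `UniqueArray` handle (not part of Futhark) and checks at run time what the abstract describes as a static type discipline: an update reuses the underlying buffer, and the old name is "consumed" so that no later code can observe the mutation.

```python
class ConsumedError(RuntimeError):
    """Raised when a handle is used after an in-place update consumed it."""


class UniqueArray:
    """Hypothetical handle modelling a uniquely owned array."""

    def __init__(self, data):
        self._buf = list(data)  # private buffer, owned by this handle only
        self._live = True

    def _check(self):
        if not self._live:
            raise ConsumedError("array was consumed by an in-place update")

    def get(self, i):
        self._check()
        return self._buf[i]

    def update(self, i, v):
        """Return the updated array in O(1) by reusing the buffer.

        The receiver is consumed: any later use of the old handle fails,
        so callers can never see the value change under an existing name.
        """
        self._check()
        buf = self._buf
        buf[i] = v
        self._live = False
        self._buf = None
        new = UniqueArray.__new__(UniqueArray)  # wrap the same buffer, no copy
        new._buf = buf
        new._live = True
        return new

    def to_list(self):
        self._check()
        return list(self._buf)
```

Because every observable name always denotes one fixed value, code written against this interface can still be reasoned about equationally, even though no copy is made. A static type system rejects the offending programs at compile time instead of raising an error at run time.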

Second, we furnish Futhark with parallel operators capable of expressing efficient strength-reduced code, along with their fusion rules.
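As a rough illustration of what a fusion rule buys, the sketch below shows the well-known map-reduce fusion in sequential Python. The function names are hypothetical and this is not the paper's operator set or rule system; it only shows that a `map` followed by a `reduce` can be rewritten into a single pass that never materialises the intermediate array.

```python
from functools import reduce


def map_then_reduce(f, op, ne, xs):
    """Unfused form: builds the intermediate array ys, then reduces it."""
    ys = [f(x) for x in xs]
    return reduce(op, ys, ne)


def fused_map_reduce(f, op, ne, xs):
    """Fused form: applies f and accumulates with op in one traversal."""
    acc = ne
    for x in xs:
        acc = op(acc, f(x))
    return acc
```

The two forms agree for any `f`, `op`, and starting value `ne`. Running the reduction in parallel additionally requires `op` to be associative with `ne` as its neutral element, which this sequential sketch does not exploit.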

Third, we present a flattening transformation aimed at enhancing the degree of parallelism that (i) builds on loop interchange and distribution but uses higher-order reasoning rather than array-dependence analysis, and (ii) still allows further locality-of-reference optimisations. Finally, an evaluation on 16 benchmarks demonstrates the impact of the language and compiler features and shows application-level performance competitive with hand-written GPU code.
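The two classical rewrites that the flattening transformation builds on, loop distribution and loop interchange, can be sketched on small examples. The code below is an assumption-laden illustration in sequential Python with hypothetical function names, not the compiler's algorithm. The rewrites are legal here because each `map` body is independent across iterations, a property that is visible from the operator itself and needs no array-dependence analysis.

```python
def nested(xss):
    """Outer map whose body holds two inner maps (an imperfect nest)."""
    out = []
    for row in xss:                   # outer map over rows
        t = [x + 1 for x in row]      # inner map 1
        u = [2 * y for y in t]        # inner map 2
        out.append(u)
    return out


def distributed(xss):
    """After distribution: two perfect map nests, one per inner map."""
    tss = [[x + 1 for x in row] for row in xss]
    return [[2 * y for y in t] for t in tss]


def map_of_loop(xs, n):
    """Outer map around a sequential loop."""
    out = []
    for x in xs:                      # parallel dimension
        acc = x
        for i in range(n):            # sequential dimension
            acc = 2 * acc + i
        out.append(acc)
    return out


def loop_of_map(xs, n):
    """After interchange: sequential loop around a map."""
    accs = list(xs)
    for i in range(n):                # sequential dimension moved outwards
        accs = [2 * a + i for a in accs]  # parallel dimension now innermost
    return accs
```

In both pairs the rewritten version exposes perfectly nested parallel operations, which is the form that maps directly onto a GPU kernel.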

Published in

ACM SIGPLAN Notices, Volume 52, Issue 6 (PLDI '17), June 2017, 708 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3140587

PLDI 2017: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2017, 708 pages
ISBN: 9781450349888
DOI: 10.1145/3062341

Copyright © 2017 ACM

Publisher

Association for Computing Machinery, New York, NY, United States
