Abstract
Futhark is a purely functional data-parallel array language that offers a machine-neutral programming model and an optimising compiler that generates OpenCL code for GPUs.
This paper presents the design and implementation of three key features of Futhark that seek a suitable middle ground with imperative approaches.
First, in order to express efficient code inside the parallel constructs, we introduce a simple type system for in-place updates that ensures referential transparency and supports equational reasoning.
Second, we furnish Futhark with parallel operators capable of expressing efficient strength-reduced code, along with their fusion rules.
Third, we present a flattening transformation aimed at enhancing the degree of parallelism that (i) builds on loop interchange and distribution but uses higher-order reasoning rather than array-dependence analysis, and (ii) still allows further locality-of-reference optimisations. Finally, an evaluation on 16 benchmarks demonstrates the impact of the language and compiler features and shows application-level performance competitive with hand-written GPU code.
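The abstract mentions fusion rules for the parallel operators without stating them. As a rough illustration only, the sketch below assumes the textbook rule that two consecutive maps can be fused into one (map f after map g equals map of f composed with g), and that a map feeding a reduction can be folded into a single pass. The function names (`unfused`, `fused`, `map_reduce_unfused`, `map_reduce_fused`) are hypothetical and sequential Python stands in for the parallel constructs; this is not Futhark's implementation.

```python
from functools import reduce


def unfused(f, g, xs):
    # Two traversals: the first materialises an intermediate list.
    tmp = [g(x) for x in xs]
    return [f(y) for y in tmp]


def fused(f, g, xs):
    # One traversal, no intermediate list: map f . map g == map (f . g).
    return [f(g(x)) for x in xs]


def map_reduce_unfused(op, neutral, f, xs):
    # A map that produces a temporary list, followed by a reduction over it.
    return reduce(op, [f(x) for x in xs], neutral)


def map_reduce_fused(op, neutral, f, xs):
    # The mapped function is applied inside the reduction loop,
    # so the temporary list is never built.
    acc = neutral
    for x in xs:
        acc = op(acc, f(x))
    return acc


if __name__ == "__main__":
    xs = [1, 2, 3, 4]
    print(fused(lambda x: x + 1, lambda x: x * 2, xs))
    print(map_reduce_fused(lambda a, b: a + b, 0, lambda x: x * x, xs))
```

Both fused variants compute the same results as their unfused counterparts while avoiding the intermediate array, which is the kind of saving such rewrite rules aim for.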
Supplemental Material
This archive contains the test harness used to produce the results in the paper 'Futhark: Purely Functional GPU-Programming with Nested Parallelism and In-Place Array Updates', as well as the source code to the Futhark compiler. The file README.md contains detailed instructions on how to set up and use the harness.
Futhark: Purely Functional GPU-Programming with Nested Parallelism and In-Place Array Updates
PLDI 2017: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation