Abstract
Purely functional, embedded array programs are a good match for SIMD hardware, such as GPUs. However, the naive compilation of such programs quickly leads to both code explosion and an excessive use of intermediate data structures. The resulting slow-down is not acceptable on target hardware that is usually chosen to achieve high performance.
In this paper, we discuss two optimisation techniques, sharing recovery and array fusion, that tackle code explosion and eliminate superfluous intermediate structures. Both techniques are well known from other contexts, but they present unique challenges for an embedded language compiled for execution on a GPU. We present novel methods for implementing sharing recovery and array fusion, and demonstrate their effectiveness on a set of benchmarks.
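To make the two problems concrete, the following is a minimal sketch in Python rather than the paper's own implementation. It assumes a hypothetical toy embedded language (`Const`, `Add`) and a hypothetical delayed-array type (`Delayed`); none of these names come from the paper. The first half shows how a term that reuses a subexpression blows up when it is traversed as a tree, and how tracking node identity recovers the sharing. The second half shows one common way to fuse two maps so that no intermediate array is built.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


class Expr:
    """Base class of a hypothetical embedded expression language."""

    def __add__(self, other: "Expr") -> "Expr":
        return Add(self, other)


@dataclass(eq=False)  # eq=False keeps identity-based equality and hashing
class Const(Expr):
    value: int


@dataclass(eq=False)
class Add(Expr):
    left: Expr
    right: Expr


def tree_size(e: Expr) -> int:
    """Count nodes as a naive compiler would: shared subterms are recounted."""
    if isinstance(e, Add):
        return 1 + tree_size(e.left) + tree_size(e.right)
    return 1


def shared_size(e: Expr, seen: Optional[Set[int]] = None) -> int:
    """Count each distinct node once, using object identity to detect sharing."""
    seen = set() if seen is None else seen
    if id(e) in seen:
        return 0
    seen.add(id(e))
    if isinstance(e, Add):
        return 1 + shared_size(e.left, seen) + shared_size(e.right, seen)
    return 1


def doubled(n: int) -> Expr:
    """Build a term in which every level reuses the previous level twice."""
    e: Expr = Const(1)
    for _ in range(n):
        e = e + e
    return e


@dataclass
class Delayed:
    """Hypothetical delayed array: an extent plus a function from index to value."""
    size: int
    index: Callable[[int], int]


def amap(f: Callable[[int], int], arr: Delayed) -> Delayed:
    """Map by composing index functions; no intermediate array is allocated."""
    return Delayed(arr.size, lambda i: f(arr.index(i)))


def materialise(arr: Delayed) -> List[int]:
    """Evaluate the fused index function once per element."""
    return [arr.index(i) for i in range(arr.size)]
```

In this sketch, `doubled(10)` has 2047 nodes when viewed as a tree but only 11 distinct nodes once sharing is taken into account, and `materialise(amap(f, amap(g, xs)))` runs a single loop over `xs`.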
Optimising purely functional GPU programs. ICFP '13: Proceedings of the 18th ACM SIGPLAN International Conference on Functional Programming.