Optimising purely functional GPU programs

Published: 25 September 2013
Abstract

Purely functional, embedded array programs are a good match for SIMD hardware, such as GPUs. However, the naive compilation of such programs quickly leads to both code explosion and an excessive use of intermediate data structures. The resulting slow-down is not acceptable on target hardware that is usually chosen to achieve high performance.

In this paper, we discuss two optimisation techniques, sharing recovery and array fusion, that tackle code explosion and eliminate superfluous intermediate structures. Both techniques are well known from other contexts, but they present unique challenges for an embedded language compiled for execution on a GPU. We present novel methods for implementing sharing recovery and array fusion, and demonstrate their effectiveness on a set of benchmarks.
