ABSTRACT
Programming languages using functions on collections of values, such as map, reduce, scan and filter, have been used for over fifty years. Such functions have proven to be particularly useful in the context of parallelism because they are naturally parallel. However, if implemented naively, they generate temporary intermediate collections that can significantly increase memory usage and runtime. To avoid this pitfall, many approaches use "fusion" to combine operations and avoid temporary results. However, most of these approaches involve significant changes to a compiler and are limited to a small set of functions, such as maps and reduces.
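As a rough illustration of the problem fusion addresses, the following Python sketch contrasts an unfused map-then-reduce, which materializes a temporary list, with a hand-fused loop that does not. The function names are hypothetical and chosen only for this example; they do not come from the paper's libraries.

```python
def sum_of_squares_unfused(xs):
    # The map step materializes a temporary list with one entry per input.
    squares = [x * x for x in xs]
    # The reduce step then reads that temporary list back.
    return sum(squares)


def sum_of_squares_fused(xs):
    # Map and reduce are combined into one pass; no temporary list is built.
    total = 0
    for x in xs:
        total += x * x
    return total
```

Both functions return the same value; the difference is only in how much intermediate memory is allocated.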
In this paper we present a library-based approach that fuses widely used operations such as scans, filters, and flattens. In conjunction with existing techniques, this covers most of the common operations on collections. Our approach is based on a novel technique which parallelizes over blocks, with streams within each block. We demonstrate the approach by implementing libraries targeting multicore parallelism in two languages: Parallel ML and C++, which have very different semantics and compilers. To help users understand when to use the approach, we define a cost semantics that indicates when fusion occurs and how it reduces memory allocations. We present experimental results for a dozen benchmarks that demonstrate significant reductions in both time and space. In most cases the approach generates code that is near optimal for the machines it is running on.
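The idea of parallelizing over blocks while streaming within each block can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's library: the function `fused_map_scan_filter`, its parameters, and the use of the standard three-phase blocked scan are all choices made for this example. It computes inclusive prefix sums of `f(x)` and keeps those that satisfy `keep`, without ever storing the mapped values or the full scan result.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def fused_map_scan_filter(xs, f, keep, block_size=4):
    """Prefix sums of f(x) over xs, keeping sums p where keep(p) is true.

    Hypothetical sketch: only one partial sum per block and the final
    output are materialized; mapped values are recomputed, not stored.
    """
    n = len(xs)
    blocks = [(lo, min(lo + block_size, n)) for lo in range(0, n, block_size)]

    def block_sum(block):
        lo, hi = block
        # Stream through the block: map and reduce in a single pass.
        return sum(f(xs[i]) for i in range(lo, hi))

    def block_stream(args):
        (lo, hi), offset = args
        out = []
        acc = offset
        # Stream through the block again: map, scan and filter fused.
        for i in range(lo, hi):
            acc += f(xs[i])
            if keep(acc):
                out.append(acc)
        return out

    with ThreadPoolExecutor() as pool:
        # Phase 1 (parallel over blocks): one partial sum per block.
        sums = list(pool.map(block_sum, blocks))
        # Phase 2 (sequential, small): exclusive scan of the block sums.
        offsets = [0] + list(accumulate(sums))[:-1]
        # Phase 3 (parallel over blocks): emit the filtered prefix sums.
        parts = list(pool.map(block_stream, zip(blocks, offsets)))

    # Flatten the per-block outputs in block order.
    return [p for part in parts for p in part]
```

For example, `fused_map_scan_filter(list(range(1, 11)), lambda x: 2 * x, lambda p: p % 4 == 0)` yields the same list as filtering `accumulate` over the mapped input sequentially. Note that `f` is evaluated twice per element in this sketch, trading recomputation for memory; Python threads are used only to show the block structure, since the interpreter lock limits actual speedup.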
Index Terms
- Parallel block-delayed sequences