DOI: 10.1145/3503221.3508434 · ACM Conferences · PPoPP Conference Proceedings

Parallel block-delayed sequences

Published: 28 March 2022

ABSTRACT

Programming languages using functions on collections of values, such as map, reduce, scan and filter, have been used for over fifty years. Such collections have proven to be particularly useful in the context of parallelism because such functions are naturally parallel. However, if implemented naively they lead to the generation of temporary intermediate collections that can significantly increase memory usage and runtime. To avoid this pitfall, many approaches use "fusion" to combine operations and avoid temporary results. However, most of these approaches involve significant changes to a compiler and are limited to a small set of functions, such as maps and reduces.

In this paper we present a library-based approach that fuses widely used operations such as scans, filters, and flattens. In conjunction with existing techniques, this covers most of the common operations on collections. Our approach is based on a novel technique which parallelizes over blocks, with streams within each block. We demonstrate the approach by implementing libraries targeting multicore parallelism in two languages: Parallel ML and C++, which have very different semantics and compilers. To help users understand when to use the approach, we define a cost semantics that indicates when fusion occurs and how it reduces memory allocations. We present experimental results for a dozen benchmarks that demonstrate significant reductions in both time and space. In most cases the approach generates code that is near optimal for the machines it is running on.
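The idea of parallelizing over blocks while streaming within each block can be illustrated with a small sketch. The code below is not taken from the paper's libraries; the function name `fused_map_scan_reduce`, the helper `for_each_block`, and the choice of a map followed by a prefix-sum scan and a final sum are all assumptions made for illustration. It computes the sum of all inclusive prefix sums of `f(0), ..., f(n-1)` while allocating only one value per block, rather than one per element. The block loop here is sequential for portability; its iterations are independent, so a parallel-for from a multicore runtime could stand in for it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a parallel-for over blocks. The iterations are
// independent, so a multicore runtime could execute them concurrently.
template <typename Body>
void for_each_block(std::size_t num_blocks, Body body) {
    for (std::size_t b = 0; b < num_blocks; ++b) body(b);
}

// Hypothetical helper: returns the sum over i in [0, n) of the inclusive
// prefix sum f(0) + ... + f(i), without materializing the mapped sequence
// or the scanned sequence.
template <typename F>
std::uint64_t fused_map_scan_reduce(std::size_t n, std::size_t block_size, F f) {
    assert(block_size > 0);
    const std::size_t num_blocks = (n + block_size - 1) / block_size;

    // Only O(n / block_size) temporary space is allocated.
    std::vector<std::uint64_t> offset(num_blocks, 0);
    std::vector<std::uint64_t> partial(num_blocks, 0);

    // Phase 1: each block streams over its elements and records their sum.
    for_each_block(num_blocks, [&](std::size_t b) {
        const std::size_t lo = b * block_size;
        const std::size_t hi = std::min(n, lo + block_size);
        std::uint64_t s = 0;
        for (std::size_t i = lo; i < hi; ++i) s += f(i);
        offset[b] = s;
    });

    // Phase 2: exclusive scan over the per-block sums (short, so sequential).
    std::uint64_t running = 0;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::uint64_t s = offset[b];
        offset[b] = running;
        running += s;
    }

    // Phase 3: each block streams again, starting from its offset, and feeds
    // every scanned value straight into the reduction.
    for_each_block(num_blocks, [&](std::size_t b) {
        const std::size_t lo = b * block_size;
        const std::size_t hi = std::min(n, lo + block_size);
        std::uint64_t prefix = offset[b];
        std::uint64_t acc = 0;
        for (std::size_t i = lo; i < hi; ++i) {
            prefix += f(i);  // scanned element, never stored
            acc += prefix;   // consumed immediately by the reduce
        }
        partial[b] = acc;
    });

    // Combine the per-block results.
    std::uint64_t total = 0;
    for (std::uint64_t p : partial) total += p;
    return total;
}
```

For example, with `f(i) = i + 1` and `n = 10` the prefix sums are the triangular numbers 1, 3, 6, ..., 55, so the result is 220 for any positive block size. The sketch evaluates `f` twice per element, trading recomputation for the memory that a materialized intermediate sequence would otherwise need.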

