research-article

The HdpH DSLs for scalable reliable computation

Published: 3 September 2014

Abstract

The statelessness of functional computations facilitates both parallelism and fault recovery. Faults and non-uniform communication topologies are key challenges for emergent large-scale parallel architectures. We report on HdpH and HdpH-RS, a pair of Haskell DSLs designed to address these challenges for irregular task-parallel computations on large distributed-memory architectures. Both DSLs share an API combining explicit task placement with sophisticated work stealing. HdpH focuses on scalability by making placement and stealing topology-aware, whereas HdpH-RS delivers reliability by means of fault-tolerant work stealing.

We present operational semantics for both DSLs and investigate conditions for semantic equivalence of HdpH and HdpH-RS programs, that is, conditions under which topology awareness can be transparently traded for fault tolerance. We detail how the DSL implementations realise topology awareness and fault tolerance. We report an initial evaluation of scalability and fault tolerance on a 256-core cluster and on up to 32K cores of an HPC platform.

Published in

ACM SIGPLAN Notices, Volume 49, Issue 12 (Haskell '14), December 2014, 141 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2775050
Editor: Andy Gill

Haskell '14: Proceedings of the 2014 ACM SIGPLAN Symposium on Haskell, September 2014, 154 pages
ISBN: 9781450330411
DOI: 10.1145/2633357

Copyright © 2014 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
