Abstract
The statelessness of functional computations facilitates both parallelism and fault recovery. Faults and non-uniform communication topologies are key challenges for emergent large-scale parallel architectures. We report on HdpH and HdpH-RS, a pair of Haskell DSLs designed to address these challenges for irregular task-parallel computations on large distributed-memory architectures. Both DSLs share an API combining explicit task placement with sophisticated work stealing. HdpH focuses on scalability by making placement and stealing topology-aware, whereas HdpH-RS delivers reliability by means of fault-tolerant work stealing.
We present operational semantics for both DSLs and investigate conditions for semantic equivalence of HdpH and HdpH-RS programs, that is, conditions under which topology awareness can be transparently traded for fault tolerance. We detail how the DSL implementations realise topology awareness and fault tolerance. We report an initial evaluation of scalability and fault tolerance on a 256-core cluster and on up to 32K cores of an HPC platform.
The HdpH DSLs for scalable reliable computation
Haskell '14: Proceedings of the 2014 ACM SIGPLAN symposium on Haskell