research-article

Resilient X10: efficient failure-aware programming

Published: 06 February 2014

Abstract

Scale-out programs run on multiple processes in a cluster. In scale-out systems, processes can fail. Computations using traditional libraries such as MPI fail when any component process fails. The advent of MapReduce, Resilient Data Sets and MillWheel has shown that dramatic improvements in productivity are possible when a high-level programming framework handles scale-out and resilience automatically.

We are concerned with the development of general-purpose languages that support resilient programming. In this paper we show how the X10 language and implementation can be extended to support resilience. In Resilient X10, places may fail asynchronously, causing loss of the data and tasks at the failed place. Failure is exposed through exceptions. We identify a Happens Before Invariance Principle and require the runtime to automatically repair the global control structure of the program to maintain this principle. We show this reduces much of the burden of resilient programming. The programmer is only responsible for continuing execution with fewer computational resources and the loss of part of the heap, and can do so while taking advantage of domain knowledge.
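The programming model described above, where failure surfaces as an exception and the program carries on with fewer resources, can be illustrated with a small single-process simulation. This is only a sketch in Python, not X10 code and not the paper's runtime: `Place`, `DeadPlaceError`, and `resilient_sum_of_squares` are hypothetical names introduced here for illustration.

```python
class DeadPlaceError(Exception):
    """Hypothetical stand-in for the exception raised when a place has failed."""


class Place:
    """A simulated place: a unit of computation that may fail."""

    def __init__(self, pid):
        self.pid = pid
        self.alive = True

    def run(self, fn, *args):
        # Work sent to a failed place surfaces as an exception at the caller.
        if not self.alive:
            raise DeadPlaceError(f"place {self.pid} is dead")
        return fn(*args)


def resilient_sum_of_squares(places, chunks):
    """Sum squares over all chunks, moving work from failed places to survivors."""
    live = list(places)
    pending = list(chunks)
    total = 0
    i = 0
    while pending:
        if not live:
            raise RuntimeError("no surviving places")
        place = live[i % len(live)]
        try:
            total += place.run(lambda xs: sum(x * x for x in xs), pending[0])
            pending.pop(0)
            i += 1
        except DeadPlaceError:
            # Continue with fewer resources: drop the dead place and retry the chunk.
            live.remove(place)
    return total, [p.pid for p in live]
```

In this sketch the caller decides what to do after a failure (here, retry the chunk elsewhere), which mirrors the division of labour the abstract describes: the failure is reported, and the application chooses the recovery strategy.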

We build a complete implementation of the language, capable of executing benchmark applications on hundreds of nodes. We describe the algorithms required to make the language runtime resilient. We then give three applications, each with a different approach to fault tolerance (replay, decimation, and domain-level checkpointing). These can be executed at scale and survive node failure. We show that for these programs the overhead of resilience is a small fraction of overall runtime by comparing to equivalent non-resilient X10 programs. On one program we show end-to-end performance of Resilient X10 is ~100x faster than Hadoop.
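Two of the fault-tolerance approaches named above, replay and decimation, can be contrasted with a toy example. The sketch below is a plain Python illustration under stated assumptions, not the paper's applications: each worker holds a partial result `(sum, count)` for its chunk, a failure loses some of those partial results, and the input chunks are assumed to remain available for replay. All function names are hypothetical. Domain-level checkpointing is not shown.

```python
def compute_partials(chunks):
    """One partial result (sum, count) per chunk, keyed by worker index."""
    return {i: (sum(c), len(c)) for i, c in enumerate(chunks)}


def lose(partials, failed):
    """Simulate failure: partial results held by failed workers disappear."""
    return {i: p for i, p in partials.items() if i not in failed}


def recover_by_replay(partials, chunks):
    """Replay: recompute every missing partial result from its input chunk."""
    repaired = dict(partials)
    for i, c in enumerate(chunks):
        if i not in repaired:
            repaired[i] = (sum(c), len(c))
    return repaired


def mean(partials):
    """Combine partial results into a mean over whatever data they cover."""
    total = sum(s for s, _ in partials.values())
    count = sum(n for _, n in partials.values())
    return total / count


chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
surviving = lose(compute_partials(chunks), failed={2})

exact = mean(recover_by_replay(surviving, chunks))  # replay: same answer as without failure
approximate = mean(surviving)                       # decimation: answer over surviving data only
```

Replay pays extra computation to reproduce the failure-free result, while decimation accepts a result computed from less data; which trade-off is acceptable depends on the application.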

    • Published in

      ACM SIGPLAN Notices, Volume 49, Issue 8
      PPoPP '14
      August 2014
      390 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/2692916

      PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
      February 2014
      412 pages
      ISBN: 9781450326568
      DOI: 10.1145/2555243

      Copyright © 2014 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States
