OPR: deterministic group replay for one-sided communication

Published: 27 February 2016

Abstract

The ability to reproduce a parallel execution is desirable for debugging and program reliability purposes. In debugging (13), the programmer needs to manually step back in time, while for resilience (6) this is performed automatically by the application upon failure. To be useful, replay has to faithfully reproduce the original execution. For parallel programs, the main challenge is inferring and maintaining the order of conflicting operations (data races). Deterministic record and replay (R&R) techniques have been developed for multithreaded shared memory programs (5), as well as distributed memory programs (14). Our main interest is in techniques for large-scale scientific (3; 4) programming models.

References

  1. Berkeley UPC. http://upc.lbl.gov.
  2. The NAS Parallel Benchmarks. Available at http://www.nas.nasa.gov/Software/NPB.
  3. UPC Home Page. http://upc-lang.org.
  4. MPI: A Message-Passing Interface Standard. Version 3.0. Message Passing Interface Forum, 2012.
  5. J.-D. Choi and H. Srinivasan. Deterministic Replay of Java Multithreaded Applications. In Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools, SPDT '98, 1998.
  6. E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A Survey of Rollback-recovery Protocols in Message-passing Systems. ACM Computing Surveys, 34(3):375--408, September 2002.
  7. E. Georganas, A. Buluç, J. Chapman, L. Oliker, D. Rokhsar, and K. Yelick. Parallel De Bruijn Graph Construction and Traversal for De Novo Genome Assembly. In Proceedings of the 26th ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2014.
  8. T. J. LeBlanc and J. M. Mellor-Crummey. Debugging Parallel Programs with Instant Replay. IEEE Transactions on Computers, 36(4):471--482, April 1987.
  9. S.-J. Min, C. Iancu, and K. Yelick. Hierarchical Work Stealing on Manycore Clusters. In Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models (PGAS), October 2011.
  10. S. Narayanasamy, C. Pereira, H. Patil, R. Cohn, and B. Calder. Automatic Logging of Operating System Effects to Guide Application-level Architecture Simulation. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '06/Performance '06, 2006.
  11. S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: An Unbalanced Tree Search Benchmark. In Proceedings of the 19th International Conference on Languages and Compilers for Parallel Computing, LCPC'06, 2007.
  12. H. Patil, C. Pereira, M. Stallcup, G. Lueck, and J. Cownie. PinPlay: A Framework for Deterministic Replay and Reproducible Analysis of Parallel Programs. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '10, 2010.
  13. J. Sloan, R. Kumar, and G. Bronevetsky. Large Scale Debugging of Parallel Tasks with AutomaDeD. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
  14. R. Xue, X. Liu, M. Wu, Z. Guo, W. Chen, W. Zheng, and G. Voelker. MPIWiz: Subgroup Reproducible Replay of MPI Applications. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 251--260. ACM, February 2009.

      • Published in

        ACM SIGPLAN Notices, Volume 51, Issue 8
        PPoPP '16
        August 2016
        405 pages
        ISSN:0362-1340
        EISSN:1558-1160
        DOI:10.1145/3016078
        • PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
          February 2016
          420 pages
          ISBN:9781450340922
          DOI:10.1145/2851141

        Copyright © 2016 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 27 February 2016

        Qualifiers

        • research-article
