research-article

Globally precise-restartable execution of parallel programs

Published: 09 June 2014

Abstract

Emerging trends in computer design and use are likely to make exceptions, once rare, the norm, especially as the system size grows. Due to exceptions, arising from hardware faults, approximate computing, dynamic resource management, etc., successful and error-free execution of programs may no longer be assured. Yet, designers will want to tolerate the exceptions so that the programs execute completely, efficiently and without external intervention.

Modern computers easily handle exceptions in sequential programs, using precise interrupts. But they are ill-equipped to handle exceptions in parallel programs, which are growing in prevalence. In this work we introduce the notion of globally precise-restartable execution of parallel programs, analogous to precise-interruptible execution of sequential programs. We present a software runtime recovery system based on the approach to handle exceptions in suitably-written parallel programs. Qualitative and quantitative analyses show that the proposed system scales with the system size, especially when exceptions are frequent, unlike the conventional checkpoint-and-recovery method.

Published in

ACM SIGPLAN Notices, Volume 49, Issue 6 (PLDI '14)
June 2014, 598 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2666356
Editor: Andy Gill

PLDI '14: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2014, 619 pages
ISBN: 9781450327848
DOI: 10.1145/2594291

Copyright © 2014 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
