Abstract
Emerging trends in computer design and use are likely to make exceptions, once rare, the norm, especially as the system size grows. Due to exceptions, arising from hardware faults, approximate computing, dynamic resource management, etc., successful and error-free execution of programs may no longer be assured. Yet, designers will want to tolerate the exceptions so that the programs execute completely, efficiently and without external intervention.
Modern computers easily handle exceptions in sequential programs using precise interrupts, but they are ill-equipped to handle exceptions in parallel programs, which are growing in prevalence. In this work we introduce the notion of globally precise-restartable execution of parallel programs, analogous to precise-interruptible execution of sequential programs. Based on this approach, we present a software runtime recovery system that handles exceptions in suitably written parallel programs. Qualitative and quantitative analyses show that, unlike the conventional checkpoint-and-recovery method, the proposed system scales with system size, especially when exceptions are frequent.
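To make the analogy with precise interrupts concrete, the following is a minimal sketch and not the paper's runtime system. It assumes tasks have a fixed program order, buffer their writes, and do not depend on each other's writes; `TransientFault`, `make_task`, and `run_precise_restartable` are hypothetical names introduced only for illustration. Tasks run in parallel, but their writes are committed in program order, so when a task raises an exception the shared state reflects exactly the tasks that precede it, and execution restarts from the faulting task.

```python
from concurrent.futures import ThreadPoolExecutor


class TransientFault(Exception):
    """Hypothetical stand-in for a recoverable exception (e.g. a hardware fault)."""


def make_task(key, value, fail_times=0):
    """Build an illustrative task that writes one key and fails `fail_times` times first."""
    remaining = [fail_times]

    def task(snapshot):
        # `snapshot` is a read-only view of the committed state (unused here).
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFault(key)
        return {key: value}  # buffered writes, applied only at commit time

    return task


def run_precise_restartable(tasks, state, max_restarts=10):
    """Run tasks in parallel, commit in program order, restart at the faulting task.

    Assumption: tasks are independent of each other's writes, so re-running
    a squashed task against a newer snapshot yields the same writes.
    """
    next_to_commit = 0
    restarts = 0
    with ThreadPoolExecutor() as pool:
        while next_to_commit < len(tasks):
            snapshot = dict(state)
            futures = [pool.submit(t, snapshot) for t in tasks[next_to_commit:]]
            for fut in futures:
                try:
                    writes = fut.result()
                except TransientFault:
                    restarts += 1
                    if restarts > max_restarts:
                        raise
                    # Squash the faulting task and every later one: their
                    # buffered writes are dropped and never reach `state`.
                    break
                state.update(writes)  # in-order commit
                next_to_commit += 1
    return state, restarts


if __name__ == "__main__":
    tasks = [make_task("a", 1), make_task("b", 2, fail_times=1), make_task("c", 3)]
    final_state, restarts = run_precise_restartable(tasks, {})
    print(final_state, restarts)
```

In this sketch, recovery re-executes only the faulting task and the tasks after it, whereas a checkpoint-and-recovery scheme would roll the whole program back to its last checkpoint. The in-order commit used here is one simple way to obtain a precise restart point and is an assumption of the example.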
Globally precise-restartable execution of parallel programs
PLDI '14: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation