ABSTRACT
Anticipated high failure rates have made resiliency an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing work on novel fault resilience technologies for HPC: proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. It places these individual technologies in the context of a proposed holistic fault resilience framework.
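To illustrate the kind of trade-off that arises when preemptive migration is combined with checkpoint/restart, here is a minimal sketch. It is not the model from this work. It assumes Young's first-order approximation for the checkpoint interval, and it assumes that every correctly predicted failure is avoided by migration at no cost, so only unpredicted failures reach checkpoint/restart. The function names and the `predicted_fraction` parameter are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    Assumption: interval = sqrt(2 * checkpoint cost * MTBF), all in seconds.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def effective_mtbf(mtbf_s: float, predicted_fraction: float) -> float:
    """MTBF seen by checkpoint/restart when some failures are avoided.

    Assumption: a fraction `predicted_fraction` of failures is predicted in
    time and avoided by migration, so only the remainder causes a restart.
    """
    if not 0.0 <= predicted_fraction < 1.0:
        raise ValueError("predicted_fraction must be in [0, 1)")
    return mtbf_s / (1.0 - predicted_fraction)


if __name__ == "__main__":
    cost_s, mtbf_s = 50.0, 10_000.0  # illustrative values only
    for fraction in (0.0, 0.5, 0.75):
        interval = young_interval(cost_s, effective_mtbf(mtbf_s, fraction))
        print(f"predicted fraction {fraction:.2f}: checkpoint every {interval:.0f} s")
```

Under these assumptions, better failure prediction lengthens the interval between checkpoints, which is the basic reason the two mechanisms are worth tuning together rather than separately.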
A tunable holistic resiliency approach for high-performance computing systems
PPoPP '09