Abstract
Fault tolerance is increasingly important in high performance computing as system scale grows substantially and system reliability decreases. In-memory (diskless) checkpointing has gained extensive attention because it avoids the I/O bottleneck of traditional disk-based checkpointing. However, applications that use existing in-memory checkpoint methods are left with little available memory. To provide high reliability, these methods either keep two copies of each checkpoint, so that a failure can be tolerated while the old checkpoint is being updated, or trade performance for space by flushing in-memory checkpoints to disk.
In this paper, we propose a novel in-memory checkpoint method, called self-checkpoint, which not only achieves the same reliability as previous in-memory checkpoint methods but also increases the memory available to applications by almost 50%. To validate our method, we apply self-checkpoint to an important problem, fault-tolerant HPL. We implement a scalable, fault-tolerant HPL based on this method, called SKT-HPL, and validate it on two large-scale systems. Experimental results with 24,576 processes show that SKT-HPL achieves over 95% of the performance of the original HPL. Compared with the state-of-the-art in-memory checkpoint method, it increases the available memory size by 47% and the performance by 5%.
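The memory trade-off described in the abstract can be illustrated with a rough back-of-the-envelope sketch. It is not the paper's model: it assumes, purely for illustration, that every in-memory checkpoint copy is as large as the application data and that encoding overhead is zero. The function names `app_memory_fraction` and `relative_gain` are hypothetical. Under these assumptions, a scheme that keeps two checkpoint copies leaves one third of memory to the application, while a hypothetical scheme that needs only one checkpoint-sized region leaves one half, which is a 50% relative gain.

```python
from fractions import Fraction


def app_memory_fraction(checkpoint_copies: int) -> Fraction:
    """Fraction of memory left for application data.

    Assumption (illustrative only): each checkpoint copy occupies as much
    memory as the application data itself, and encoding overhead is ignored.
    """
    if checkpoint_copies < 0:
        raise ValueError("checkpoint_copies must be non-negative")
    return Fraction(1, 1 + checkpoint_copies)


def relative_gain(old: Fraction, new: Fraction) -> Fraction:
    """Relative increase in available memory when moving from old to new."""
    return new / old - 1


# Two copies: one stays valid while the other is being updated.
double_copy = app_memory_fraction(2)
# Hypothetical scheme needing only one checkpoint-sized region.
single_copy = app_memory_fraction(1)

print(f"two copies: {double_copy} of memory for the application")
print(f"one copy:   {single_copy} of memory for the application")
print(f"relative gain: {float(relative_gain(double_copy, single_copy)):.0%}")
```

This idealized 50% is an upper bound under the stated assumptions; the 47% improvement reported in the abstract is a measured figure and is slightly lower.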
Self-Checkpoint: An In-Memory Checkpoint Method Using Less Space and Its Practice on Fault-Tolerant HPL
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming