Abstract
Termination detection signals the completion of an operation in a distributed system, meaning that all processors are idle and no messages are in flight. Many operations depend on it, including work stealing algorithms, dynamic data exchange, and dynamically structured computations. As supercomputers grow and each job becomes more likely to encounter faults, high-performance computing applications that rely on termination detection need an algorithm that tolerates those faults. We provide a trio of new practical fault tolerance schemes for a standard approach to termination detection. The schemes are easy to implement, present low overhead in both theory and practice, and have scalable costs when recovering from faults. They tolerate all single-process faults and are probabilistically tolerant of faults affecting multiple processes. We combine the theoretical failure probabilities calculated for each algorithm with historical fault records from real machines to show that these algorithms have excellent overall survivability.
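The completion condition stated above (all processors idle, no messages in flight) can be illustrated with a minimal sketch. This is not the paper's algorithm or any of its fault tolerance schemes; it only evaluates the condition on a single consistent global snapshot, which a real distributed detector has to work to obtain. The names `ProcessState` and `has_terminated` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ProcessState:
    """Hypothetical per-process record taken from one consistent snapshot."""
    idle: bool       # True if the process has no local work left
    sent: int        # total messages this process has sent
    received: int    # total messages this process has received


def has_terminated(states: Iterable[ProcessState]) -> bool:
    """Return True if every process is idle and no message is in flight.

    Assumption: `states` describes a consistent global snapshot. Counters
    read at different times on different processes can give a wrong answer.
    """
    states = list(states)
    all_idle = all(s.idle for s in states)
    # Messages sent but not yet received are still in flight.
    in_flight = sum(s.sent for s in states) - sum(s.received for s in states)
    return all_idle and in_flight == 0
```

In this sketch a message counts as in flight when the global sent total exceeds the global received total, so a system whose processes are all idle is still not terminated while those totals differ.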
Adoption protocols for fanout-optimal fault-tolerant termination detection
PPoPP '13: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming