research-article

Adoption protocols for fanout-optimal fault-tolerant termination detection

Published: 23 February 2013

Abstract

Termination detection is relevant for signaling completion (all processors are idle and no messages are in flight) of many operations in distributed systems, including work stealing algorithms, dynamic data exchange, and dynamically structured computations. In the face of growing supercomputers with increasing likelihood that each job may encounter faults, it is important for high-performance computing applications that rely on termination detection that such an algorithm be able to tolerate the inevitable faults. We provide a trio of new practical fault tolerance schemes for a standard approach to termination detection that are easy to implement, present low overhead in both theory and practice, and have scalable costs when recovering from faults. These schemes tolerate all single-process faults, and are probabilistically tolerant of faults affecting multiple processes. We combine the theoretical failure probabilities we can calculate for each algorithm with historical fault records from real machines to show that these algorithms have excellent overall survivability.
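
The completion condition stated in the abstract (all processors idle and no messages in flight) can be illustrated with a minimal sketch. This is not one of the paper's fault tolerance schemes, and it is not a distributed detection protocol: it only evaluates the predicate over a global snapshot that is assumed to be consistent. The names `ProcessState` and `is_terminated`, and the use of per-process sent and received counters, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ProcessState:
    """Hypothetical per-process record in a consistent global snapshot."""
    idle: bool      # True if the process has no local work left
    sent: int       # total messages this process has sent
    received: int   # total messages this process has received


def is_terminated(snapshot: Iterable[ProcessState]) -> bool:
    """Return True if every process is idle and no message is in flight.

    Messages in flight are counted as total sent minus total received,
    which is only meaningful when the snapshot is consistent.
    """
    states = list(snapshot)
    all_idle = all(p.idle for p in states)
    in_flight = sum(p.sent for p in states) - sum(p.received for p in states)
    return all_idle and in_flight == 0


if __name__ == "__main__":
    done = [ProcessState(True, 2, 1), ProcessState(True, 1, 2)]
    pending = [ProcessState(True, 3, 2), ProcessState(True, 0, 0)]
    print(is_terminated(done))     # all idle, sent == received
    print(is_terminated(pending))  # one message still in flight
```

A real detector has to establish this predicate without a global view and, in the setting the abstract describes, while processes may fail. That is the part this sketch deliberately leaves out.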

Published in

ACM SIGPLAN Notices, Volume 48, Issue 8 (PPoPP '13)
August 2013
309 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2517327

PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
February 2013
332 pages
ISBN: 9781450319225
DOI: 10.1145/2442516

Copyright © 2013 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
