research-article

Accurate application progress analysis for large-scale parallel debugging

Published: 09 June 2014

Abstract

Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, so a fault in one task can quickly propagate to other tasks, making the application difficult to debug. Finding the least-progressed tasks can significantly reduce the effort needed to identify the task where the fault originated. However, existing approaches for detecting these tasks suffer from low accuracy and large overheads: they either use imprecise static analysis or cannot infer progress dependence inside loops. We present Prodometer, a loop-aware progress-dependence analysis tool that determines relative progress among parallel tasks via dynamic analysis. Our fault-injection experiments suggest that its accuracy and precision are over 90% in most cases and that it scales well up to 16,384 MPI tasks. Further, our case study shows that it significantly helped in diagnosing a perplexing error in MPI that manifested only at large scale.



Published in

ACM SIGPLAN Notices, Volume 49, Issue 6
PLDI '14
June 2014
598 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2666356
Editor: Andy Gill

PLDI '14: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2014
619 pages
ISBN: 9781450327848
DOI: 10.1145/2594291

Copyright © 2014 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
