Abstract
Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, so a fault in one task can quickly propagate to other tasks, which makes the root cause difficult to locate. Finding the least-progressed tasks can significantly reduce the effort needed to identify the task where the fault originated. However, existing approaches for detecting them suffer from low accuracy and large overheads: they either rely on imprecise static analysis or cannot infer progress dependence inside loops. We present Prodometer, a loop-aware progress-dependence analysis tool that determines relative progress among parallel tasks via dynamic analysis. Our fault-injection experiments suggest that its accuracy and precision are over 90% in most cases and that it scales well up to 16,384 MPI tasks. Further, our case study shows that it significantly helped diagnose a perplexing error in MPI that manifested only at large scale.
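To make the notion of a "least-progressed task" concrete, here is a minimal toy sketch. It is not Prodometer's algorithm, which the abstract does not describe in detail. It assumes, purely for illustration, that each task's progress has already been summarized as a pair (loop iteration count, position within the loop body), so that progress can be compared lexicographically. The names `least_progressed`, `group_by_progress`, and `snapshot` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical progress record (an assumption, not taken from the paper):
# (iterations completed in the enclosing loop, position within the loop body).
Progress = Tuple[int, int]


def least_progressed(progress_by_task: Dict[int, Progress]) -> List[int]:
    """Return the ranks of the tasks whose progress record is minimal."""
    if not progress_by_task:
        return []
    minimum = min(progress_by_task.values())  # lexicographic tuple comparison
    return sorted(rank for rank, p in progress_by_task.items() if p == minimum)


def group_by_progress(
    progress_by_task: Dict[int, Progress],
) -> List[Tuple[Progress, List[int]]]:
    """Group task ranks by progress record, least-progressed group first."""
    groups: Dict[Progress, List[int]] = defaultdict(list)
    for rank, p in progress_by_task.items():
        groups[p].append(rank)
    return [(p, sorted(ranks)) for p, ranks in sorted(groups.items())]


# Illustrative snapshot of four tasks: task 2 is still in iteration 11,
# while the others have reached iteration 12.
snapshot = {0: (12, 3), 1: (12, 3), 2: (11, 1), 3: (12, 0)}

if __name__ == "__main__":
    print(least_progressed(snapshot))
    print(group_by_progress(snapshot))
```

In this toy model, the task that lags behind in the loop is the natural place to start looking for the origin of a fault. A real tool has to infer such ordering from observed control flow rather than receive it as input.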
Accurate application progress analysis for large-scale parallel debugging. PLDI '14: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation.