Abstract
Debugging intermittent bugs in MPI applications is challenging, and a message race, a condition in which two or more sends race to match with a receive, is one of the common root causes. Many debugging tools have been proposed to help programmers resolve such races, but the tools' runtime interference perturbs timing so that subtle races often cannot be reproduced while a tool is attached. We present novel noise injection techniques to expose message races even under a tool's control. We first formalize the message-race problem in the context of non-deterministic parallel applications and use this analysis to determine an effective noise-injection strategy to uncover such races. We codified these techniques in NINJA (Noise INJection Agent), which exposes these races without modification to the application. Our evaluations on synthetic cases as well as a real-world bug in Hypre-2.10.1 show that NINJA significantly helps expose races.
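To make the idea of a message race concrete, the following is a toy model, not NINJA's implementation: two senders target one wildcard receive, and the receive matches whichever message arrives first. The function names (`match_order`, `explore`), the time units, and the uniform delay distribution are all assumptions made for illustration. The sketch shows that an injected delay can flip the matching order only when it is large enough to overcome the natural timing gap between the sends.

```python
import random


def match_order(send_times, delays):
    """Return sender ranks in the order a wildcard receive would match them.

    send_times: dict rank -> time the send is issued (arbitrary units)
    delays:     dict rank -> injected noise added before the send (hypothetical)
    """
    arrival = {rank: t + delays.get(rank, 0.0) for rank, t in send_times.items()}
    # Earliest arrival matches first; ties are broken by rank for determinism.
    return sorted(arrival, key=lambda rank: (arrival[rank], rank))


def explore(send_times, max_delay, trials, seed=0):
    """Collect the distinct matching orders observed under random injected delays."""
    rng = random.Random(seed)
    orders = set()
    for _ in range(trials):
        delays = {rank: rng.uniform(0.0, max_delay) for rank in send_times}
        orders.add(tuple(match_order(send_times, delays)))
    return orders


# Rank 1 normally sends 5 time units before rank 2, so the race stays hidden.
send_times = {1: 0.0, 2: 5.0}
baseline = match_order(send_times, {})           # no noise
perturbed = match_order(send_times, {1: 10.0})   # delay rank 1 past rank 2
small_noise = explore(send_times, max_delay=1.0, trials=200)
large_noise = explore(send_times, max_delay=20.0, trials=200)
```

In this model, noise smaller than the 5-unit gap never changes the outcome, which mirrors why a race can stay hidden across many ordinary runs; noise larger than the gap surfaces the alternative matching order.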
Noise Injection Techniques to Expose Subtle and Unintended Message Races
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming