Abstract
Average programmers struggle to solve performance problems in OpenMP programs with tasks and parallel for-loops. Existing performance analysis tools visualize OpenMP task performance from the runtime system's perspective, where task execution is interleaved with other tasks in an unpredictable order. Problems with OpenMP parallel for-loops are similarly difficult to resolve since tools only visualize aggregate thread-level statistics such as load imbalance without zooming in to per-chunk granularity. This runtime-system- and thread-oriented visualization provides poor support for understanding problems with task and chunk execution time, parallelism, and memory hierarchy utilization, forcing average programmers to rely on experts or on tedious trial-and-error tuning. We present grain graphs, a new OpenMP performance analysis method that visualizes grains -- the computation performed by a single task or parallel for-loop chunk instance -- and highlights problems such as low parallelism, work inflation, and poor parallelization benefit at the grain level. We demonstrate that grain graphs can quickly reveal performance problems in standard OpenMP programs that are difficult to detect and characterize in fine detail using existing visualizations, simplifying OpenMP performance analysis. This enables average programmers to make portable optimizations to poorly performing OpenMP programs, reducing pressure on experts and removing the need for tedious trial-and-error tuning.
Grain graphs: OpenMP performance analysis made easy. In PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.