Grain graphs: OpenMP performance analysis made easy

Published: 27 February 2016

Abstract

Average programmers struggle to solve performance problems in OpenMP programs with tasks and parallel for-loops. Existing performance analysis tools visualize OpenMP task performance from the runtime system's perspective where task execution is interleaved with other tasks in an unpredictable order. Problems with OpenMP parallel for-loops are similarly difficult to resolve since tools only visualize aggregate thread-level statistics such as load imbalance without zooming into a per-chunk granularity. The runtime system/threads oriented visualization provides poor support for understanding problems with task and chunk execution time, parallelism, and memory hierarchy utilization, forcing average programmers to rely on experts or use tedious trial-and-error tuning methods for performance. We present grain graphs, a new OpenMP performance analysis method that visualizes grains -- computation performed by a task or a parallel for-loop chunk instance -- and highlights problems such as low parallelism, work inflation and poor parallelization benefit at the grain level. We demonstrate that grain graphs can quickly reveal performance problems that are difficult to detect and characterize in fine detail using existing visualizations in standard OpenMP programs, simplifying OpenMP performance analysis. This enables average programmers to make portable optimizations for poor performing OpenMP programs, reducing pressure on experts and removing the need for tedious trial-and-error tuning.


Published in

ACM SIGPLAN Notices, Volume 51, Issue 8 (PPoPP '16), August 2016, 405 pages.
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3016078

PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2016, 420 pages.
ISBN: 9781450340922
DOI: 10.1145/2851141

Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: research-article
