
Kremlin: rethinking and rebooting gprof for the multicore age

Published: 04 June 2011

Abstract

Many recent parallelization tools lower the barrier for parallelizing a program, but overlook one of the first questions that a programmer needs to answer: which parts of the program should I spend time parallelizing?

This paper examines Kremlin, an automatic tool that, given a serial version of a program, recommends which regions (e.g., loops or functions) of the program the user should attack first. Kremlin introduces a novel hierarchical critical path analysis and develops a new metric for estimating the potential of parallelizing a region: self-parallelism. We further introduce the concept of a parallelism planner, which provides the programmer with a ranked list of specific regions that are likely to have the largest performance impact when parallelized. Kremlin supports multiple planner personalities, which allow the planner to more effectively target a particular programming environment or class of machine.

We demonstrate the effectiveness of one such personality, an OpenMP planner, by comparing versions of programs that are parallelized according to Kremlin's plan against third-party manually parallelized versions. The results show that Kremlin's OpenMP planner is highly effective, producing plans whose performance is typically comparable to, and sometimes much better than, manual parallelization. At the same time, these plans would require that the user parallelize significantly fewer regions of the program.
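
The abstract does not define self-parallelism or the planner's ranking rule, so the sketch below is only an illustration of the general idea of a planner that orders regions by expected whole-program impact. It uses Amdahl's law as a stand-in estimate, not Kremlin's actual metric. The `Region` type, its `coverage` and `parallelism` fields, the function names, and all sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A candidate region (loop or function); all fields are assumed inputs."""
    name: str           # hypothetical label for the region
    coverage: float     # fraction of serial run time spent in the region (0..1)
    parallelism: float  # assumed estimate of parallelism available in the region (>= 1)


def estimated_speedup(region: Region, cores: int) -> float:
    """Amdahl's-law estimate of whole-program speedup if only this region is parallelized."""
    usable = min(region.parallelism, cores)  # cannot use more parallelism than cores
    return 1.0 / ((1.0 - region.coverage) + region.coverage / usable)


def plan(regions, cores):
    """Return regions ordered from largest to smallest estimated speedup."""
    return sorted(regions, key=lambda r: estimated_speedup(r, cores), reverse=True)


# Made-up profile data, for illustration only.
regions = [
    Region("loop_a", coverage=0.50, parallelism=2.0),
    Region("loop_b", coverage=0.30, parallelism=64.0),
    Region("func_c", coverage=0.10, parallelism=8.0),
]

ranked = [r.name for r in plan(regions, cores=8)]
print(ranked)
```

In this toy data, `loop_a` covers the most run time but has little parallelism to exploit, so a ranking based on expected impact places `loop_b` ahead of it. A gprof-style ranking by run time alone would make the opposite choice, which is the gap the abstract says a parallelism planner is meant to close.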

