DOI: 10.1145/1542476.1542496 · PLDI Conference Proceedings · Research Article

Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping

Published: 15 June 2009

ABSTRACT

Compiler-based auto-parallelization is a much-studied area, yet it has still not found widespread application. This is largely due to the poor exploitation of application parallelism, which results in performance levels far below those a skilled expert programmer could achieve. We have identified two weaknesses in traditional parallelizing compilers and propose a novel, integrated approach that yields significant performance improvements in the generated parallel code. Using profile-driven parallelism detection, we overcome the limitations of static analysis, enabling us to identify more application parallelism while relying on the user only for final approval. In addition, we replace the traditional target-specific and inflexible mapping heuristics with a machine-learning based prediction mechanism, resulting in better mapping decisions while providing more scope for adaptation to different target architectures. We have evaluated our parallelization strategy against the NAS and SPEC OMP benchmarks on two different multi-core platforms (a dual quad-core Intel Xeon SMP and a dual-socket QS20 Cell blade). We demonstrate that our approach not only yields significant improvements over state-of-the-art parallelizing compilers, but comes close to and sometimes exceeds the performance of manually parallelized codes. On average, our methodology achieves 96% of the performance of the hand-tuned OpenMP NAS and SPEC parallel benchmarks on the Intel Xeon platform and gains a significant speedup on the IBM Cell platform, demonstrating the potential of profile-guided and machine-learning based parallelization for complex multi-core platforms.
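The profile-driven parallelism detection described above can be illustrated with a minimal sketch: record each loop iteration's memory reads and writes at run time and check whether any iteration reads a location written by an earlier iteration (a cross-iteration flow dependence). This is a hedged, simplified illustration of the general idea, not the paper's actual instrumentation; the function and access-trace representation below are hypothetical.

```python
# Hypothetical sketch of profile-driven dependence detection: a loop whose
# observed iterations never read a location written by an earlier iteration
# is a candidate for parallelization (subject to user approval, since the
# evidence is only for the profiled input).

def loop_is_parallel_candidate(iterations, reads_of, writes_of):
    """Return True if no cross-iteration flow dependence is observed.

    reads_of(i) / writes_of(i) yield the abstract addresses (here, simple
    tuples) that iteration i reads and writes.
    """
    last_writer = {}  # address -> iteration that last wrote it
    for i in iterations:
        for addr in reads_of(i):
            if addr in last_writer and last_writer[addr] != i:
                return False  # earlier write, later read: loop-carried dependence
        for addr in writes_of(i):
            last_writer[addr] = i
    return True

# Example traces: a[i] = b[i] + 1 carries no dependence,
# while a[i] = a[i-1] + 1 does.
n = 8
independent = loop_is_parallel_candidate(
    range(n), lambda i: [("b", i)], lambda i: [("a", i)])
recurrence = loop_is_parallel_candidate(
    range(1, n), lambda i: [("a", i - 1)], lambda i: [("a", i)])
```

Here `independent` comes out `True` and `recurrence` comes out `False`; a real profiler would instrument actual load/store addresses and, as the abstract notes, would still defer the final correctness decision to the user because profiling covers only the observed inputs.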

