Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping

ABSTRACT
Compiler-based auto-parallelization is a much-studied area, yet it has still not found widespread application. This is largely due to the poor exploitation of application parallelism, which results in performance levels far below those a skilled expert programmer could achieve. We have identified two weaknesses in traditional parallelizing compilers and propose a novel, integrated approach that yields significant performance improvements in the generated parallel code. Using profile-driven parallelism detection, we overcome the limitations of static analysis, enabling us to identify more application parallelism and to rely on the user only for final approval. In addition, we replace the traditional target-specific and inflexible mapping heuristics with a machine-learning-based prediction mechanism, resulting in better mapping decisions while providing more scope for adaptation to different target architectures. We have evaluated our parallelization strategy on the NAS and SPEC OMP benchmarks and two different multi-core platforms (a dual quad-core Intel Xeon SMP and a dual-socket QS20 Cell blade). We demonstrate that our approach not only yields significant improvements over state-of-the-art parallelizing compilers, but also comes close to and sometimes exceeds the performance of manually parallelized codes. On average, our methodology achieves 96% of the performance of the hand-tuned OpenMP NAS and SPEC parallel benchmarks on the Intel Xeon platform and delivers a significant speedup on the IBM Cell platform, demonstrating the potential of profile-guided, machine-learning-based parallelization for complex multi-core platforms.
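The core idea behind profile-driven parallelism detection is to observe the memory accesses of a loop during a sequential profiling run and flag the loop as a parallelization candidate only if no cross-iteration dependences are observed. The sketch below is a minimal illustration of that idea, not the authors' implementation: `profile_loop` and the address representation are hypothetical, and a real tool would instrument compiled code rather than call a Python closure per iteration.

```python
# Minimal sketch of profile-driven dependence detection (illustrative only).
# body(i) models one loop iteration and returns (reads, writes) as sets of
# abstract addresses, e.g. ('a', i) for element i of array a.

def profile_loop(n_iters, body):
    written = {}        # address -> first iteration that wrote it
    read = {}           # address -> first iteration that read it
    dependent = False
    for i in range(n_iters):
        reads, writes = body(i)
        for a in reads:
            if a in written and written[a] != i:
                dependent = True    # flow (read-after-write) dependence
        for a in writes:
            if a in read and read[a] != i:
                dependent = True    # anti (write-after-read) dependence
            if a in written and written[a] != i:
                dependent = True    # output (write-after-write) dependence
            written.setdefault(a, i)
        for a in reads:
            read.setdefault(a, i)
    return not dependent            # True => candidate DOALL loop

# a[i] = b[i] + 1: every iteration touches disjoint elements -> candidate.
independent = profile_loop(8, lambda i: ({('b', i)}, {('a', i)}))
# a[i] = a[i-1] + 1: iteration i reads what iteration i-1 wrote -> rejected.
chained = profile_loop(8, lambda i: ({('a', i - 1)}, {('a', i)}))
```

Because the evidence comes from one profiling input, such a verdict is necessarily unsound in general, which is why the approach described in the abstract still relies on the user for final approval before a candidate loop is parallelized.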