ABSTRACT
The efficient mapping of program parallelism to multi-core processors is highly dependent on the underlying architecture. This paper proposes a portable, automatic, compiler-based approach to mapping such parallelism using machine learning. It develops two predictors, a data-sensitive and a data-insensitive one, to select the best mapping for parallel programs. They predict the number of threads and the scheduling policy for any given program using a model learnt off-line. Using low-cost profiling runs, they predict the mapping for a new, unseen program across multiple input data sets. We evaluate our approach by selecting parallelism mapping configurations for OpenMP programs on two representative but different multi-core platforms (the Intel Xeon and the Cell processors). The performance of our technique is stable across programs and architectures. On average, it delivers above 96% of the maximum available performance on both platforms. It achieves, on average, a 37% (and up to 17.5x) performance improvement over the OpenMP runtime's default scheme on the Cell platform. Compared to two recent prediction models, our predictors achieve better performance at a significantly lower profiling cost.
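The workflow the abstract describes lends itself to a short sketch: train a classifier off-line on (profiling features, best mapping) pairs from training programs, then predict a (thread count, scheduling policy) pair for an unseen program from a single low-cost profiling run. The sketch below is illustrative only; the feature names, the SVM classifier, and the combined "threads:schedule" label encoding are assumptions for exposition, not the paper's exact formulation.

```python
# Minimal sketch of an off-line-trained mapping predictor (assumptions
# noted above; this is not the authors' exact model or feature set).
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Off-line phase: features gathered from low-cost profiling runs of
# training programs, labelled with the best (threads, schedule) mapping
# found by searching the configuration space once, ahead of time.
X_train = [
    # [cycles_per_instr, branch_miss_rate, load_store_ratio, loop_trip_count]
    [0.9, 0.02, 0.31, 1_000_000],
    [1.7, 0.08, 0.45, 2_048],
    [1.2, 0.01, 0.28, 500_000],
]
y_train = ["8:STATIC", "2:DYNAMIC", "8:GUIDED"]  # "threads:schedule" labels

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)

# Deployment: one cheap profiling run of the new, unseen program yields
# its feature vector; the model predicts the mapping without any search.
new_program_features = [[1.1, 0.03, 0.33, 750_000]]
threads, schedule = model.predict(new_program_features)[0].split(":")
print(f"predicted mapping: {threads} threads, schedule({schedule.lower()})")
```

The design point worth noting is where the cost lands: the expensive search over mappings happens once, off-line, when training labels are generated, so the only per-program cost at deployment is the single profiling run that produces the feature vector.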