DOI: 10.1145/1504176.1504189 · PPoPP Conference Proceedings · Research article

Mapping parallelism to multi-cores: a machine learning based approach

Published: 14 February 2009

ABSTRACT

The efficient mapping of program parallelism to multi-core processors is highly dependent on the underlying architecture. This paper proposes a portable, automatic, compiler-based approach to mapping such parallelism using machine learning. It develops two predictors, a data-sensitive and a data-insensitive predictor, to select the best mapping for parallel programs. They predict the number of threads and the scheduling policy for any given program using a model learnt off-line. Using low-cost profiling runs, they predict the mapping for a new, unseen program across multiple input data sets. We evaluate our approach by selecting parallelism mapping configurations for OpenMP programs on two representative but different multi-core platforms (the Intel Xeon and the Cell processors). The performance of our technique is stable across programs and architectures. On average, it delivers above 96% of the maximum available performance on both platforms. It achieves, on average, a 37% (up to 17.5 times) performance improvement over the OpenMP runtime default scheme on the Cell platform. Compared to two recent prediction models, our predictors achieve better performance with a significantly lower profiling cost.
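To make the high-level idea concrete, here is a minimal, hedged sketch of the general scheme the abstract describes: an off-line-trained model that maps low-cost profiling features of an unseen program to a parallelism mapping (thread count plus OpenMP scheduling policy). The feature names, the training data, and the 1-nearest-neighbour rule are all illustrative assumptions for this sketch; they are not the paper's actual predictors or features.

```python
import math

# Off-line "training" data: profiling features -> best-known mapping.
# Features (hypothetical): (parallel fraction, cache miss rate, branch rate).
# Mappings: (number of threads, OpenMP scheduling policy).
TRAINING = [
    ((0.95, 0.02, 0.10), (8, "static")),
    ((0.60, 0.15, 0.05), (4, "dynamic")),
    ((0.80, 0.08, 0.20), (8, "guided")),
    ((0.30, 0.25, 0.12), (2, "dynamic")),
]

def predict_mapping(features):
    """Predict (threads, schedule) for an unseen program by returning the
    mapping of the nearest training program in feature space."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, mapping = min(TRAINING, key=lambda item: dist(item[0], features))
    return mapping

# Features collected from a few cheap profiling runs of a new program:
print(predict_mapping((0.90, 0.03, 0.11)))  # -> (8, 'static')
```

In the paper's setting the learned model replaces this toy nearest-neighbour lookup, and the predicted mapping is then passed to the OpenMP runtime (e.g. via the thread count and `schedule` clause).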

