research-article

Portable performance on heterogeneous architectures

Published: 16 March 2013

Abstract

Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now the graphics coprocessor (GPU), not just the primary CPU. But GPU programming and memory models differ dramatically from those of conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage patterns, and achieving the best overall performance may require mapping portions of programs across all types of resources in the machine.

To address the problem of efficiently programming machines with increasingly heterogeneous computational resources, we propose a programming model in which the best mapping of programs to processors and memories is determined empirically. Programs define choices in how their individual algorithms may work, and the compiler generates further choices in how they can map to CPU and GPU processors and memory systems. These choices are given to an empirical autotuning framework that allows the space of possible implementations to be searched at installation time. The rich choice space allows the autotuner to construct poly-algorithms that combine many different algorithmic techniques, using both the CPU and the GPU, to obtain better performance than any one technique alone. Experimental results show that algorithmic changes, and the varied use of both CPUs and GPUs, are necessary to obtain up to a 16.5x speedup over using a single program configuration for all architectures.
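
The idea of exposing algorithmic choices and letting an empirical search pick among them can be illustrated with a small sketch. This is not the paper's language or compiler: it is a plain Python, CPU-only toy in which the only "choice" is the size cutoff at which a merge sort hands off to insertion sort. The names `poly_sort`, `autotune`, `candidates`, and `make_input` are hypothetical and introduced only for illustration. The system described in the abstract additionally searches over CPU/GPU and memory mappings, which this sketch does not attempt.

```python
import random
import time


def insertion_sort(xs):
    """Simple O(n^2) sort; often fast on very small inputs."""
    out = list(xs)
    for i in range(1, len(out)):
        key = out[i]
        j = i - 1
        while j >= 0 and out[j] > key:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = key
    return out


def merge(a, b):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def poly_sort(xs, cutoff):
    """Poly-algorithm: merge sort above `cutoff` elements, insertion sort at or below it."""
    if len(xs) <= max(cutoff, 1):
        return insertion_sort(xs)
    mid = len(xs) // 2
    return merge(poly_sort(xs[:mid], cutoff), poly_sort(xs[mid:], cutoff))


def autotune(candidates, make_input, trials=3):
    """Time each candidate cutoff on the current machine and return the fastest one."""
    best_cutoff, best_time = None, float("inf")
    for cutoff in candidates:
        elapsed = 0.0
        for _ in range(trials):
            data = make_input()
            start = time.perf_counter()
            poly_sort(data, cutoff)
            elapsed += time.perf_counter() - start
        if elapsed < best_time:
            best_cutoff, best_time = cutoff, elapsed
    return best_cutoff


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical search space; a real autotuner explores far richer configurations.
    cutoff = autotune([1, 8, 32, 128], lambda: [rng.random() for _ in range(2000)])
    print("cutoff selected on this machine:", cutoff)
```

The selected cutoff depends on the machine running the search, which is the point: the same program text yields a different tuned configuration on different hardware, rather than one fixed configuration for all architectures.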



      Published in

        ACM SIGPLAN Notices, Volume 48, Issue 4
        ASPLOS '13
        April 2013
        540 pages
        ISSN: 0362-1340
        EISSN: 1558-1160
        DOI: 10.1145/2499368

        ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
        March 2013
        574 pages
        ISBN: 9781450318709
        DOI: 10.1145/2451116

        Copyright © 2013 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

