research-article

Sponge: portable stream programming on graphics engines

Published: 05 March 2011
Abstract

Graphics processing units (GPUs) provide a low-cost platform for accelerating high-performance computations. The introduction of new programming languages, such as CUDA and OpenCL, makes GPU programming attractive to a wide variety of programmers. However, programming GPUs is still a cumbersome task for two primary reasons: tedious performance optimizations and lack of portability. First, optimizing an algorithm for a specific GPU is a time-consuming task that requires a thorough understanding of both the algorithm and the underlying hardware. Unoptimized CUDA programs typically achieve only a small fraction of the peak GPU performance. Second, GPU code lacks efficient portability, as code written for one GPU can be inefficient when executed on another. Moving code from one GPU to another while maintaining the desired performance is a non-trivial task, often requiring significant modifications to account for the hardware differences. In this work, we propose Sponge, a compilation framework for GPUs using synchronous data flow streaming languages. Sponge is capable of performing a wide variety of optimizations to generate efficient code for graphics engines. Sponge alleviates the problems associated with current GPU programming methods by providing portability across different generations of GPUs and CPUs, and a better abstraction of the hardware details, such as the memory hierarchy and threading model. Using streaming, we provide a write-once software paradigm and rely on the compiler to automatically create optimized CUDA code for a wide variety of GPU targets. Sponge's compiler optimizations improve the performance of the baseline CUDA implementations by an average of 3.2x.

Published in

ACM SIGPLAN Notices, Volume 46, Issue 3
ASPLOS '11
March 2011
407 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1961296

ASPLOS XVI: Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
March 2011
432 pages
ISBN: 9781450302661
DOI: 10.1145/1950365

Copyright © 2011 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

Published: 5 March 2011

Qualifiers

research-article
