research-article

Scalable framework for mapping streaming applications onto multi-GPU systems

Published: 25 February 2012

Abstract

Graphics processing units leverage a large array of parallel processing cores to boost the performance of a specific streaming computation pattern frequently found in graphics applications. Unfortunately, while many other general-purpose applications do exhibit the required streaming behavior, they also have unfavorable data layouts and poor computation-to-communication ratios that penalize any straightforward execution on the GPU. In this paper we describe an efficient and scalable code generation framework that can map general-purpose streaming applications onto a multi-GPU system. This framework spans the entire core and memory hierarchy exposed by the multi-GPU system. Several key features in our framework ensure the scalability required by complex streaming applications. First, we propose an efficient stream graph partitioning algorithm that partitions the complex application to achieve the best performance under a given shared memory constraint. Next, the resulting partitions are mapped to multiple GPUs using an efficient architecture-driven strategy. The mapping balances the workload while considering the communication overhead. Finally, the partitions are executed on the multi-GPU system in a highly effective pipelined fashion. The framework has been implemented as a back-end of the StreamIt programming language compiler. Our comprehensive experiments show its scalability and a significant performance speedup compared with a previous state-of-the-art solution.
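The abstract does not spell out the mapping strategy, so the following is only a minimal sketch of the general idea of balancing workload while accounting for communication overhead. It is not the paper's algorithm. It swaps in a simple greedy heuristic (heaviest partition first), and the function name `map_partitions`, the `work` and `edges` inputs, and the additive cost model are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def map_partitions(
    work: List[float],
    edges: Dict[Tuple[int, int], float],
    num_gpus: int,
) -> List[int]:
    """Greedy sketch (hypothetical, not the authors' method).

    work[p]       -- estimated execution cost of partition p
    edges[(a, b)] -- estimated cost of moving data between partitions
                     a and b if they end up on different GPUs
    Returns a list giving the GPU index chosen for each partition.
    """
    load = [0.0] * num_gpus            # accumulated work per GPU
    assignment = [-1] * len(work)      # -1 means "not placed yet"

    # Place the heaviest partitions first so that light ones can fill gaps.
    order = sorted(range(len(work)), key=lambda p: -work[p])

    for p in order:
        best_gpu, best_cost = 0, float("inf")
        for g in range(num_gpus):
            # Communication penalty: edges to already-placed neighbours
            # that live on a different GPU than the candidate g.
            comm = 0.0
            for (a, b), c in edges.items():
                other = b if a == p else a if b == p else None
                if other is not None and assignment[other] not in (-1, g):
                    comm += c
            cost = load[g] + work[p] + comm
            if cost < best_cost:
                best_gpu, best_cost = g, cost
        assignment[p] = best_gpu
        load[best_gpu] += work[p]
    return assignment


if __name__ == "__main__":
    # Four independent partitions on two GPUs: loads balance to 5 and 5.
    print(map_partitions([4, 3, 2, 1], {}, 2))
    # Two light partitions joined by an expensive edge stay on one GPU.
    print(map_partitions([1, 1], {(0, 1): 10.0}, 2))
```

With no edges the heuristic reduces to plain load balancing; when a cross-GPU edge is expensive relative to the work it connects, the two partitions are co-located even though that leaves one GPU idle, which is the trade-off the abstract alludes to.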



    • Published in

      ACM SIGPLAN Notices, Volume 47, Issue 8
      PPOPP '12
      August 2012
      334 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/2370036
      • ACM Conferences
        PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
        February 2012
        352 pages
        ISBN: 9781450311601
        DOI: 10.1145/2145816

      Copyright © 2012 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

