Abstract
Graphics processing units leverage a large array of parallel processing cores to boost the performance of a specific streaming computation pattern frequently found in graphics applications. Unfortunately, while many general-purpose applications exhibit the required streaming behavior, they also have unfavorable data layouts and poor computation-to-communication ratios that penalize any straightforward execution on the GPU. In this paper, we describe an efficient and scalable code generation framework that maps general-purpose streaming applications onto a multi-GPU system. The framework spans the entire core and memory hierarchy exposed by the multi-GPU system. Several key features ensure the scalability required by complex streaming applications. First, we propose an efficient stream graph partitioning algorithm that partitions a complex application to achieve the best performance under a given shared memory constraint. Next, the resulting partitions are mapped to multiple GPUs using an efficient architecture-driven strategy that balances the workload while accounting for communication overhead. Finally, the partitions are executed on the multi-GPU system in a highly effective pipelined fashion. The framework has been implemented as a back end of the StreamIt programming language compiler. Our comprehensive experiments show its scalability and significant speedup over a previous state-of-the-art solution.
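To make the mapping step more concrete, the following is a minimal sketch of one way to assign partitions to GPUs while weighing load balance against inter-GPU communication. It is a simple greedy heuristic chosen for illustration, not the architecture-driven strategy of the paper. The function name `map_partitions`, the inputs `work` and `traffic`, and the `link_cost` weight are all hypothetical.

```python
from typing import Dict, Tuple


def map_partitions(work: Dict[str, float],
                   traffic: Dict[Tuple[str, str], float],
                   num_gpus: int,
                   link_cost: float = 1.0) -> Dict[str, int]:
    """Greedily place each partition on the GPU with the lowest resulting cost.

    work:      estimated execution cost per partition (assumed input)
    traffic:   data volume exchanged between pairs of partitions (assumed input)
    link_cost: weight of one unit of inter-GPU traffic (assumed parameter)
    """
    load = [0.0] * num_gpus
    placement: Dict[str, int] = {}
    # Place heavy partitions first so that lighter ones can even out the load.
    for part in sorted(work, key=lambda p: (-work[p], p)):
        best_gpu, best_cost = 0, float("inf")
        for gpu in range(num_gpus):
            # Traffic to already-placed neighbors that sit on a different GPU.
            comm = sum(
                vol for (a, b), vol in traffic.items()
                if (a == part and placement.get(b, gpu) != gpu)
                or (b == part and placement.get(a, gpu) != gpu)
            )
            cost = load[gpu] + work[part] + link_cost * comm
            if cost < best_cost:
                best_gpu, best_cost = gpu, cost
        placement[part] = best_gpu
        load[best_gpu] += work[part]
    return placement


# Hypothetical four-partition pipeline A -> B -> C -> D with a heavy B-C edge.
work = {"A": 4, "B": 3, "C": 3, "D": 2}
traffic = {("A", "B"): 1, ("B", "C"): 5, ("C", "D"): 1}
placement = map_partitions(work, traffic, num_gpus=2)
```

In this toy input, the heavy B-C edge keeps those two partitions on the same GPU, while A and D share the other one, so both GPUs carry equal work. A real mapper would also have to model the memory hierarchy and the pipelined schedule, which this sketch ignores.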
Scalable framework for mapping streaming applications onto multi-GPU systems
PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming