Abstract
While graphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high-performance computations, the tedious process of performance tuning required to optimize applications is an obstacle to wider adoption of GPUs. In addition to the programmability challenges posed by the GPU's complex memory hierarchy and parallelism model, a well-known application design problem is target portability across different GPUs. However, even for a single GPU target, changing a program's input characteristics can make an already-optimized implementation perform poorly. In this work, we propose Adaptic, an adaptive input-aware compilation system that tackles this important, yet overlooked, input portability problem. With this system, programmers develop their applications in a high-level streaming language and let Adaptic undertake the difficult task of input-portable optimization and code generation. Several input-aware optimizations are introduced to make efficient use of the memory hierarchy and to customize thread composition. At runtime, a properly optimized version of the application is executed based on the actual program input. We perform a head-to-head comparison between Adaptic-generated and hand-optimized CUDA programs. The results show that Adaptic can generate code that performs on par with its hand-optimized counterparts over certain input ranges and outperforms them when the input falls outside the hand-optimized programs' "comfort zone". Furthermore, we show that these input-aware results are sustainable across different GPU targets, making it possible to write and optimize an application once and run it anywhere.
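The runtime behavior the abstract describes — the compiler emits several differently optimized versions of a program, and the one best suited to the actual input is executed — can be sketched roughly as a dispatcher over kernel variants. This is a minimal illustrative sketch, not Adaptic's actual API: the variant names, the size threshold, and the dispatch table are all hypothetical.

```python
# Hypothetical sketch of input-aware version selection: the compiler
# produces one variant per input range, and a thin runtime dispatcher
# picks a variant based on the size of the actual input.

def small_input_kernel(data):
    # Variant assumed to be tuned for small inputs
    # (e.g., fewer thread blocks, shared-memory-heavy).
    return [x * 2 for x in data]

def large_input_kernel(data):
    # Variant assumed to be tuned for large inputs
    # (e.g., many thread blocks, coalesced global loads).
    return [x * 2 for x in data]

# Dispatch table: (upper bound on input size, variant). The 64K-element
# threshold is an illustrative placeholder, not a measured crossover point.
VARIANTS = [
    (1 << 16, small_input_kernel),
    (float("inf"), large_input_kernel),
]

def dispatch(data):
    # Select the first variant whose input range covers this input.
    n = len(data)
    for limit, kernel in VARIANTS:
        if n <= limit:
            return kernel(data)

print(dispatch([1, 2, 3]))  # -> [2, 4, 6], via the small-input variant
```

Both variants compute the same result; in a real system they would differ only in how they map work onto the GPU, which is what makes selecting among them by input size safe.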
- M. M. Baskaran, J. Ramanujam, and P. Sadayappan. Automatic C-to-CUDA code generation for affine programs. In Proc. of the 19th International Conference on Compiler Construction, pages 244--263, 2010.
- I. Buck et al. Brook for GPUs: Stream computing on graphics hardware. ACM Transactions on Graphics, 23(3):777--786, Aug. 2004.
- B. Catanzaro, M. Garland, and K. Keutzer. Copperhead: Compiling an embedded data parallel language. In Proc. of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 47--56, 2011.
- B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector machine training and classification on graphics processors. In Proc. of the 25th International Conference on Machine Learning, pages 104--111, 2008.
- R. Garg and J. N. Amaral. Compiling Python to a hybrid execution environment. In Proc. of the 3rd Workshop on General Purpose Processing on Graphics Processing Units, pages 19--30, 2010.
- M. I. Gordon, W. Thies, and S. Amarasinghe. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In Proc. of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 151--162, 2006.
- M. Grossman, A. Simion, Z. Budimlić, and V. Sarkar. CnC-CUDA: Declarative programming for GPUs. In Proc. of the 23rd Workshop on Languages and Compilers for Parallel Computing, pages 230--245, 2010.
- T. Han and T. Abdelrahman. hiCUDA: High-level GPGPU programming. IEEE Transactions on Parallel and Distributed Systems, 22(1):52--61, 2010.
- C. Hong, D. Chen, W. Chen, W. Zheng, and H. Lin. MapCG: Writing parallel program portable between CPU and GPU. In Proc. of the 19th International Conference on Parallel Architectures and Compilation Techniques, pages 217--226, 2010.
- S. Hong and H. Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proc. of the 36th Annual International Symposium on Computer Architecture, pages 152--163, 2009.
- A. Hormati, Y. Choi, M. Woh, M. Kudlur, R. Rabbah, T. Mudge, and S. Mahlke. MacroSS: Macro-SIMDization of streaming applications. In Proc. of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 285--296, 2010.
- A. H. Hormati, M. Samadi, M. Woh, T. Mudge, and S. Mahlke. Sponge: Portable stream programming on graphics engines. In Proc. of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 381--392, 2011.
- M. Kudlur and S. Mahlke. Orchestrating the execution of stream programs on multicore platforms. In Proc. of the 2008 Conference on Programming Language Design and Implementation, pages 114--124, June 2008.
- S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proc. of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 101--110, 2009.
- V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking the 100x GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. In Proc. of the 37th Annual International Symposium on Computer Architecture, pages 451--460, 2010.
- A. Leung, O. Lhoták, and G. Lashari. Automatic parallelization for graphics processing units. In Proc. of the 7th International Conference on Principles and Practice of Programming in Java, pages 91--100, 2009.
- X. Li, M. J. Garzarán, and D. Padua. A dynamically tuned sorting library. In Proc. of the 2004 International Symposium on Code Generation and Optimization, pages 111--, 2004.
- Y. Liu, E. Z. Zhang, and X. Shen. A cross-input adaptive framework for GPU program optimizations. In Proc. of the 2009 IEEE International Symposium on Parallel and Distributed Processing, pages 1--10, 2009.
- J. Meng and K. Skadron. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs. In Proc. of the 2009 International Conference on Supercomputing, pages 256--265, 2009.
- NVIDIA. GPUs Are Only Up To 14 Times Faster than CPUs says Intel, 2010. http://blogs.nvidia.com/ntersect/2010/06/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel.html.
- A. Prasad, J. Anantpur, and R. Govindarajan. Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors. In Proc. of the 2011 Conference on Programming Language Design and Implementation, pages 152--163, 2011.
- V. T. Ravi, W. Ma, D. Chiu, and G. Agrawal. Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations. In Proc. of the 2010 International Conference on Supercomputing, pages 137--146, 2010.
- D. Roger, U. Assarsson, and N. Holzschuch. Efficient stream reduction on the GPU. In Proc. of the 1st Workshop on General Purpose Processing on Graphics Processing Units, pages 1--4, 2007.
- S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proc. of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 73--82, 2008.
- D. Tarditi, S. Puri, and J. Oglesby. Accelerator: Using data parallelism to program GPUs for general-purpose uses. In Proc. of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 325--335, 2006.
- W. Thies and S. Amarasinghe. An empirical characterization of stream programs and its implications for language and compiler design. In Proc. of the 19th International Conference on Parallel Architectures and Compilation Techniques, pages 365--376, 2010.
- W. Thies, M. Karczmarek, and S. P. Amarasinghe. StreamIt: A language for streaming applications. In Proc. of the 2002 International Conference on Compiler Construction, pages 179--196, 2002.
- N. Thomas, G. Tanase, O. Tkachyshyn, J. Perdue, N. M. Amato, and L. Rauchwerger. A framework for adaptive algorithm selection in STAPL. In Proc. of the 10th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 277--288, 2005.
- K. Tian, Y. Jiang, E. Z. Zhang, and X. Shen. An input-centric paradigm for program dynamic optimizations. In Proc. of OOPSLA '10, pages 125--139, 2010.
- A. Udupa, R. Govindarajan, and M. J. Thazhuthaveetil. Software pipelined execution of stream programs on GPUs. In Proc. of the 2009 International Symposium on Code Generation and Optimization, pages 200--209, 2009.
- M. Wolfe. Implementing the PGI accelerator model. In Proc. of the 3rd Workshop on General Purpose Processing on Graphics Processing Units, pages 43--50, 2010.
- X.-L. Wu, N. Obeid, and W.-M. Hwu. Exploiting more parallelism from applications having generalized reductions on GPU architectures. In Proc. of the 10th International Conference on Computer and Information Technology, pages 1175--1180, 2010.
- Y. Yang, P. Xiang, J. Kong, and H. Zhou. A GPGPU compiler for memory optimization and parallelism management. In Proc. of the 2010 Conference on Programming Language Design and Implementation, pages 86--97, 2010.
- S.-Z. Ueng, M. Lathara, S. S. Baghsorkhi, and W.-m. W. Hwu. CUDA-Lite: Reducing GPU programming complexity. In Proc. of the 21st Workshop on Languages and Compilers for Parallel Computing, pages 1--15, 2008.
Adaptive input-aware compilation for graphics engines
PLDI '12: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation