Adaptive input-aware compilation for graphics engines

Published: 11 June 2012

Abstract

While graphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high-performance computations, the tedious process of performance tuning required to optimize applications is an obstacle to wider adoption of GPUs. In addition to the programmability challenges posed by a GPU's complex memory hierarchy and parallelism model, a well-known application design problem is target portability across different GPUs. However, even for a single GPU target, changing a program's input characteristics can make an already-optimized implementation perform poorly. In this work, we propose Adaptic, an adaptive input-aware compilation system that tackles this important, yet overlooked, input portability problem. Using this system, programmers develop their applications in a high-level streaming language and let Adaptic undertake the difficult task of input-portable optimization and code generation. Several input-aware optimizations are introduced to make efficient use of the memory hierarchy and to customize thread composition. At runtime, a properly optimized version of the application is executed based on the actual program input. We perform a head-to-head comparison between Adaptic-generated and hand-optimized CUDA programs. The results show that Adaptic is capable of generating code that performs on par with its hand-optimized counterparts over certain input ranges and outperforms them when the input falls outside the hand-optimized programs' "comfort zone". Furthermore, we show that these input-aware results are sustained across different GPU targets, making it possible to write and optimize applications once and run them anywhere.
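The runtime behavior the abstract describes — precompiling several input-tuned versions of a kernel and dispatching to one of them based on the actual input — can be sketched as follows. This is a minimal illustration only, not Adaptic's actual implementation: the `AdaptiveKernel` class, the size threshold, and the two reduction variants are hypothetical stand-ins for differently tuned GPU kernels.

```python
# Hypothetical sketch of input-aware variant selection (not the paper's
# actual mechanism): several versions of a kernel, each tuned for a range
# of input sizes, are registered at compile time; at run time the
# dispatcher picks the version whose range covers the actual input.
import bisect

class AdaptiveKernel:
    def __init__(self, variants):
        # variants: list of (upper_bound_exclusive, fn), sorted by bound.
        self.bounds = [b for b, _ in variants]
        self.fns = [fn for _, fn in variants]

    def __call__(self, data):
        # Find the first variant whose size bound covers len(data).
        i = bisect.bisect_right(self.bounds, len(data))
        i = min(i, len(self.fns) - 1)  # fall back to the largest-input variant
        return self.fns[i](data)

# Illustrative "variants": in a real system these would be differently
# tuned GPU kernels (e.g. different thread compositions or memory layouts).
def small_sum(xs):
    # Variant tuned for small inputs.
    return sum(xs)

def large_sum(xs):
    # Variant tuned for large inputs (stand-in for, say, a tree reduction).
    total = 0
    for x in xs:
        total += x
    return total

# Inputs shorter than 1024 elements use small_sum; everything else, large_sum.
reduce_kernel = AdaptiveKernel([(1024, small_sum), (float("inf"), large_sum)])
```

Either variant computes the same result; only the (hypothetical) performance profile differs, which is the point of selecting among them by input size rather than fixing one version at compile time.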



Published in

ACM SIGPLAN Notices, Volume 47, Issue 6 (PLDI '12), June 2012, 534 pages.
ISSN: 0362-1340; EISSN: 1558-1160; DOI: 10.1145/2345156

PLDI '12: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2012, 572 pages.
ISBN: 9781450312059; DOI: 10.1145/2254064

Copyright © 2012 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: research-article
