Scaling Performance via Self-Tuning Approximation for Graphics Engines

Published: 29 August 2014
Abstract

Approximate computing, in which computation accuracy is traded off for better performance or higher data throughput, is one solution that can help data processing keep pace with the current and growing abundance of information. In particular domains, such as multimedia and learning algorithms, approximation is already in common use today. We consider automation essential to providing transparent approximation, and we show that larger benefits can be achieved by tailoring the approximation techniques to the underlying hardware. Our target platform is the GPU because of its high performance capabilities and because its difficult programming challenges can be alleviated with proper automation. Our approach—SAGE—combines a static compiler that automatically generates a set of CUDA kernels with varying levels of approximation with a runtime system that iteratively selects among the available kernels to achieve speedup while adhering to a target output quality set by the user. The SAGE compiler employs three optimization techniques to generate approximate kernels that exploit the GPU microarchitecture: selective discarding of atomic operations, data packing, and thread fusion. Across a set of machine learning and image processing kernels, SAGE's approximation yields an average of 2.5× speedup with less than 10% quality loss compared to the accurate execution on an NVIDIA GTX 560 GPU.
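The compiler/runtime split described above can be sketched in plain Python. This is a hypothetical illustration, not SAGE's actual implementation: `exact_sum`, `make_perforated_sum`, and the `quality` metric are stand-ins for the paper's CUDA kernel variants and output-quality evaluation, and the selection loop is a simplified stand-in for SAGE's iterative tuning.

```python
# Hypothetical sketch of a SAGE-style runtime: pick the most aggressive
# approximate kernel variant whose measured output quality meets the target.

def exact_sum(xs):
    # The precise "kernel": the quality reference.
    return sum(xs)

def make_perforated_sum(stride):
    # Approximate variant: sample every `stride`-th element and rescale,
    # loosely analogous to discarding work for speed.
    def kernel(xs):
        sampled = xs[::stride]
        return sum(sampled) * len(xs) / len(sampled)
    return kernel

def quality(approx, exact):
    # Output quality as 1 - relative error, clamped to [0, 1].
    if exact == 0:
        return 1.0 if approx == 0 else 0.0
    return max(0.0, 1.0 - abs(approx - exact) / abs(exact))

def tune(xs, variants, target_quality):
    """Greedily select among kernel variants, most aggressive first.

    An occasional exact run provides the quality reference; the exact
    kernel is the final fallback, so a result is always produced.
    """
    exact = exact_sum(xs)
    for kernel in variants:
        approx = kernel(xs)
        if quality(approx, exact) >= target_quality:
            return approx
    return exact

# Variants ordered from most to least approximate.
variants = [make_perforated_sum(8), make_perforated_sum(4), make_perforated_sum(2)]
data = list(range(1, 1001))
result = tune(data, variants, target_quality=0.90)
```

On this data the stride-8 variant already meets the 90% quality target, so the loop stops at the fastest (most approximate) kernel — mirroring the paper's goal of maximizing speedup subject to a user-set quality bound.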



Published in ACM Transactions on Computer Systems, Volume 32, Issue 3 (September 2014), 76 pages.
ISSN: 0734-2071, EISSN: 1557-7333, DOI: 10.1145/2666140

Copyright © 2014 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Publication History
• Received: 1 April 2014
• Revised: 1 April 2014
• Accepted: 1 May 2014
• Published: 29 August 2014

