Abstract
Approximate computing, where computation accuracy is traded off for better performance or higher data throughput, is one solution that can help data processing keep pace with the current and growing abundance of information. For particular domains, such as multimedia and learning algorithms, approximation is commonly used today. We consider automation to be essential to provide transparent approximation, and we show that larger benefits can be achieved by constructing the approximation techniques to fit the underlying hardware. Our target platform is the GPU because of its high performance capabilities and its difficult programming challenges, which can be alleviated with proper automation. Our approach, SAGE, combines a static compiler that automatically generates a set of CUDA kernels with varying levels of approximation with a runtime system that iteratively selects among the available kernels to achieve speedup while adhering to a target output quality set by the user. The SAGE compiler employs three optimization techniques to generate approximate kernels that exploit the GPU microarchitecture: selective discarding of atomic operations, data packing, and thread fusion. Across a set of machine learning and image processing kernels, SAGE's approximation yields an average of 2.5× speedup with less than 10% quality loss compared to the accurate execution on an NVIDIA GTX 560 GPU.
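The runtime idea described above, choosing among kernels of increasing approximation while respecting a user-set quality target, can be illustrated with a minimal Python sketch. This is not SAGE's actual tuning algorithm: the greedy search, the `KernelVariant` type, the quality metric (one minus mean relative error), and the toy kernels are all assumptions made for illustration, and ordinary Python functions stand in for the generated CUDA kernels.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class KernelVariant:
    """Hypothetical stand-in for one compiler-generated kernel."""
    name: str
    run: Callable[[Sequence[float]], List[float]]


def output_quality(exact: Sequence[float], approx: Sequence[float]) -> float:
    """Assumed metric: 1 - mean relative error, clamped to [0, 1]."""
    errors = [abs(a - e) / (abs(e) or 1.0) for e, a in zip(exact, approx)]
    return max(0.0, 1.0 - sum(errors) / len(errors))


def select_kernel(variants: List[KernelVariant],
                  sample: Sequence[float],
                  target_quality: float) -> KernelVariant:
    """Greedy selection (an assumption, not the paper's method).

    `variants` is ordered from exact (index 0) to most aggressive.
    Walk toward more aggressive variants and stop at the first one
    whose measured quality on `sample` falls below the target.
    """
    exact_out = variants[0].run(sample)
    chosen = variants[0]
    for candidate in variants[1:]:
        if output_quality(exact_out, candidate.run(sample)) < target_quality:
            break
        chosen = candidate
    return chosen


# Toy kernels: the approximate ones scale the exact result to mimic error.
VARIANTS = [
    KernelVariant("exact", lambda xs: [2.0 * x for x in xs]),
    KernelVariant("mild", lambda xs: [2.0 * x * 0.95 for x in xs]),
    KernelVariant("aggressive", lambda xs: [2.0 * x * 0.80 for x in xs]),
]
SAMPLE = [1.0, 2.0, 4.0, 8.0]

if __name__ == "__main__":
    print(select_kernel(VARIANTS, SAMPLE, target_quality=0.90).name)
```

With a 90% quality target, the sketch accepts the mildly approximate variant and rejects the aggressive one; a real runtime would also need to re-check quality periodically as inputs change.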
Scaling Performance via Self-Tuning Approximation for Graphics Engines