Abstract
General-purpose computing on graphics processing units (GPGPUs) is an attractive option for accelerating applications with massively data-parallel tasks. While the performance of modern GPGPUs is increasing rapidly, the power consumption of these devices is becoming a major concern. In particular, the execution units and the register file are among the top three most power-hungry components in GPGPUs. In this work, we exploit trivial instructions to reduce power consumption in GPGPUs.
Trivial instructions are instructions whose results can be obtained without computation, e.g., multiplication by one. We found that a GPGPU executes many trivial instructions over the course of a program's execution, and that executing them wastes power. We propose trivial bypassing, which skips the execution of trivial instructions and avoids allocating resources to them. By power gating execution units and skipping trivial computations, trivial bypassing reduces both static and dynamic power. It also reduces the dynamic energy of the register file by avoiding register file accesses for the source and/or destination operands of trivial instructions. Although trivial bypassing reduces the energy of GPGPUs, it hurts performance, because a power-gated execution unit requires several cycles to resume normal operation. Conventional warp schedulers are oblivious to the status of execution units, so we propose a new warp scheduler that prioritizes warps based on the availability of execution units. We also propose a set of new power management techniques to further reduce the performance penalty of power gating. To increase the energy savings of trivial bypassing, we additionally propose approximating the operands of instructions. We offer a set of new techniques that approximate both integer and floating-point instructions and thereby enlarge the pool of trivial instructions. Our evaluations using a diverse set of benchmarks reveal that the proposed techniques reduce the energy of execution units by 11.2% and the dynamic energy of the register file by 12.2% with minimal performance and quality degradation.
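To make the idea of a trivial instruction concrete, the following is a minimal software sketch, not the hardware mechanism evaluated in the paper. It assumes an illustrative two-operation subset (`add`, `mul`), and the names `Instr`, `trivial_result`, `approximate`, `execute`, and the tolerance parameter `tol` are all hypothetical. The snap-to-0-or-1 rule is one possible approximation policy chosen for illustration; the abstract does not specify the actual techniques.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    op: str    # "add" or "mul" (illustrative subset, an assumption)
    a: float   # first source operand
    b: float   # second source operand


def trivial_result(ins: Instr) -> Optional[float]:
    """Return the result if it follows from the operands alone, else None."""
    if ins.op == "mul":
        if ins.a == 0 or ins.b == 0:
            return 0.0          # x * 0 (finite operands assumed)
        if ins.a == 1:
            return ins.b        # 1 * x
        if ins.b == 1:
            return ins.a        # x * 1
    elif ins.op == "add":
        if ins.a == 0:
            return ins.b        # 0 + x
        if ins.b == 0:
            return ins.a        # x + 0
    return None


def approximate(x: float, tol: float) -> float:
    """Hypothetical policy: snap an operand to 0 or 1 when within tol."""
    for target in (0.0, 1.0):
        if abs(x - target) <= tol:
            return target
    return x


def execute(ins: Instr, tol: float = 0.0) -> Tuple[float, bool]:
    """Return (result, bypassed). With tol == 0 only exact trivial cases bypass."""
    snapped = Instr(ins.op, approximate(ins.a, tol), approximate(ins.b, tol))
    result = trivial_result(snapped)
    if result is not None:
        return result, True     # the execution unit would not be used
    # Non-trivial: compute normally with the original, unapproximated operands.
    if ins.op == "mul":
        return ins.a * ins.b, False
    if ins.op == "add":
        return ins.a + ins.b, False
    raise ValueError(f"unsupported op: {ins.op}")
```

With `tol=0.0`, only exactly trivial instructions are bypassed. A nonzero `tol` trades result accuracy for a larger pool of bypassable instructions, which mirrors the energy-versus-quality trade-off described above. The zero-multiplication shortcut in this sketch ignores infinities, NaNs, and signed zeros.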
Reducing Energy in GPGPUs through Approximate Trivial Bypassing