
Reducing Energy in GPGPUs through Approximate Trivial Bypassing

Published: 04 January 2021

Abstract

General-purpose computing on graphics processing units (GPGPUs) is an attractive option for accelerating applications with massively data-parallel tasks. While the performance of modern GPGPUs is increasing rapidly, the power consumption of these devices is becoming a major concern. In particular, the execution units and the register file are among the top three most power-hungry components in GPGPUs. In this work, we exploit trivial instructions to reduce power consumption in GPGPUs.

Trivial instructions are instructions that do not need computation, e.g., multiplication by one. We found that, during the course of a program's execution, a GPGPU executes many trivial instructions, and executing them wastes power unnecessarily. In this work, we propose trivial bypassing, which skips the execution of trivial instructions and avoids unnecessarily allocating resources to them. By power gating execution units and skipping trivial computations, trivial bypassing reduces both static and dynamic power. Trivial bypassing also reduces the dynamic energy of the register file by avoiding register-file accesses for the source and/or destination operands of trivial instructions. While trivial bypassing reduces the energy of GPGPUs, it has a detrimental impact on performance, as a power-gated execution unit requires several cycles to resume normal operation. Conventional warp schedulers are oblivious to the status of execution units, so we propose a new warp scheduler that prioritizes warps based on the availability of execution units. We also propose a set of new power management techniques to further reduce the performance penalty of power gating. To increase the energy savings of trivial bypassing, we additionally propose approximating the operands of instructions, offering a set of new techniques that approximate both integer and floating-point instructions and enlarge the pool of trivial instructions. Our evaluations on a diverse set of benchmarks reveal that the proposed techniques reduce the energy of the execution units by 11.2% and the dynamic energy of the register file by 12.2% with minimal performance and quality degradation.
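To illustrate the core idea, the following is a minimal sketch of trivial-instruction detection as described above: an instruction is trivial when one operand makes the result known without computation (e.g., x * 1 = x, x + 0 = x, x * 0 = 0), and approximate bypassing widens the match by treating an operand within a small threshold of the identity (or absorbing) value as if it were exact. All names and the threshold mechanism here are illustrative assumptions, not the paper's hardware design.

```python
def is_trivial(op, a, b, eps=0.0):
    """Return (True, result) if the instruction can bypass execution,
    (False, None) otherwise. eps = 0 gives exact trivial bypassing;
    eps > 0 gives approximate trivial bypassing."""
    if op == "mul":
        if abs(a - 1.0) <= eps:
            return True, b                    # 1 * b -> b
        if abs(b - 1.0) <= eps:
            return True, a                    # a * 1 -> a
        if abs(a) <= eps or abs(b) <= eps:
            return True, 0.0                  # anything * 0 -> 0
    elif op == "add":
        if abs(a) <= eps:
            return True, b                    # 0 + b -> b
        if abs(b) <= eps:
            return True, a                    # a + 0 -> a
    elif op == "div":
        if abs(b - 1.0) <= eps:
            return True, a                    # a / 1 -> a
    return False, None

# Exact bypassing: multiplication by one needs no execution unit.
print(is_trivial("mul", 1.0, 42.0))              # (True, 42.0)

# Approximate bypassing: 0.999 is treated as 1 within eps = 0.01,
# enlarging the pool of trivial instructions at a small accuracy cost.
print(is_trivial("mul", 0.999, 42.0, eps=0.01))  # (True, 42.0)
```

In hardware, a check like this would run on operand values before issue, so that a detected trivial instruction forwards its result directly instead of occupying an execution unit and register-file ports.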

