research-article

Approximate Cache in GPGPUs

Published: 26 September 2020

Abstract

There is a growing number of application domains, ranging from multimedia to machine learning, in which a certain level of inexactness can be tolerated. For these applications, approximate computing is an effective technique that trades off some loss in output data integrity for energy and/or performance gains. In this article, we present the approximate cache, which approximates similar values to save energy in the L2 cache of general-purpose graphics processing units (GPGPUs). The L2 cache is a critical component in the memory hierarchy of GPGPUs, as it accommodates the data of thousands of simultaneously executing threads. Simply increasing the size of the L2 cache is not a viable way to keep up with the growing size of data in many-core applications.

This work is motivated by the observation that threads within a warp write values into memory that are arithmetically similar. We exploit this property and propose low-cost, implementation-efficient hardware that trades accuracy for energy. The approximate cache identifies similar values at runtime and, when similarity is detected, allows only one thread to write into the cache. Since the approximate cache is able to pack more data into a smaller space, it enables downsizing of the data array with negligible impact on cache misses and lower-level memory. The approximate cache reduces both dynamic and static energy. By storing a thread's data into a cache block, each memory instruction accesses fewer cache cells, thus reducing dynamic energy. In addition, the approximate cache increases the frequency of bank idleness; by power gating idle banks, static energy is reduced as well. Our evaluations reveal that the approximate cache reduces energy by 52% with minimal quality degradation while maintaining the performance of a diverse set of GPGPU applications.
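The core idea above, detecting when the lanes of a warp store arithmetically similar values and keeping only one representative, can be illustrated with a small sketch. This is not the paper's hardware logic; the function name, relative-tolerance threshold, and tuple encoding are all illustrative assumptions used only to show the packing decision.

```python
# Hypothetical sketch of the similarity check an approximate cache might
# apply to a warp's store values. All names and the 5% threshold are
# illustrative, not taken from the article.
def pack_warp_store(values, rel_tol=0.05):
    """If every lane's value lies within rel_tol of the first lane's
    value, keep only the representative value plus a 'packed' tag;
    otherwise fall back to storing every lane's value precisely."""
    base = values[0]
    scale = abs(base) if base != 0 else 1.0
    if all(abs(v - base) <= rel_tol * scale for v in values):
        return ("packed", base)           # one value covers the whole warp
    return ("precise", list(values))      # no similarity: store all lanes

# A warp writing near-identical pixel intensities packs into one entry;
# dissimilar values are stored exactly.
print(pack_warp_store([0.50, 0.51, 0.49, 0.50]))  # → ('packed', 0.5)
print(pack_warp_store([0.50, 0.90, 0.10, 0.50]))  # → ('precise', [0.5, 0.9, 0.1, 0.5])
```

In the packed case, a warp-wide store occupies a fraction of a cache block, which is what allows the data array to be downsized and more banks to sit idle.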

