Abstract
There is a growing number of application domains, ranging from multimedia to machine learning, where a certain level of inexactness can be tolerated. For these applications, approximate computing is an effective technique that trades off some loss in output data integrity for energy and/or performance gains. In this article, we present the approximate cache, which approximates similar values and saves energy in the L2 cache of general-purpose graphics processing units (GPGPUs). The L2 cache is a critical component in the memory hierarchy of GPGPUs, as it accommodates data of thousands of simultaneously executing threads. Simply increasing the size of the L2 cache is not a viable solution to keep up with the growing size of data in many-core applications.
This work is motivated by the observation that threads within a warp write values into memory that are arithmetically similar. We exploit this property and propose low-cost, implementation-efficient hardware to trade off accuracy for energy. The approximate cache identifies similar values at runtime and, when similarity is detected, allows only one thread to write into the cache. Since the approximate cache packs more data into a smaller space, it enables downsizing of the data array with negligible impact on cache misses and lower-level memory. The approximate cache reduces both dynamic and static energy. Because similar values are stored only once, each memory instruction accesses fewer cache cells, reducing dynamic energy. In addition, the approximate cache increases the frequency of bank idleness; by power gating idle banks, static energy is reduced. Our evaluations reveal that the approximate cache reduces energy by 52% with minimal quality degradation while maintaining the performance of a diverse set of GPGPU applications.
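The core idea above can be sketched in software. The model below is illustrative only, not the paper's hardware design: it checks whether the values written by the threads of a warp are arithmetically similar and, if so, stores a single representative value instead of one value per thread. The warp size, the relative similarity threshold, and the choice of the first thread's value as the representative are all assumptions made for this sketch.

```python
# Illustrative model of approximate value packing across a warp.
# Threshold and representative-selection policy are assumptions,
# not the mechanism described in the article.

WARP_SIZE = 32

def pack_warp_values(values, rel_tol=0.05):
    """Return (packed, stored_count).

    If every value in the warp lies within rel_tol of the first
    thread's value, store only that representative (one write for
    the whole warp); otherwise store all values exactly.
    """
    base = values[0]
    limit = abs(base) * rel_tol
    if all(abs(v - base) <= limit for v in values):
        return [base], 1                    # similar: one thread writes
    return list(values), len(values)        # dissimilar: store exactly

# Example: near-identical pixel intensities from neighbouring threads
warp = [100.0 + 0.1 * i for i in range(WARP_SIZE)]
packed, stored = pack_warp_values(warp)
print(stored)  # 1 -> far fewer cache cells written for this warp
```

In hardware, a comparison of this kind would happen at write time, so dissimilar warps fall back to exact storage and only similar warps trade accuracy for fewer cell accesses, which is the source of the dynamic-energy savings described above.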
Approximate Cache in GPGPUs