Abstract
Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file that reduces register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency because of the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and to overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density, high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.
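As a rough illustration of the key idea, the Python sketch below splits a straight-line instruction stream into intervals whose register working-set fits in a fixed number of register file cache entries, and reports the set that would be prefetched at the start of each interval. It is a simplified sketch under stated assumptions, not the paper's register-interval algorithm: it ignores control flow, and the function name `split_into_intervals`, the `capacity` parameter, and the register names are hypothetical.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, int, Set[str]]  # (start index, end index exclusive, working set)


def split_into_intervals(instrs: List[Set[str]], capacity: int) -> List[Interval]:
    """Greedily partition straight-line code into intervals.

    `instrs[i]` is the set of registers instruction i reads or writes.
    `capacity` is an assumed number of register file cache entries per warp.
    Each returned interval's working set has at most `capacity` registers,
    so it could be prefetched in full when the interval begins.
    """
    intervals: List[Interval] = []
    start = 0
    working: Set[str] = set()
    for i, regs in enumerate(instrs):
        if len(regs) > capacity:
            raise ValueError(f"instruction {i} needs more than {capacity} registers")
        merged = working | regs
        if len(merged) > capacity:
            # Working set would overflow the cache: close the current
            # interval and start a new one at this instruction.
            intervals.append((start, i, working))
            start, working = i, set(regs)
        else:
            working = merged
    if start < len(instrs):
        intervals.append((start, len(instrs), working))
    return intervals


# Hypothetical five-instruction program and a 4-entry cache partition.
program = [{"r0", "r1"}, {"r1", "r2"}, {"r3", "r4"}, {"r0", "r4"}, {"r5", "r6"}]
for begin, end, prefetch_set in split_into_intervals(program, capacity=4):
    print(f"instructions [{begin}, {end}): prefetch {sorted(prefetch_set)}")
```

In this simplified model, every register an interval touches is already in the cache once the prefetch completes, so the long latency of the main register file is paid once per interval and can be hidden by scheduling other warps in the meantime.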
LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems