Abstract
Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache.
In this article, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working set within each interval. The key idea of LTRF is to prefetch the estimated register working set from the main register file to the register file cache under software control, at the beginning of each interval, and to overlap the prefetch latency with the execution of other warps. We observe that register bank conflicts during prefetching can greatly reduce the effectiveness of LTRF. Therefore, we devise a compile-time register renumbering technique to reduce the likelihood of register bank conflicts. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density, high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 34%.
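To make the bank-conflict problem concrete, the following is a minimal sketch (not the paper's actual algorithm) of compile-time register renumbering: a warp's per-interval register working set is renumbered round-robin across banks, so a prefetch can read one register from each bank per cycle instead of serializing on a single bank. The bank count and the conflict cost model are assumptions for illustration.

```python
from collections import Counter

NUM_BANKS = 4  # assumed bank count for illustration

def renumber(working_set):
    """Map each virtual register to a new number such that consecutive
    members of the working set land in different banks (bank = number % NUM_BANKS)."""
    mapping = {}
    for i, reg in enumerate(sorted(working_set)):
        bank = i % NUM_BANKS   # spread registers round-robin over banks
        slot = i // NUM_BANKS  # row within that bank
        mapping[reg] = slot * NUM_BANKS + bank
    return mapping

def prefetch_cycles(working_set, mapping):
    """Cycles to prefetch the set, assuming one access per bank per cycle:
    conflicting accesses to the same bank serialize."""
    banks = Counter(mapping[r] % NUM_BANKS for r in working_set)
    return max(banks.values()) if banks else 0

# Example: registers 0, 4, 8, 12 would all map to bank 0 under the
# original numbering (4 serialized cycles); after renumbering they
# occupy four different banks and prefetch in a single cycle.
ws = [0, 4, 8, 12]
print(prefetch_cycles(ws, {r: r for r in ws}))  # → 4 (all conflict)
print(prefetch_cycles(ws, renumber(ws)))        # → 1 (no conflicts)
```

The point of doing this at compile time is that the interval analysis already knows the working set, so the renumbering adds no runtime hardware cost beyond the remapped register indices in the generated code.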
- Mohammad Abdel-Majeed and Murali Annavaram. 2013. Warped register file: A power efficient register file for GPGPUs. In HPCGoogle Scholar
- Mohammad Abdel-Majeed, Alireza Shafaei, Hyeran Jeon, Massoud Pedram, and Murali Annavaram. 2017. Pilot register file: Energy efficient partitioned register file for GPUs. In HPCA.Google Scholar
- Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In ISCA.Google Scholar
- Alfred V. Aho. 2003. Compilers: Principles, Techniques and Tools (for Anna University) (2nd ed.).Google Scholar
Digital Library
- Murali Annavaram, Jignesh M. Patel, and Edward S. Davidson. 2001. Data prefetching by dependence graph precomputation. In ISCA.Google Scholar
- A. J. Annunziata, M. C. Gaidis, L. Thomas, C. W. Chien, C. C. Hung, P. Chevalier, E. J. O’Sullivan, J. P. Hummel, E. A. Joseph, Y. Zhu, et al. 2011. Racetrack memory cell array with integrated magnetic tunnel junction readout. In IEDM.Google Scholar
- Hodjat Asghari Esfeden, Amirali Abdolrashidi, Shafiur Rahman, Daniel Wong, and Nael Abu-Ghazaleh. 2020. BOW: Breathing operand windows to exploit bypassing in GPUs. In MICRO.Google Scholar
- Islam Atta, Xin Tong, Vijayalakshmi Srinivasan, Ioana Baldini, and Andreas Moshovos. 2015. Self-contained, accurate precomputation prefetching. In MICRO.Google Scholar
- C. Augustine, A. Raychowdhury, B. Behin-Aein, S. Srinivasan, J. Tschanz, Vivek K. De, and K. Roy. 2011. Numerical analysis of domain wall propagation for dense memory arrays. In IEDM.Google Scholar
- Rachata Ausavarungnirun, Saugata Ghose, Onur Kayiran, Gabriel H. Loh, Chita R. Das, Mahmut T. Kandemir, and Onur Mutlu. 2015. Exploiting inter-warp heterogeneity to improve GPGPU performance. In PACT.Google Scholar
- Sung Hoon Baek and Kyu Ho Park. 2008. Prefetching with adaptive cache culling for striped disk arrays. In USENIX ATC.Google Scholar
- Ali Bakhoda, John Kim, and Tor M. Aamodt. 2010. On-chip network design considerations for compute accelerators. In PACT.Google Scholar
- Ali Bakhoda, John Kim, and Tor M. Aamodt. 2010. Throughput-effective on-chip networks for manycore accelerators. In MICRO.Google Scholar
- Ali Bakhoda, John Kim, and Tor M. Aamodt. 2013. Designing on-chip networks for throughput accelerators. ACM Trans. Arch. Code Optimiz. 10, 3 (2013), 1--35.Google Scholar
Digital Library
- Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS.Google Scholar
- M. Bakhshalipour, P. Lotfi-Kamran, A. Mazloumi, F. Samandi, M. Naderan-Tahan, M. Modarressi, and H. Sarbazi-Azad. 2018. Fast data delivery for many-core processors. IEEE Trans. Comput. 67, 10 (2018), 1416--1429.Google Scholar
Cross Ref
- M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad. 2017. An efficient temporal data prefetcher for L1 caches. IEEE CAL 16, 2 (2017), 99--102.Google Scholar
- M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad. 2018. Domino temporal data prefetcher. In HPCA.Google Scholar
- M. Bakhshalipour, M. Shakerinava, P. Lotfi-Kamran, and H. Sarbazi-Azad. 2019. Bingo spatial data prefetcher. In HPCA.Google Scholar
- Mohammad Bakhshalipour, Seyedali Tabaeiaghdaei, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2019. Evaluation of hardware data prefetchers on server processors. ACM Comput. Surv. 52, 3 (2019), 1--29.Google Scholar
Digital Library
- R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi. 2001. Reducing the complexity of the register file in dynamic superscalar processors. In MICRO.Google Scholar
- Rahul Bera, Anant V. Nori, Onur Mutlu, and Sreenivas Subramoney. 2019. Dspatch: Dual spatial pattern prefetcher. In MICRO.Google Scholar
Digital Library
- Krishna Kumar Bhuwalka, Stefan Sedlmaier, Alexandra Katherina Ludsteck, Carolin Tolksdorf, Joerg Schulze, and Ignaz Eisele. 2004. Vertical tunnel field-effect transistor. IEEE Trans. Electr. Dev. 51, 2 (2004), 279--282.Google Scholar
Cross Ref
- Eric Borch, Eric Tune, Srilatha Manne, and Joel Emer. 2002. Loose loops sink chips. In HPCA.Google Scholar
- Preston Briggs. 1992. Register Allocation via Graph Coloring. Technical Report.Google Scholar
- Jeffery A. Brown, Hong Wang, George Chrysos, Perry H. Wang, and John P. Shen. 2002. Speculative precomputation on chip multiprocessors. In Proceedings of the MTEAC.Google Scholar
- Pei Cao, Edward W. Felten, Anna R. Karlin, and Kai Li. 1996. Implementation and performance of integrated application-controlled file caching, prefetching, and disk scheduling. ACM Trans. Comput. Syst. 14, 4 (1996), 311--343.Google Scholar
Digital Library
- Benjamin Cassell, Tyler Szepesi, Jim Summers, Tim Brecht, Derek Eager, and Bernard Wong. 2018. Disk prefetching mechanisms for increasing HTTP streaming video server throughput. ACM Trans. Model. Perf. Eval. Comput. Syst. 3, 2 (2018), 1--30.Google Scholar
Digital Library
- Gregory J Chaitin, Marc A. Auslander, Ashok K. Chandra, John Cocke, Martin E. Hopkins, and Peter W. Markstein. 1981. Register allocation via coloring. Computer Languages 6, 1 (1981), 47--57.Google Scholar
Digital Library
- Robert S. Chappell, Jared Stark, Sangwook P. Kim, Steven K. Reinhardt, and Yale N. Patt. 1999. Simultaneous subordinate microthreading (SSMT). In ISCA.Google Scholar
- Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In IISWC.Google Scholar
- Tien-Fu Chen and Jean-Loup Baer. 1995. Effective hardware-based data prefetching for high-performance processors. IEEE Trans. Comput. 44, 5 (1995), 609--623.Google Scholar
Digital Library
- Nachiappan Chidambaram Nachiappan, Asit K. Mishra, Mahmut Kademir, Anand Sivasubramaniam, Onur Mutlu, and Chita R. Das. 2012. Application-aware prefetch prioritization in on-chip networks. In PACT.Google Scholar
- Trishul M. Chilimbi and Martin Hirzel. 2002. Dynamic hot data stream prefetching for general-purpose programs. In PLDI.Google Scholar
- Jamison D. Collins, Dean M. Tullsen, Hong Wang, and John P. Shen. 2001. Dynamic speculative precomputation. In MICRO.Google Scholar
- Jamison D. Collins, Hong Wang, Dean M. Tullsen, Christopher Hughes, Yong-Fong Lee, Dan Lavery, and John P. Shen. 2001. Speculative precomputation: Long-range prefetching of delinquent loads. In ISCA.Google Scholar
- Keith D. Cooper and Timothy J. Harvey. 1998. Compiler-controlled memory. In ASPLOS.Google Scholar
- Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms.Google Scholar
- J. L. Cruz, A. Gonzalez, M. Valero, and N. P. Topham. 2000. Multiple-banked register file architectures. In ISCA.Google Scholar
- Xiaoning Ding, Song Jiang, Feng Chen, Kei Davis, and Xiaodong Zhang. 2007. DiskSeen: Exploiting disk layout and access history to enhance I/O prefetch. In USENIX ATC.Google Scholar
- X. Dong, C. Xu, Y. Xie, and N. P. Jouppi. 2012. NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory. Trans. Comput.-Aid. Des. Integr. Circ. Syst. 31, 7 (2012), 994--1007.Google Scholar
Digital Library
- Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt. 2011. Prefetch-aware shared resource management for multi-core systems. In ISCA.Google Scholar
- Eiman Ebrahimi, Onur Mutlu, Chang Joo Lee, and Yale N. Patt. 2009. Coordinated control of multiple prefetchers in multi-core systems. In MICRO.Google Scholar
- Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. 2009. Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. In HPCA.Google Scholar
- Hodjat Asghari Esfeden, Farzad Khorasani, Hyeran Jeon, Daniel Wong, and Nael Abu-Ghazaleh. 2019. CORF: Coalescing operand register file for GPUs. In ASPLOS.Google Scholar
- Michael Ferdman, Cansu Kaynak, and Babak Falsafi. 2011. Proactive instruction fetch. In MICRO.Google Scholar
- Adi Fuchs, Shie Mannor, Uri Weiser, and Yoav Etsion. 2014. Loop-aware memory prefetching using code block working sets. In MICRO.Google Scholar
- S. Fukami, T. Suzuki, K. Nagahara, N. Ohshima, Y. Ozaki, S. Saito, R. Nebashi, N. Sakimura, H. Honjo, K. Mori, et al. 2009. Low-current perpendicular domain wall motion cell for scalable high-speed MRAM. In Proceedings of the Symposium on VLSI Technology (VLSIT'09). 230--231.Google Scholar
- Mark Gebhart, Daniel R. Johnson, David Tarjan, Stephen W. Keckler, William J. Dally, Erik Lindholm, and Kevin Skadron. 2011. Energy-efficient mechanisms for managing thread context in throughput processors. In ISCA.Google Scholar
- Mark Gebhart, Stephen W. Keckler, and William J. Dally. 2011. A compile-time managed multi-level register file hierarchy. In MICRO.Google Scholar
- Mark Gebhart, Stephen W. Keckler, Brucek Khailany, Ronny Krashinsky, and William J. Dally. 2012. Unifying primary cache, scratch, and register file memories in a throughput processor. In MICRO.Google Scholar
- Knuth Stener Grimsrud, James K. Archibald, and Brent E. Nelson. 1993. Multiple prefetch adaptive disk caching. IEEE Trans. Knowl. Data Eng. 5, 1 (1993), 88--103.Google Scholar
Digital Library
- SAFARI Research Group. 2018. LTRF Register-Interval-Algorithm. Retrieved from https://github.com/CMU-SAFARI/Register-Interval.Google Scholar
- Jayanth Gummaraju and Mendel Rosenblum. 2005. Stream programming on general-purpose processors. In MICRO.Google Scholar
- Sebastian Hack, Daniel Grund, and Gerhard Goos. 2006. Register allocation for programs in SSA-form. In CC.Google Scholar
- Milad Hashemi, Onur Mutlu, and Yale N. Patt. 2016. Continuous runahead: Transparent hardware acceleration for memory intensive workloads. In MICRO.Google Scholar
- Milad Hashemi and Yale N. Patt. 2015. Filtered runahead execution with a runahead buffer. In MICRO.Google Scholar
- Matthew S. Hecht. 1977. Flow Analysis of Computer Programs. Elsevier Science Inc.Google Scholar
- Chih-Chieh Hsiao, Slo-Li Chu, and Chiu-Cheng Hsieh. 2014. An adaptive thread scheduling mechanism with low-power register file for mobile GPUs. IEEE Trans. Multimedia 1, 16 (2014), 60--67.Google Scholar
Cross Ref
- Khaled Z. Ibrahim, Gregory T. Byrd, and Eric Rotenberg. 2003. Slipstream execution mode for CMP-based multiprocessors. In HPCA.Google Scholar
- Yasuo Ishii, Mary Inaba, and Kei Hiraki. 2009. Access map pattern matching for data cache prefetch. In ICS.Google Scholar
- Akanksha Jain and Calvin Lin. 2013. Linearizing irregular memory accesses for improved correlated prefetching. In MICRO.Google Scholar
- Hyunjun Jang, Jinchun Kim, Paul Gratz, Ki Hwan Yum, and Eun Jung Kim. 2015. Bandwidth-efficient on-chip interconnect designs for GPGPUs. In DAC.Google Scholar
- H. Jeon, H. A. Esfeden, N. B. Abu-Ghazaleh, D. Wong, and S. Elango. 2019. Locality-aware GPU register file. IEEE CAL 18, 2 (2019), 153--156.Google Scholar
- Hyeran Jeon, Gokul Subramanian Ravi, Nam Sung Kim, and Murali Annavaram. 2015. GPU register file virtualization. In MICRO.Google Scholar
- Nan Jiang, Daniel U. Becker, George Michelogiannakis, James Balfour, Brian Towles, David E. Shaw, John Kim, and William J. Dally. 2013. A detailed and flexible cycle-accurate network-on-chip simulator. In ISPASS.Google Scholar
- Song Jiang, Xiaoning Ding, Yuehai Xu, and Kei Davis. 2013. A prefetching scheme exploiting both data layout and access history on disk. ACM Trans. Stor. 9, 3 (2013), 1--23.Google Scholar
Digital Library
- Naifeng Jing, Li Jiang, Tao Zhang, Chao Li, Fengfeng Fan, and Xiaoyao Liang. 2016. Energy-efficient eDRAM-based on-chip storage architecture for GPGPUs. IEEE Trans. Comput. 65, 1 (2016), 122--135.Google Scholar
Digital Library
- Naifeng Jing, Haopeng Liu, Yao Lu, and Xiaoyao Liang. 2013. Compiler assisted dynamic register file in GPGPU. In ISLPED.Google Scholar
- Naifeng Jing, Yao Shen, Yao Lu, Shrikanth Ganapathy, Zhigang Mao, Minyi Guo, Ramon Canal Corretger, and Xiaoyao Liang. 2013. An energy-efficient and scalable eDRAM-based register file architecture for GPGPU. In ISCA.Google Scholar
- Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In ASPLOS.Google Scholar
- Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. Orchestrated scheduling and prefetching for GPGPUs. In ISCA.Google Scholar
- Timothy M. Jones, Michael F. P. O’Boyle, Jaume Abella, Antonio González, and Oğuz Ergin. 2009. Energy-efficient register caching with compiler assistance. ACM Trans. Arch. Code Optimiz. 6, 4 (2009), 1--23.Google Scholar
Digital Library
- Norman P. Jouppi. 1990. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In ISCA.Google Scholar
- David Kadjo, Jinchun Kim, Prabal Sharma, Reena Panda, Paul Gratz, and Daniel Jimenez. 2014. B-fetch: Branch prediction directed prefetching for chip-multiprocessors. In MICRO.Google Scholar
- Md Kamruzzaman, Steven Swanson, and Dean M. Tullsen. 2011. Inter-core prefetching for multicore processors using migrating helper threads. In ASPLOS.Google Scholar
- Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany. 2002. The imagine stream processor. In ICCD.Google Scholar
- Onur Kayıran, Adwait Jog, Mahmut Taylan Kandemir, and Chita Ranjan Das. 2013. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In PACT.Google Scholar
- Onur Kayiran, Adwait Jog, Ashutosh Pattnaik, Rachata Ausavarungnirun, Xulong Tang, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, and Chita R. Das. 2016. C-states: Fine-grained GPU datapath power management. In PACT.Google Scholar
Digital Library
- Onur Kayiran, Nachiappan Chidambaram Nachiappan, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, and Chita R. Das. 2014. Managing GPU concurrency in heterogeneous architectures. In MICRO.Google Scholar
- Farzad Khorasani, Hodjat Asghari Esfeden, Amin Farmahini-Farahani, Nuwan Jayasena, and Vivek Sarkar. 2018. Regmutex: Inter-warp GPU register time-sharing. In ISCA.Google Scholar
- Dongkeun Kim, S. S.-W. Liao, Perry H. Wang, Juan Del Cuvillo, Xinmin Tian, Xiang Zou, Hong Wang, Donald Yeung, Milind Girkar, and John Paul Shen. 2004. Physical experimentation with prefetching helper threads on Intel’s hyper-threaded processors. In CGO.Google Scholar
- Dongkeun Kim and Donald Yeung. 2002. Design and evaluation of compiler algorithms for pre-execution. In ASPLOS.Google Scholar
- J. Kim, J. Balfour, and W. Dally. 2007. Flattened butterfly topology for on-chip networks. In MICRO.Google Scholar
- Jinchun Kim, Seth H. Pugsley, Paul V. Gratz, A. L. Reddy, Chris Wilkerson, and Zeshan Chishti. 2016. Path confidence based lookahead prefetching. In MICRO.Google Scholar
- Jon Kleinberg and Eva Tardos. 2006. Algorithm Design. Pearson Education India.Google Scholar
- John Kloosterman, Jonathan Beaumont, D. Anoushe Jamshidi, Jonathan Bailey, Trevor Mudge, and Scott Mahlke. 2017. Regless: Just-in-time operand staging for GPUs. In MICRO.Google Scholar
Digital Library
- Emre Kültürsay, Mahmut Kandemir, Anand Sivasubramaniam, and Onur Mutlu. 2013. Evaluating STT-RAM as an energy-efficient main memory alternative. In ISPASS.Google Scholar
- An-Chow Lai, Cem Fide, and Babak Falsafi. 2001. Dead-block prediction 8 dead-block correlating prefetchers. In ISCA.Google Scholar
- Junjie Lai and André Seznec. 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. In CGO.Google Scholar
- N. B. Lakshminarayana and H. Kim. 2014. Spare register aware prefetching for graph algorithms on GPUs. In HPCA.Google Scholar
- Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis 8 transformation. In CGO.Google Scholar
- Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. 2009. Architecting phase change memory as a scalable dram alternative. In ISCA.Google Scholar
- Benjamin C. Lee, Ping Zhou, Jun Yang, Youtao Zhang, Bo Zhao, Engin Ipek, Onur Mutlu, and Doug Burger. 2010. Phase-change technology and the future of main memory. In IEEE MICRO.Google Scholar
- Chang Joo Lee, Onur Mutlu, Veynu Narasiman, and Yale N. Patt. 2011. Prefetch-aware memory controllers. IEEE Trans. Comput. 60, 10 (2011), 1406--1430.Google Scholar
Digital Library
- Chang Joo Lee, Veynu Narasiman, Onur Mutlu, and Yale N. Patt. 2009. Improving memory bank-level parallelism in the presence of prefetching. In MICRO.Google Scholar
- Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, and Richard Vuduc. 2010. Many-thread aware prefetching mechanisms for GPGPU applications. In MICRO.Google Scholar
- Sangpil Lee, Keunsoo Kim, Gunjae Koo, Hyeran Jeon, Won Woo Ro, and Murali Annavaram. 2015. Warped-compression: Enabling power efficient GPUs through register compression. In ISCA.Google Scholar
- Jingwen Leng, Tayler Hetherington, Ahmed Eltantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and Vijay Janapa Reddi. 2013. GPUWattch: Enabling energy optimizations in GPGPUs. In ISCA.Google Scholar
- E. R. Lewis, D. Petit, L. O’Brien, A. Fernandez-Pacheco, J. Sampaio, A. V. Jausovec, H. T. Zeng, D. E. Read, and R. P. Cowburn. 2010. Fast domain wall motion in magnetic comb structures. Nat. Mater. 9, 12 (2010), 980--983.Google Scholar
Cross Ref
- Chen Li, Rachata Ausavarungnirun, Christopher J. Rossbach, Youtao Zhang, Onur Mutlu, Yang Guo, and Jun Yang. 2019. A framework for memory oversubscription management in graphics processing units. In ASPLOS.Google Scholar
- Chao Li, Shuaiwen Leon Song, Hongwen Dai, Albert Sidelnik, Siva Kumar Sastry Hari, and Huiyang Zhou. 2015. Locality-driven dynamic GPU cache bypassing. In ICS.Google Scholar
- Zhi Li, Jingweijia Tan, and Xin Fu. 2013. Hybrid CMOS-TFET based register files for energy-efficient GPGPUs. In ISQED.Google Scholar
- Steve S. W. Liao, Perry H. Wang, Hong Wang, Gerolf Hoflehner, Daniel Lavery, and John P. Shen. 2002. Post-pass binary adaptation for software-based speculative precomputation. In PLDI.Google Scholar
- John Erik Lindholm, Ming Y. Siu, Simon S. Moy, Samuel Liu, and John R. Nickolls. 2008. Simulating multiported memories using lower port count memories. US Patent 7,339,592.Google Scholar
- Mikko H. Lipasti, William J. Schmidt, Steven R. Kunkel, and Robert R. Roediger. 1995. SPAID: Software prefetching in pointer-and call-intensive environments. In MICRO.Google Scholar
- Xiaoxiao Liu, Yong Li, Yaojun Zhang, Alex K. Jones, and Yiran Chen. 2014. STD-TLB: A STT-RAM-based dynamically-configurable translation lookaside buffer for GPU architectures. In ASP-DAC.Google Scholar
- Xiaoxiao Liu, Mengjie Mao, Xiuyuan Bi, Hai Li, and Yiran Chen. 2015. An efficient STT-RAM-based register file in GPU architectures. In ASP-DAC.Google Scholar
- Yang Liu and Wang Wei. 2014. FLAP: Flash-aware prefetching for improving SSD-based disk cache. J. Netw. 9, 10 (2014), 2766.Google Scholar
- Jiwei Lu, Howard Chen, Rao Fu, Wei-Chung Hsu, Bobbie Othmer, Pen-Chung Yew, and Dong-Yuan Chen. 2003. The performance of runtime data cache prefetching in a dynamic optimization system. In MICRO.Google Scholar
- Jiwei Lu, Abhinav Das, Wei-Chung Hsu, Khoa Nguyen, and Santosh G. Abraham. 2005. Dynamic helper threaded prefetching on the sun ultrasparc cmp processor. In MICRO.Google Scholar
- Chi-Keung Luk. 2001. Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. In ISCA.Google Scholar
- Chi-Keung Luk and Todd C. Mowry. 1996. Compiler-based prefetching for recursive data structures. In ASPLOS.Google Scholar
- A. Magni, C. Dubach, and M. F. P. O’Boyle. 2013. A large-scale cross-architecture evaluation of thread-coarsening. In SC.Google Scholar
- Mengjie Mao, Wujie Wen, Yaojun Zhang, Yiran Chen, and Hai Li. 2014. Exploration of GPGPU register file architecture using domain-wall-shift-write based racetrack memory. In DAC.Google Scholar
- Pierre Michaud. 2016. Best-offset hardware prefetching. In HPCA.Google Scholar
- Amirhossein Mirhosseini, Mohammad Sadrosadati, Fatemeh Aghamohammadi, Mehdi Modarressi, and Hamid Sarbazi-Azad. 2019. BARAN: Bimodal adaptive reconfigurable-allocator network-on-chip. ACM Trans. Parallel Comput. 5, 3 (2019), 1--29.Google Scholar
Digital Library
- A. Mirhosseini, M. Sadrosadati, A. Fakhrzadehgan, M. Modarressi, and H. Sarbazi-Azad. 2015. An energy-efficient virtual channel power-gating mechanism for on-chip networks. In DATE.Google Scholar
- Amirhossein Mirhosseini, Mohammad Sadrosadati, Behnaz Soltani, Hamid Sarbazi-Azad, and Thomas F. Wenisch. 2017. BiNoCHS: Bimodal network-on-chip for CPU-GPU heterogeneous systems. In NOCS.Google Scholar
- A. Mirhosseini, M. Sadrosadati, M. Zare, and H. Sarbazi-Azad. 2016. Quantifying the difference in resource demand among classic and modern NoC workloads. In ICCD.Google Scholar
- Amirhossein Mirhosseini, Akshitha Sriraman, and Thomas F. Wenisch. 2019. Enhancing server efficiency in the face of killer microseconds. In HPCA.Google Scholar
- Saurabh Mookerjea and Suman Datta. 2008. Comparative study of Si, Ge and InAs based steep subthreshold slope tunnel transistors for 0.25 V supply voltage logic applications. In DRC.Google Scholar
- Todd C. Mowry, Monica S. Lam, and Anoop Gupta. 1992. Design and evaluation of a compiler algorithm for prefetching. In ASPLOS.Google Scholar
- Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. 2009. CACTI 6.0: A Tool to Model Large Caches. Technical Report. HP Laboratories.Google Scholar
- G. S. Murthy, M. Ravishankar, M. M. Baskaran, and P. Sadayappan. 2010. Optimal loop unrolling for GPGPU programs. In IPDPS.Google Scholar
- Onur Mutlu, Hyesoon Kim, David N. Armstrong, and Yale N. Patt. 2005. An analysis of the performance impact of wrong-path memory references on out-of-order and runahead execution processors. IEEE Trans. Comput. 54, 12 (2005), 1556--1571.Google Scholar
Digital Library
- Onur Mutlu, Hyesoon Kim, and Yale N. Patt. 2005. Address-value delta (AVD) prediction: Increasing the effectiveness of runahead execution by exploiting regular memory allocation patterns. In MICRO.Google Scholar
- Onur Mutlu, Hyesoon Kim, and Yale N. Patt. 2005. Techniques for efficient processing in runahead execution engines. In ISCA.Google Scholar
- Onur Mutlu, Hyesoon Kim, and Yale N. Patt. 2006. Efficient runahead execution: Power-efficient memory latency tolerance. In IEEE MICRO.Google Scholar
- Onur Mutlu, Hyesoon Kim, Jared Stark, and Yale N. Patt. 2005. On reusing the results of pre-executed instructions in a runahead execution processor. In IEEE CAL.Google Scholar
- Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt. 2003. Runahead execution: An alternative to very large instruction windows for out-of-order processors. In HPCA.Google Scholar
- Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt. 2003. Runahead execution: An effective alternative to large instruction windows. In IEEE MICRO.Google Scholar
- Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. 2011. Improving GPU performance via large warps and two-level warp scheduling. In MICRO.Google Scholar
- N. Nematollahi, M. Sadrosadati, H. Falahati, M. Barkhordar, and H. Sarbazi-Azad. 2018. Neda: Supporting direct inter-core neighbor data exchange in GPUs. IEEE CAL 17, 2 (2018), 225--229.Google Scholar
- P. R. Nuth and W. J. Dally. 1995. The named-state register file: Implementation and performance. In HPCA.Google Scholar
- NVIDIA. 2014. C Programming Guide V6.5. NVIDIA.Google Scholar
- NVIDIA. 2014. White Paper: NVIDIA GeForce GTX 980. Technical Report. NVIDIA.Google Scholar
- NVIDIA. 2016. White Paper: NVIDIA Tesla P100. Technical Report. NVIDIA.Google Scholar
- David W. Oehmke, Nathan L. Binkert, Trevor Mudge, and Steven K. Reinhardt. 2005. How to fake 1000 registers. In MICRO.Google Scholar
- Lois Orosa, Rodolfo Azevedo, and Onur Mutlu. 2018. AVPP: Address-first value-next predictor with value prefetching for improving the efficiency of load value prediction. ACM Trans. Arch. Code Optimiz. 15, 4 (2018), 1--30.Google Scholar
Digital Library
- Stuart S. P. Parkin, Masamitsu Hayashi, and Luc Thomas. 2008. Magnetic domain-wall racetrack memory. Science (2008).Google Scholar
- Anjul Patney and William J. Dally. 2013. Conflict-free register allocation using a multi-bank register file with input operand alignment. US Patent 8,555,035.Google Scholar
- R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, and Jim Zelenka. 1995. Informed prefetching and caching. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles.Google Scholar
Digital Library
- Gennady Pekhimenko, Evgeny Bolotin, Mike O’Connor, Onur Mutlu, Todd C. Mowry, and Stephen W. Keckler. 2015. Toggle-aware compression for GPUs. IEEE CAL 14, 2 (2015), 164--168.Google Scholar
- Gennady Pekhimenko, Evgeny Bolotin, Nandita Vijaykumar, Onur Mutlu, Todd C. Mowry, and Stephen W. Keckler. 2016. A case for toggle-aware compression for GPU systems. In HPCA.Google Scholar
- Massimiliano Poletto and Vivek Sarkar. 1999. Linear scan register allocation. ACM Trans. Program. Lang. Syst. 21, 5 (1999), 895--913.Google Scholar
Digital Library
- Seth H. Pugsley, Zeshan Chishti, Chris Wilkerson, Peng-fei Chuang, Robert L. Scott, Aamer Jaleel, Shih-Lien Lu, Kingsum Chow, and Rajeev Balasubramonian. 2014. Sandbox prefetching: Safe run-time evaluation of aggressive prefetchers. In HPCA.Google Scholar
- Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, and Jude A. Rivers. 2009. Scalable high performance main memory system using phase-change memory technology. In ISCA.Google Scholar
- Tanausú Ramírez, Alex Pajuelo, Oliverio J Santana, Onur Mutlu, and Mateo Valero. 2010. Efficient runahead threads. In PACT.Google Scholar
- William M. Reddick and Gehan A. J. Amaratunga. 1995. Silicon surface tunnel transistor. Appl. Phys. Lett. 67, 4 (1995), 494--496.Google Scholar
Cross Ref
- Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens. 2000. Memory access scheduling. In ISCA.Google Scholar
- Timothy G. Rogers, Mike O’Connor, and Tor M. Aamodt. 2012. Cache-conscious wavefront scheduling. In MICRO.Google Scholar
- Amir Roth and Gurindar S. Sohi. 1999. Effective jump-pointer prefetching for linked data structures. In ISCA.Google Scholar
- Richard M. Russell. 1978. The CRAY-1 computer system. Commun. ACM 21, 1 (1978), 63--72.Google Scholar
Digital Library
- Mohammad Sadrosadati, Seyed Borna Ehsani, Hajar Falahati, Rachata Ausavarungnirun, Arash Tavakkol, Mojtaba Abaee, Lois Orosa, Yaohua Wang, Hamid Sarbazi-Azad, and Onur Mutlu. 2019. ITAP: Idle-time-aware power management for GPU execution units. ACM Trans. Arch. Code Optimiz. 16, 1 (2019), 1--26.Google Scholar
Digital Library
- M. Sadrosadati, A. Mirhosseini, H. Aghilinasab, and H. Sarbazi-Azad. 2015. An efficient DVS scheme for on-chip networks using reconfigurable Virtual Channel allocators. In ISLPED.Google Scholar
- Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, Hamid Sarbazi-Azad, Mario Drumond, Babak Falsafi, Rachata Ausavarungnirun, and Onur Mutlu. 2018. LTRF: Enabling high-capacity register files for GPUs via hardware/software cooperative register prefetching. In ASPLOS.Google Scholar
- Mohammad Sadrosadati, Amirhossein Mirhosseini, Shahin Roozkhosh, Hazhir Bakhishi, and Hamid Sarbazi-Azad. 2017. Effective cache bank placement for GPUs. In DATE.Google Scholar
- Mohammad Hossein Samavatian, Hamed Abbasitabar, Mohammad Arjomand, and Hamid Sarbazi-Azad. 2014. An efficient STT-RAM last level cache architecture for GPUs. In DAC.Google Scholar
- Mohammad Hossein Samavatian, Mohammad Arjomand, Ramin Bashizade, and Hamid Sarbazi-Azad. 2015. Architecting the last-level cache for GPUs using STT-RAM technology. ACM Trans. Des. Autom. Electr. Syst. 20, 4 (2015), 1--24.Google Scholar
Digital Library
- Ankit Sethia, Ganesh Dasika, Mehrzad Samadi, and Scott Mahlke. 2013. APOGEE: Adaptive prefetching on GPUs for energy efficiency. In PACT.Google Scholar
- Ankit Sethia and Scott Mahlke. 2014. Equalizer: Dynamic tuning of gpu resources for efficient execution. In MICRO.Google Scholar
- Mrigank Sharad, Rangharajan Venkatesan, Anand Raghunathan, and Kaushik Roy. 2013. Multi-level magnetic RAM using domain wall shift for energy-efficient, high-density caches. In ISLPED.Google Scholar
- Ahmad Sharif and Hsien-Hsin S. Lee. 2011. Data prefetching by exploiting global and local access patterns. ACM J. Instr. Level Parallel. 13 (2011), 1--17.Google Scholar
- Ryota Shioya, Kazuo Horio, Masahiro Goshima, and Shuichi Sakai. 2010. Register cache system not for latency reduction purpose. In MICRO.Google Scholar
- Jawar Singh, Krishnan Ramakrishnan, S. Mookerjea, Suman Datta, Narayanan Vijaykrishnan, and D. Pradhan. 2010. A novel si-tunnel FET based SRAM design for ultra low-power 0.3V VDD applications. In ASP-DAC.Google Scholar
- Yan Solihin, Jaejin Lee, and Josep Torrellas. 2002. Using a user-level memory thread for correlation prefetching. In ISCA.Google Scholar
- Stephen Somogyi, Thomas F. Wenisch, Anastasia Ailamaki, and Babak Falsafi. 2009. Spatio-temporal memory streaming. In ISCA.Google Scholar
- Stephen Somogyi, Thomas F. Wenisch, Anastassia Ailamaki, Babak Falsafi, and Andreas Moshovos. 2006. Spatial memory streaming. In ISCA.Google Scholar
- Seung Woo Son and Mahmut Kandemir. 2006. Energy-aware data prefetching for multi-speed disks. In CFC.Google Scholar
- Minseok Song. 2007. Energy-aware data prefetching for multi-speed disks in video servers. In ACM MM.Google Scholar
- Santhosh Srinath, Onur Mutlu, Hyesoon Kim, and Yale N. Patt. 2007. Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers. In HPCA.Google Scholar
- John A. Stratton, Christopher Rodrigues, I.-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W. Hwu. 2012. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Technical Report. Center for Reliable and High-Performance Computing, UIUC.Google Scholar
- Jim Summers, Tim Brecht, Derek Eager, Tyler Szepesi, Ben Cassell, and Bernard Wong. 2014. Automated control of aggressive prefetching for HTTP streaming video servers. In SYSTOR.Google Scholar
- Karthik Sundaramoorthy, Zach Purser, and Eric Rotenberg. 2000. Slipstream processors: Improving both performance and fault tolerance. In ASPLOS.Google Scholar
- J. A. Swensen and Y. N. Patt. 1988. Hierarchical registers for scientific computers. In ICS.Google Scholar
- Arash Tavakkol, Aasheesh Kolli, Stanko Novakovic, Kaveh Razavi, Juan Gomez-Luna, Hasan Hassan, Claude Barthels, Yaohua Wang, Mohammad Sadrosadati, Saugata Ghose, Ankit Singla, Pratap Subrahmanyam, and Onur Mutlu. 2018. Enabling efficient RDMA-based synchronous mirroring of persistent memory transactions. arxiv:cs.DC/1810.09360. Retrieved from https://arxiv.org/abs/1810.09360.Google Scholar
- Luc Thomas, Rai Moriya, Charles Rettner, and Stuart S. P. Parkin. 2010. Dynamics of magnetic domain walls under their own inertia. Science 330, 6012 (2010), 1810--1813.Google Scholar
- Yingying Tian, Sooraj Puthoor, Joseph L. Greathouse, Bradford M. Beckmann, and Daniel A. Jiménez. 2015. Adaptive GPU cache bypassing. In Proceedings of the 8th Workshop on General Purpose Processing using GPUS. 25--35.Google Scholar
- Steve VanDeBogart, Christopher Frost, and Eddie Kohler. 2009. Reducing seek overhead with application-directed prefetching. In ATC.Google Scholar
- Rangharajan Venkatesan, Shankar Ganesh Ramasubramanian, Swagath Venkataramani, Kaushik Roy, and Anand Raghunathan. 2014. Stag: Spintronic-tape architecture for GPGPU cache hierarchies. In ISCA.Google Scholar
- Rangharajan Venkatesan, Mrigank Sharad, Kaushik Roy, and Anand Raghunathan. 2013. DWM-TAPESTRI: An energy-efficient all-spin cache using domain wall shift based writes. In DATE.Google Scholar
- Nandita Vijaykumar, Eiman Ebrahimi, Kevin Hsieh, Phillip B. Gibbons, and Onur Mutlu. 2018. The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs. In ISCA.Google Scholar
- N. Vijaykumar, K. Hsieh, G. Pekhimenko, S. Khan, A. Shrestha, S. Ghose, A. Jog, P. B. Gibbons, and O. Mutlu. 2016. Zorua: A holistic approach to resource virtualization in GPUs. In MICRO.Google Scholar
- Nandita Vijaykumar, Abhilasha Jain, Diptesh Majumdar, Kevin Hsieh, Gennady Pekhimenko, Eiman Ebrahimi, Nastaran Hajinazar, Phillip B. Gibbons, and Onur Mutlu. 2018. A case for richer cross-layer abstractions: Bridging the semantic gap with expressive memory. In ISCA.Google Scholar
- Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, and Onur Mutlu. 2015. A case for core-assisted bottleneck acceleration in GPUs: Enabling flexible data compression with assist warps. In ISCA.Google Scholar
- Jue Wang and Yuan Xie. 2015. A write-aware STTRAM-based register file architecture for GPGPU. ACM JETC 12, 1 (2015), 1--12.Google Scholar
- Peng-Fei Wang. 2003. Complementary Tunneling-FETs (CTFET) in CMOS Technology. Ph.D. Dissertation. Technische Universität München, Universitätsbibliothek.Google Scholar
- Zhenlin Wang, Doug Burger, Kathryn S. McKinley, Steven K. Reinhardt, and Charles C. Weems. 2003. Guided region prefetching: A cooperative hardware/software approach. In ISCA.Google Scholar
- Thomas F. Wenisch, Michael Ferdman, Anastasia Ailamaki, Babak Falsafi, and Andreas Moshovos. 2010. Making address-correlated prefetching practical. In IEEE MICRO.Google Scholar
- Xiaolong Xie, Yun Liang, Xiuhong Li, Yudong Wu, Guangyu Sun, Tao Wang, and Dongrui Fan. 2015. Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. In MICRO.Google Scholar
- Xiaolong Xie, Yun Liang, Guangyu Sun, and Deming Chen. 2013. An efficient compiler framework for cache bypassing on GPUs. In ICCAD.Google Scholar
- Yi Yang, Ping Xiang, Jingfei Kong, Mike Mantor, and Huiyang Zhou. 2012. A unified optimizing compiler framework for different GPGPU architectures. ACM Trans. Arch. Code Optimiz. 9, 2 (2012), 1--33.Google Scholar
- Wing-kei S. Yu, Ruirui Huang, Sarah Q. Xu, Sung-En Wang, Edwin Kan, and G. Edward Suh. 2011. SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading. In ISCA.Google Scholar
- Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, and Srinivas Devadas. 2015. IMP: Indirect memory prefetcher. In MICRO.Google Scholar
- R. Yung and N. C. Wilhelm. 1995. Caching processor general registers. In ICCD.Google Scholar
- Hui Zeng and Kanad Ghose. 2006. Register file caching for energy efficiency. In ISLPED.Google Scholar
- Weifeng Zhang, Dean M. Tullsen, and Brad Calder. 2007. Accelerating and adapting precomputation threads for efficient prefetching. In HPCA.Google Scholar
- Xiaotong Zhuang and Santosh Pande. 2003. Resolving register bank conflicts for a network processor. In PACT.Google Scholar
- Craig Zilles and Gurindar Sohi. 2001. Execution-based prediction using speculative slices. In ISCA.Google Scholar
- William K. Zuravleff and Timothy Robinson. 1997. Controller for a synchronous DRAM that maximizes throughput by allowing memory requests and commands to be issued out of order. US Patent 5,630,096.Google Scholar
Highly Concurrent Latency-tolerant Register Files for GPUs