
Highly Concurrent Latency-tolerant Register Files for GPUs

Published: 04 January 2021

Abstract

Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file that reduces register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache.

In this article, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp’s aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. We observe that register bank conflicts while prefetching the registers could greatly reduce the effectiveness of LTRF. Therefore, we devise a compile-time register renumbering technique to reduce the likelihood of register bank conflicts. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 34%.
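The register-renumbering idea can be illustrated with a minimal sketch. Assuming a simple modulo bank mapping (bank = register id mod number of banks) and a prefetch unit that reads one register per bank per cycle, renumbering a warp's working-set registers round-robin across banks eliminates the stall cycles that conflicting registers would otherwise cause. The function and variable names below are illustrative, not LTRF's actual compiler pass:

```python
NUM_BANKS = 4  # assumed number of register banks

def renumber(working_set, num_banks=NUM_BANKS):
    """Assign new register ids round-robin over banks so registers
    prefetched together land in distinct banks (bank = id % num_banks)."""
    mapping = {}
    next_id_in_bank = list(range(num_banks))  # next free id per bank
    for i, reg in enumerate(working_set):
        bank = i % num_banks
        mapping[reg] = next_id_in_bank[bank]
        next_id_in_bank[bank] += num_banks
    return mapping

def conflict_stalls(reg_ids, num_banks=NUM_BANKS):
    """Count extra stall cycles when prefetching num_banks registers
    per cycle: each bank serves one access per cycle, extras stall."""
    stalls = 0
    for i in range(0, len(reg_ids), num_banks):
        banks = [r % num_banks for r in reg_ids[i:i + num_banks]]
        stalls += sum(banks.count(b) - 1 for b in set(banks))
    return stalls

# A pathological working set: all four registers map to bank 0.
ws = [0, 4, 8, 12]
m = renumber(ws)
before = conflict_stalls(ws)                 # 3 stall cycles
after = conflict_stalls([m[r] for r in ws])  # 0 stall cycles
```

Here renumbering turns a fully serialized prefetch (all registers in one bank) into a conflict-free one; the real compile-time pass must of course also respect liveness and the register budget of each interval.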

  166. Jawar Singh, Krishnan Ramakrishnan, S. Mookerjea, Suman Datta, Narayanan Vijaykrishnan, and D. Pradhan. 2010. A novel si-tunnel FET based SRAM design for ultra low-power 0.3V VDD applications. In ASP-DAC.Google ScholarGoogle Scholar
  167. Yan Solihin, Jaejin Lee, and Josep Torrellas. 2002. Using a user-level memory thread for correlation prefetching. In ISCA.Google ScholarGoogle Scholar
  168. Stephen Somogyi, Thomas F. Wenisch, Anastasia Ailamaki, and Babak Falsafi. 2009. Spatio-temporal memory streaming. In ISCA.Google ScholarGoogle Scholar
  169. Stephen Somogyi, Thomas F. Wenisch, Anastassia Ailamaki, Babak Falsafi, and Andreas Moshovos. 2006. Spatial memory streaming. In ISCA.Google ScholarGoogle Scholar
  170. Seung Woo Son and Mahmut Kandemir. 2006. Energy-aware data prefetching for multi-speed disks. In CFC.Google ScholarGoogle Scholar
  171. Minseok Song. 2007. Energy-aware data prefetching for multi-speed disks in video servers. In ACM MM.Google ScholarGoogle Scholar
  172. Santhosh Srinath, Onur Mutlu, Hyesoon Kim, and Yale N. Patt. 2007. Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers. In HPCA.Google ScholarGoogle Scholar
  173. John A. Stratton, Christopher Rodrigues, I.-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W. Hwu. 2012. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Technical Report. Center for Reliable and High-Performance Computing, UIUC.Google ScholarGoogle Scholar
  174. Jim Summers, Tim Brecht, Derek Eager, Tyler Szepesi, Ben Cassell, and Bernard Wong. 2014. Automated control of aggressive prefetching for HTTP streaming video servers. In SYSTOR.Google ScholarGoogle Scholar
  175. Karthik Sundaramoorthy, Zach Purser, and Eric Rotenberg. 2000. Slipstream processors: Improving both performance and fault tolerance. In ASPLOS.Google ScholarGoogle Scholar
  176. J. A. Swensen and Y. N. Patt. 1988. Hierarchical registers for scientific computers. In ICS.Google ScholarGoogle Scholar
  177. Arash Tavakkol, Aasheesh Kolli, Stanko Novakovic, Kaveh Razavi, Juan Gomez-Luna, Hasan Hassan, Claude Barthels, Yaohua Wang, Mohammad Sadrosadati, Saugata Ghose, Ankit Singla, Pratap Subrahmanyam, and Onur Mutlu. 2018. Enabling efficient RDMA-based synchronous mirroring of persistent memory transactions. arxiv:cs.DC/1810.09360. Retrieved from https://arxiv.org/abs/1810.09360.Google ScholarGoogle Scholar
  178. Luc Thomas, Rai Moriya, Charles Rettner, and Stuart S. P. Parkin. 2010. Dynamics of magnetic domain walls under their own inertia. Science 330, 6012 (2010), 1810--1813.Google ScholarGoogle ScholarCross RefCross Ref
  179. Yingying Tian, Sooraj Puthoor, Joseph L. Greathouse, Bradford M. Beckmann, and Daniel A. Jiménez. 2015. Adaptive GPU cache bypassing. In Proceedings of the 8th Workshop on General Purpose Processing using GPUS. 25--35.Google ScholarGoogle Scholar
  180. Steve VanDeBogart, Christopher Frost, and Eddie Kohler. 2009. Reducing seek overhead with application-directed prefetching. In ATC.Google ScholarGoogle Scholar
  181. Rangharajan Venkatesan, Shankar Ganesh Ramasubramanian, Swagath Venkataramani, Kaushik Roy, and Anand Raghunathan. 2014. Stag: Spintronic-tape architecture for GPGPU cache hierarchies. In ISCA.Google ScholarGoogle Scholar
  182. Rangharajan Venkatesan, Mrigank Sharad, Kaushik Roy, and Anand Raghunathan. 2013. DWM-TAPESTRI-an energy efficient all-spin cache using domain wall shift based writes. In DATE.Google ScholarGoogle Scholar
  183. Nandita Vijaykumar, Eiman Ebrahimi, Kevin Hsieh, Phillip B. Gibbons, and Onur Mutlu. 2018. The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs. In ISCA.Google ScholarGoogle Scholar
  184. N. Vijaykumar, K. Hsieh, G. Pekhimenko, S. Khan, A. Shrestha, S. Ghose, A. Jog, P. B. Gibbons, and O. Mutlu. 2016. Zorua: A holistic approach to resource virtualization in GPUs. In MICRO.Google ScholarGoogle Scholar
  185. Nandita Vijaykumar, Abhilasha Jain, Diptesh Majumdar, Kevin Hsieh, Gennady Pekhimenko, Eiman Ebrahimi, Nastaran Hajinazar, Phillip B. Gibbons, and Onur Mutlu. 2018. A case for richer cross-layer abstractions: Bridging the semantic gap with expressive memory. In ISCA.Google ScholarGoogle Scholar
  186. Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, and Onur Mutlu. 2015. A case for core-assisted bottleneck acceleration in GPUs: Enabling flexible data compression with assist warps. In ISCA.Google ScholarGoogle Scholar
  187. Jue Wang and Yuan Xie. 2015. A write-aware STTRAM-based register file architecture for GPGPU. ACM JETC 12, 1 (2015), 1--12.Google ScholarGoogle ScholarDigital LibraryDigital Library
  188. Peng-Fei Wang. 2003. Complementary Tunneling-FETs (CTFET) in CMOS Technology. Ph.D. Dissertation. Technische Universität München, Universitätsbibliothek.Google ScholarGoogle Scholar
  189. Zhenlin Wang, Doug Burger, Kathryn S. McKinley, Steven K. Reinhardt, and Charles C. Weems. 2003. Guided region prefetching: A cooperative hardware/software approach. In ISCA.Google ScholarGoogle Scholar
  190. Thomas F. Wenisch, Michael Ferdman, Anastasia Ailamaki, Babak Falsafi, and Andreas Moshovos. 2010. Making address-correlated prefetching practical. In IEEE MICRO.Google ScholarGoogle Scholar
  191. Xiaolong Xie, Yun Liang, Xiuhong Li, Yudong Wu, Guangyu Sun, Tao Wang, and Dongrui Fan. 2015. Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. In MICRO.Google ScholarGoogle Scholar
  192. Xiaolong Xie, Yun Liang, Guangyu Sun, and Deming Chen. 2013. An efficient compiler framework for cache bypassing on GPUs. In ICCAD.Google ScholarGoogle Scholar
  193. Yi Yang, Ping Xiang, Jingfei Kong, Mike Mantor, and Huiyang Zhou. 2012. A unified optimizing compiler framework for different GPGPU architectures. ACM Trans. Arch. Code Optimiz. 9, 2 (2012), 1--33.Google ScholarGoogle ScholarDigital LibraryDigital Library
  194. Wing-kei S. Yu, Ruirui Huang, Sarah Q. Xu, Sung-En Wang, Edwin Kan, and G. Edward Suh. 2011. SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading. In ISCA.Google ScholarGoogle Scholar
  195. Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, and Srinivas Devadas. 2015. IMP: Indirect memory prefetcher. In MICRO.Google ScholarGoogle ScholarDigital LibraryDigital Library
  196. R. Yung and N. C. Wilhelm. 1995. Caching processor general registers. In ICCD.Google ScholarGoogle Scholar
  197. Hui Zeng and Kanad Ghose. 2006. Register file caching for energy efficiency. In ISLPED.Google ScholarGoogle Scholar
  198. Weifeng Zhang, Dean M. Tullsen, and Brad Calder. 2007. Accelerating and adapting precomputation threads for effcient prefetching. In HPCA.Google ScholarGoogle Scholar
  199. Xiaotong Zhuang and Santosh Pande. 2003. Resolving register bank conflicts for a network processor. In PACT.Google ScholarGoogle Scholar
  200. Craig Zilles and Gurindar Sohi. 2001. Execution-based prediction using speculative slices. In ISCA.Google ScholarGoogle Scholar
  201. William K. Zuravleff and Timothy Robinson. 1997. Controller for a synchronous DRAM that maximizes throughput by allowing memory requests and commands to be issued out of order. US Patent 5,630,096.Google ScholarGoogle Scholar
