
NURA: A Framework for Supporting Non-Uniform Resource Accesses in GPUs

Published: 28 February 2022

Abstract

Multi-application execution in Graphics Processing Units (GPUs), a promising way to utilize GPU resources, remains challenging. Some prior work (e.g., spatial multitasking) has limited opportunity to improve resource utilization, while other work, e.g., simultaneous multi-kernel, provides fine-grained resource sharing at the price of unfair execution. This paper proposes a new multi-application paradigm for GPUs, called NURA, that offers high potential to improve resource utilization while ensuring fairness and Quality-of-Service (QoS). The key idea is that each streaming multiprocessor (SM) executes Cooperative Thread Arrays (CTAs) belonging to only one application (similar to spatial multitasking) and shares its unused resources with SMs running other applications that demand more resources. NURA handles the resource-sharing process mainly in software, providing simplicity, low hardware cost, and flexibility. We also make a few hardware modifications as architectural support for our software-based proposal. We conservatively analyze the hardware cost of our proposal and observe less than 1.07% area overhead with respect to the whole GPU die. Our experimental results over various mixes of GPU workloads show that NURA improves GPU system throughput by 26%, on average, compared to state-of-the-art spatial multitasking, while meeting the QoS target. In terms of fairness, NURA achieves results similar to spatial multitasking, while outperforming simultaneous multi-kernel by an average of 76%.
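The key idea, ownership plus lending, can be illustrated with a toy sketch. This is plain Python, not the paper's actual mechanism, and every name in it (`SM`, `borrow`, the single scalar resource pool) is a hypothetical simplification: each SM is owned by exactly one application, and an application whose own SMs are full may borrow spare capacity from SMs owned by other applications.

```python
from dataclasses import dataclass

@dataclass
class SM:
    """Toy streaming multiprocessor with one scalar resource pool
    (standing in for registers / shared memory / CTA slots)."""
    app: str        # spatial-multitasking ownership: one application per SM
    capacity: int   # total resource units on this SM
    used: int = 0   # units consumed by locally resident CTAs

    def spare(self) -> int:
        return self.capacity - self.used

def borrow(sms: list[SM], requester_app: str, demand: int) -> int:
    """NURA-style sharing sketch: when `requester_app` cannot fit a CTA on
    its own SMs, reserve spare units on SMs owned by other applications.
    Returns how many units were actually granted."""
    granted = 0
    for sm in sms:
        if sm.app != requester_app and sm.spare() > 0:
            take = min(sm.spare(), demand - granted)
            sm.used += take   # lender reserves units for the remote request
            granted += take
            if granted == demand:
                break
    return granted

# App A's SM is saturated; app B's SM has 6 spare units to lend.
sms = [SM(app="A", capacity=10, used=10), SM(app="B", capacity=10, used=4)]
print(borrow(sms, "A", 5))  # grants 5 units from B's SM
```

The point of the sketch is only the policy shape: ownership stays coarse-grained (as in spatial multitasking), while unused capacity flows across the ownership boundary (as in simultaneous multi-kernel), which is how NURA aims to get the utilization of fine-grained sharing without its fairness cost.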

