NURA: A Framework for Supporting Non-Uniform Resource Accesses in GPUs

Abstract
Multi-application execution on Graphics Processing Units (GPUs), a promising way to utilize GPU resources, is still challenging. Some prior work (e.g., spatial multitasking) has limited opportunity to improve resource utilization, while other work, e.g., simultaneous multi-kernel, provides fine-grained resource sharing at the price of unfair execution. This paper proposes a new multi-application paradigm for GPUs, called NURA, that offers high potential to improve resource utilization while ensuring fairness and Quality-of-Service (QoS). The key idea is that each streaming multiprocessor (SM) executes Cooperative Thread Arrays (CTAs) belonging to only one application (similar to spatial multitasking) and shares its unused resources with SMs running other applications that demand more resources. NURA handles the resource-sharing process mainly in software to provide simplicity, low hardware cost, and flexibility; we also make some hardware modifications as architectural support for our software-based proposal. A conservative analysis of the hardware cost of our proposal shows less than 1.07% area overhead with respect to the whole GPU die. Our experimental results over various mixes of GPU workloads show that NURA improves GPU system throughput by 26% on average compared to state-of-the-art spatial multitasking, while meeting the QoS target. In terms of fairness, NURA performs similarly to spatial multitasking, while outperforming simultaneous multi-kernel by an average of 76%.
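The key idea above, that each SM hosts CTAs of a single application but lends its unused resources to the other partition, can be illustrated with a toy occupancy model. This is a minimal sketch for intuition only; the functions, parameters, and numbers below are hypothetical and do not reproduce the paper's actual mechanism or results.

```python
# Toy occupancy model (illustrative only, not NURA's real mechanism).
# Two applications share a GPU whose SMs are split between them, and each SM
# offers a fixed number of CTA slots (a stand-in for registers/shared memory).

def spatial_multitasking(sm_split, demand_a, demand_b, slots_per_sm):
    """Resident CTAs per app when each partition may use only its own SMs."""
    sms_a, sms_b = sm_split
    resident_a = min(demand_a, sms_a * slots_per_sm)
    resident_b = min(demand_b, sms_b * slots_per_sm)
    return resident_a, resident_b

def nura_like_sharing(sm_split, demand_a, demand_b, slots_per_sm):
    """Same SM split, but slots left unused in one partition host CTAs of
    the other (resource-hungry) application, as in the sharing idea above."""
    resident_a, resident_b = spatial_multitasking(
        sm_split, demand_a, demand_b, slots_per_sm)
    sms_a, sms_b = sm_split
    spare_a = sms_a * slots_per_sm - resident_a   # unused slots in A's SMs
    spare_b = sms_b * slots_per_sm - resident_b   # unused slots in B's SMs
    # Lend spare capacity across the partition boundary.
    resident_a += min(spare_b, demand_a - resident_a)
    resident_b += min(spare_a, demand_b - resident_b)
    return resident_a, resident_b

# A light app (10 CTAs) paired with a heavy app (40 CTAs) on a 2+2 SM split:
print(spatial_multitasking((2, 2), 10, 40, 16))  # -> (10, 32): B is capped
print(nura_like_sharing((2, 2), 10, 40, 16))     # -> (10, 40): B uses A's spare
```

Under strict spatial multitasking the heavy application is capped by its own partition even while the light application's SMs sit partly idle; the sharing variant recovers that slack, which is the utilization gap the abstract describes.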
Proc. ACM Meas. Anal. Comput. Syst., Vol. 6, No. 1, Article 16. Publication date: March 2022. Sina Darabi, et al. Also in SIGMETRICS/PERFORMANCE '22: Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems.