Abstract
Heterogeneous chip-multiprocessors with a CPU and a GPU integrated on the same die allow CPU and GPU applications to share critical memory system resources. Such architectures give rise to challenging resource scheduling problems. In this paper, we explore memory access scheduling algorithms driven by the criticality of GPU accesses in such systems. Different GPU access streams originate from different parts of the GPU rendering pipeline, which behaves very differently from the typical CPU pipeline and therefore requires new techniques for estimating GPU access criticality. We propose a novel queuing network model to estimate the performance-criticality of the GPU access streams. If a GPU application performs below its quality of service requirement (e.g., frame rate in 3D scene rendering), the memory access scheduler uses the estimated criticality information to accelerate the critical GPU accesses. Detailed simulations of a heterogeneous chip-multiprocessor model with one GPU and four CPU cores running heterogeneous mixes of DirectX, OpenGL, and CPU applications show that our proposal improves GPU performance by 15% on average with little degradation in CPU performance. Extensions proposed for mixes containing GPGPU applications, which have no quality of service requirement, improve performance by 7% on average for these mixes.
Index Terms
Using Criticality of GPU Accesses in Memory Management for CPU-GPU Heterogeneous Multi-Core Processors