
Using Criticality of GPU Accesses in Memory Management for CPU-GPU Heterogeneous Multi-Core Processors

Published: 27 September 2017

Abstract

Heterogeneous chip-multiprocessors with a CPU and a GPU integrated on the same die allow critical memory system resources to be shared among CPU and GPU applications. Such architectures give rise to challenging resource scheduling problems. In this paper, we explore memory access scheduling algorithms driven by the criticality of GPU accesses in such systems. Different GPU access streams originate from different parts of the GPU rendering pipeline, which behaves very differently from the typical CPU pipeline and therefore requires new techniques for estimating GPU access criticality. We propose a novel queuing network model to estimate the performance-criticality of the GPU access streams. If a GPU application performs below its quality of service requirement (e.g., frame rate in 3D scene rendering), the memory access scheduler uses the estimated criticality information to accelerate the critical GPU accesses. Detailed simulations on a heterogeneous chip-multiprocessor model with one GPU and four CPU cores, running heterogeneous mixes of DirectX, OpenGL, and CPU applications, show that our proposal improves GPU performance by 15% on average without significantly degrading CPU performance. Extensions proposed for mixes containing GPGPU applications, which do not have any quality of service requirement, improve performance by 7% on average for these mixes.
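
The scheduling idea in the abstract can be illustrated with a small sketch. This is not the paper's queuing network model or scheduler, whose details the abstract does not give. It assumes a much simpler stand-in: each GPU access stream is treated as a single queue whose utilization (arrival rate divided by service rate) serves as its criticality score, and the scheduler favors GPU requests from the highest-scoring stream only while the measured frame rate is below the target. The names `Request`, `stream_criticality`, and `pick_next`, and the stream labels `"texture"` and `"color"`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    source: str   # "cpu" or "gpu"
    stream: str   # hypothetical GPU stream label; empty for CPU requests
    arrival: int  # arrival time in cycles


def stream_criticality(arrival_rate, service_rate):
    """Assumed proxy for criticality: per-stream queue utilization (lambda / mu)."""
    return {s: arrival_rate[s] / service_rate[s] for s in arrival_rate}


def pick_next(queue, criticality, gpu_fps, target_fps):
    """Choose the next memory request to serve.

    While the GPU misses its frame-rate target, serve the GPU request from
    the most critical stream (oldest first within a stream). Otherwise fall
    back to plain oldest-first ordering across all requests.
    """
    if not queue:
        return None
    if gpu_fps < target_fps:
        gpu_requests = [r for r in queue if r.source == "gpu"]
        if gpu_requests:
            return min(
                gpu_requests,
                key=lambda r: (-criticality.get(r.stream, 0.0), r.arrival),
            )
    return min(queue, key=lambda r: r.arrival)


# Illustrative numbers only; they are not taken from the paper.
queue = [
    Request("cpu", "", 0),
    Request("gpu", "color", 1),
    Request("gpu", "texture", 2),
]
crit = stream_criticality(
    {"texture": 8.0, "color": 2.0},
    {"texture": 10.0, "color": 10.0},
)
below_target = pick_next(queue, crit, gpu_fps=20, target_fps=30)
at_target = pick_next(queue, crit, gpu_fps=40, target_fps=30)
```

With these illustrative inputs, the texture request is chosen while the frame rate is below target, and the oldest request (from the CPU) is chosen once the target is met. Gating the GPU priority on the quality of service check is what keeps CPU requests from being penalized when the GPU is already fast enough.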

