Abstract
Heterogeneous systems that integrate a multicore CPU and a GPU on the same die are ubiquitous. On these systems, the CPU and GPU share the same physical memory rather than using separate memory dies. Although integration eliminates the need to copy data between the CPU and the GPU, arranging transparent memory sharing between the two devices can carry large overheads. Memory on CPU/GPU systems is typically managed by a software framework such as OpenCL or CUDA, which includes a runtime library and communicates with a GPU driver. These frameworks offer a range of memory management methods that vary in ease of use, consistency guarantees, and performance. In this study, we analyze common memory management methods of the most widely used software frameworks for heterogeneous systems: CUDA, OpenCL 1.2, OpenCL 2.0, and HSA, on NVIDIA and AMD hardware. We focus on performance/functionality trade-offs, with the goal of exposing their performance impact and simplifying the choice of memory management methods for programmers.
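The trade-off the abstract describes is visible directly in the CUDA runtime API, which offers both an explicit-copy method and a managed (shared) allocation method. The sketch below contrasts the two; the kernel, the buffer size `n`, and all variable names are illustrative choices, not taken from the paper, and the program assumes a system with the CUDA toolkit and a CUDA-capable GPU.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    /* Method 1: explicit copies. Host and device buffers are distinct,
       and the programmer moves data with cudaMemcpy. Simple consistency
       model, but every transfer is visible in the code. */
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = 1.0f;
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);

    /* Method 2: managed (unified) memory. One pointer is valid on both
       CPU and GPU; the runtime and driver keep the two views consistent,
       which is easier to program but can carry hidden overheads. */
    float *m;
    cudaMallocManaged(&m, bytes);
    for (int i = 0; i < n; i++) m[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(m, 2.0f, n);
    cudaDeviceSynchronize();        /* CPU must not touch m before sync */
    printf("%f %f\n", h[0], m[0]);
    cudaFree(m);
    free(h);
    return 0;
}
```

OpenCL 2.0 and HSA expose an analogous choice (explicit buffers versus shared virtual memory), so the same ease-of-use versus overhead trade-off applies across the frameworks the study compares.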
Analyzing memory management methods on integrated CPU-GPU systems. Published in ISMM 2017: Proceedings of the 2017 ACM SIGPLAN International Symposium on Memory Management.