Page Placement Strategies for GPUs within Heterogeneous Memory Systems

Published: 14 March 2015

Abstract

Systems from smartphones to supercomputers are increasingly heterogeneous, being composed of both CPUs and GPUs. To maximize cost and energy efficiency, these systems will increasingly use globally-addressable heterogeneous memory systems, making choices about memory page placement critical to performance. In this work we show that current page placement policies are not sufficient to maximize GPU performance in these heterogeneous memory systems. We propose two new page placement policies that improve GPU performance: one application agnostic and one using application profile information. Our application agnostic policy, bandwidth-aware (BW-AWARE) placement, maximizes GPU throughput by balancing page placement across the memories based on the aggregate memory bandwidth available in a system. Our simulation-based results show that BW-AWARE placement outperforms the existing Linux INTERLEAVE and LOCAL policies by 35% and 18% on average for GPU compute workloads. We build upon BW-AWARE placement by developing a compiler-based profiling mechanism that provides programmers with information about GPU application data structure access patterns. Combining this information with simple program-annotated hints about memory placement, our hint-based page placement approach performs within 90% of oracular page placement on average, largely mitigating the need for costly dynamic page tracking and migration.
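The core idea of BW-AWARE placement, distributing pages across memory zones in proportion to each zone's bandwidth rather than all-local or 1:1 interleaving, can be illustrated with a small sketch. This is not the paper's implementation; the function name, the zone tuples, and the credit-counter scheme are illustrative assumptions, showing only the proportional-distribution principle the abstract describes.

```python
def bw_aware_placement(num_pages, zones):
    """Assign each page index to a memory zone in proportion to bandwidth.

    zones: list of (name, bandwidth_GBps) tuples.
    Returns a list mapping page index -> zone name.
    Illustrative sketch only; the real policy operates inside the OS
    page allocator, not on a precomputed list.
    """
    total_bw = sum(bw for _, bw in zones)
    placement = []
    # Fractional credit counters implement a proportional round-robin
    # (largest-remainder style): each zone accrues credit equal to its
    # bandwidth share per page, and the zone furthest ahead of its
    # allocations receives the next page.
    credit = {name: 0.0 for name, _ in zones}
    for _ in range(num_pages):
        for name, bw in zones:
            credit[name] += bw / total_bw
        target = max(credit, key=credit.get)
        credit[target] -= 1.0
        placement.append(target)
    return placement

# Example: a 200 GB/s GPU-attached memory and an 80 GB/s CPU-attached
# memory receive pages in a 5:2 ratio (zone names are hypothetical).
pages = bw_aware_placement(7, [("GDDR", 200.0), ("DDR", 80.0)])
print(pages.count("GDDR"), pages.count("DDR"))  # -> 5 2
```

The contrast with the Linux policies the abstract mentions: INTERLEAVE would split pages 1:1 regardless of bandwidth, and LOCAL would place everything in the faster near memory until it fills, whereas bandwidth-proportional striping lets accesses draw on both memories' bandwidth simultaneously.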


Published in

ACM SIGPLAN Notices, Volume 50, Issue 4 (ASPLOS '15), April 2015, 676 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2775054
Editor: Andy Gill

Also appears in: ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, March 2015, 720 pages
ISBN: 9781450328357
DOI: 10.1145/2694344

Copyright © 2015 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Article type: Research article
