
MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

Published: 19 March 2018

Abstract

Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can place vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.
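The shared-TLB interference described above can be made concrete with a toy LRU model. This sketch is illustrative only: the TLB sizes, page-stride traces, and 2:1 issue ratio are assumptions for the example, not the paper's simulated configuration. It contrasts one shared TLB against two private TLBs of the same total capacity when a translation-intensive application runs alongside one with a small working set.

```python
from collections import OrderedDict

class TLB:
    """Minimal fully associative TLB with LRU replacement."""
    def __init__(self, entries):
        self.entries = entries
        self.map = OrderedDict()   # (app id, virtual page) -> present
        self.hits = self.misses = 0

    def access(self, app, vpage):
        key = (app, vpage)         # tag entries by address space
        if key in self.map:
            self.map.move_to_end(key)
            self.hits += 1
        else:
            self.misses += 1
            self.map[key] = True
            if len(self.map) > self.entries:
                self.map.popitem(last=False)   # evict LRU entry

# App 0 is translation-intensive (strides over 48 pages, issues twice as
# often); app 1 has a small 16-page working set.
trace, a0, a1 = [], 0, 0
for _ in range(320):
    trace += [(0, a0 % 48), (0, (a0 + 1) % 48), (1, a1 % 16)]
    a0 += 2
    a1 += 1

shared = TLB(entries=32)           # one TLB spatially shared by both apps
for app, page in trace:
    shared.access(app, page)

private = [TLB(entries=16), TLB(entries=16)]   # same total capacity, split
for app, page in trace:
    private[app].access(app, page)

total = len(trace)
print(f"shared 32-entry TLB miss rate : {shared.misses / total:.2f}")
priv_misses = sum(t.misses for t in private)
print(f"two private 16-entry TLBs     : {priv_misses / total:.2f}")
# App 1 fits comfortably in a private TLB, but app 0's stream evicts its
# entries in the shared TLB, so every access becomes a miss under sharing.
```

In this toy trace, sharing drives app 1's translations out of the TLB even though app 1's working set fits easily on its own, mirroring the inter-application interference the paper identifies.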
Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK's system throughput is within 23.2% of an ideal GPU system with no address translation overhead.
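As a rough illustration of the token idea behind mechanism (1), the sketch below throttles which application may install new shared-TLB entries. The epoch-based proportional policy, class name, and parameters are hypothetical simplifications for exposition, not MASK's actual hardware policy.

```python
class TokenAllocator:
    """Toy token-based throttle for shared-TLB fills (illustrative only).

    Each epoch, applications with higher shared-TLB hit rates receive a
    larger share of a fixed token budget. A warp may install (fill) a new
    shared-TLB entry only while its application holds a spare token;
    otherwise it bypasses the fill, limiting how much a thrashing
    application can pollute the shared TLB."""

    def __init__(self, total_tokens):
        self.total = total_tokens
        self.share = {}            # app -> token budget for this epoch

    def reallocate(self, hit_rates):
        # Split the budget proportionally to each app's measured hit rate.
        s = sum(hit_rates.values()) or 1.0
        self.share = {app: int(self.total * r / s)
                      for app, r in hit_rates.items()}

    def may_fill(self, app, in_flight):
        # Permit a fill only if the app has tokens left this epoch.
        return in_flight.get(app, 0) < self.share.get(app, 0)

alloc = TokenAllocator(total_tokens=8)
alloc.reallocate({"appA": 0.8, "appB": 0.2})   # appA reuses translations well
print(alloc.share)                              # appA gets most of the budget
print(alloc.may_fill("appA", {"appA": 3}))      # under budget: may fill
print(alloc.may_fill("appB", {"appB": 1}))      # budget spent: bypass the fill
```

The design intuition matches the paper's description at a high level: applications that benefit from caching translations get priority for shared-TLB capacity, while the rest bypass, so one thrashing application cannot monopolize the shared structure.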

