Abstract
Emerging GPGPU architectures, along with programming models like CUDA and OpenCL, offer a cost-effective platform for many applications by providing high thread-level parallelism at lower energy budgets. Unfortunately, for many general-purpose applications, the available hardware resources of a GPGPU are not efficiently utilized, leading to lost opportunities for performance improvement. A major cause of this is the inefficiency of current warp scheduling policies in tolerating long memory latencies.
In this paper, we identify that the scheduling decisions made by such policies are agnostic to thread-block, or cooperative thread array (CTA), behavior, and are therefore inefficient. We present a coordinated CTA-aware scheduling policy that utilizes four schemes to minimize the impact of long memory latencies. The first two schemes, CTA-aware two-level warp scheduling and locality-aware warp scheduling, enhance per-core performance by effectively reducing cache contention and improving latency hiding capability. The third scheme, bank-level parallelism-aware warp scheduling, improves overall GPGPU performance by enhancing DRAM bank-level parallelism. The fourth scheme employs opportunistic memory-side prefetching to further enhance performance by taking advantage of open DRAM rows. Evaluations on a 28-core GPGPU platform with highly memory-intensive applications indicate that our proposed mechanism provides a 33% average performance improvement over the commonly employed round-robin warp scheduling policy.
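To make the first scheme concrete, the sketch below illustrates the general idea of CTA-aware two-level warp scheduling: warps are grouped by their parent CTA, only a small group of CTAs issues at a time, and the scheduler switches to the next group when every warp in the active group is stalled on memory. This is a minimal toy model, not the paper's implementation; all class and parameter names (`Warp`, `CTAAwareTwoLevelScheduler`, `group_size`) are hypothetical.

```python
from collections import deque

class Warp:
    def __init__(self, wid, cta_id):
        self.wid = wid
        self.cta_id = cta_id
        self.stalled = False  # True while waiting on a long-latency memory access

class CTAAwareTwoLevelScheduler:
    """Toy two-level scheduler: only a small 'active' group of CTAs issues
    at a time, so the remaining CTA groups retain cache locality and can
    later hide the memory latency incurred by the active group."""

    def __init__(self, warps, group_size=2):
        # Group warps by parent CTA, then split the CTAs into groups.
        ctas = sorted({w.cta_id for w in warps})
        self.groups = deque(
            [w for w in warps if w.cta_id in ctas[i:i + group_size]]
            for i in range(0, len(ctas), group_size)
        )

    def pick_warp(self):
        # Round-robin among warps of the active group only.
        active = self.groups[0]
        for _ in range(len(active)):
            w = active.pop(0)
            active.append(w)
            if not w.stalled:
                return w
        # Every warp in the active group is stalled on memory:
        # rotate to the next CTA group to hide the latency.
        self.groups.rotate(-1)
        return None

# Usage: 8 warps across 4 CTAs (2 warps per CTA), CTA group size of 2.
warps = [Warp(i, i // 2) for i in range(8)]
sched = CTAAwareTwoLevelScheduler(warps, group_size=2)
w = sched.pick_warp()          # issues from CTAs 0-1 (the active group)
for stalled in warps[:4]:      # all warps of CTAs 0-1 stall on memory
    stalled.stalled = True
sched.pick_warp()              # nothing issuable; rotates to CTAs 2-3
w2 = sched.pick_warp()         # now issues from CTAs 2-3
```

The design point this illustrates is that restricting issue to a CTA subgroup, rather than round-robining over all warps, keeps the working set per core small (reducing cache contention) while the inactive groups provide ready work to overlap with the active group's memory stalls.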
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance. In ASPLOS '13: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems.