research-article

OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

Published: 16 March 2013

Abstract

Emerging GPGPU architectures, along with programming models like CUDA and OpenCL, offer a cost-effective platform for many applications by providing high thread-level parallelism at lower energy budgets. Unfortunately, for many general-purpose applications, the available hardware resources of a GPGPU are not efficiently utilized, leading to lost opportunities for improving performance. A major cause of this is the inefficiency of current warp scheduling policies in tolerating long memory latencies.

In this paper, we identify that the scheduling decisions made by such policies are agnostic to thread-block, or cooperative thread array (CTA), behavior, and as a result inefficient. We present a coordinated CTA-aware scheduling policy that utilizes four schemes to minimize the impact of long memory latencies. The first two schemes, CTA-aware two-level warp scheduling and locality-aware warp scheduling, enhance per-core performance by effectively reducing cache contention and improving latency-hiding capability. The third scheme, bank-level-parallelism-aware warp scheduling, improves overall GPGPU performance by enhancing DRAM bank-level parallelism. The fourth scheme employs opportunistic memory-side prefetching to further enhance performance by taking advantage of open DRAM rows. Evaluations on a 28-core GPGPU platform with highly memory-intensive applications indicate that our proposed mechanism can provide a 33% average performance improvement over the commonly employed round-robin warp scheduling policy.
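To make the first scheme concrete, the sketch below shows one way a CTA-aware two-level warp scheduler could be organized. This is a hypothetical illustration, not the paper's implementation: the class names, the `active_group_size` parameter, and the stall model are all assumptions. The idea it captures is that warps are grouped by their parent CTA, only a small subset of CTA groups is scheduled at a time (so their warps share the cache and reduce contention), and pending groups are rotated in when the active ones stall on memory (improving latency hiding).

```python
# Hypothetical sketch of CTA-aware two-level warp scheduling
# (illustrative only; not the paper's actual mechanism).
from collections import deque


class Warp:
    def __init__(self, warp_id, cta_id):
        self.warp_id = warp_id
        self.cta_id = cta_id
        self.stalled = False  # True while waiting on an outstanding memory request


class CtaAwareTwoLevelScheduler:
    def __init__(self, warps, active_group_size=2):
        # Level 1: group warps by their parent CTA.
        groups = {}
        for w in warps:
            groups.setdefault(w.cta_id, []).append(w)
        self.groups = deque(groups.values())
        # Only this many CTA groups are "active" (schedulable) at once.
        self.active_group_size = active_group_size

    def pick_next(self):
        """Level 2: return the next ready warp from an active CTA group.

        Returns None if every warp in the active groups is stalled,
        after rotating a pending CTA group into the active set so its
        warps can overlap the outstanding memory latency.
        """
        active = list(self.groups)[:self.active_group_size]
        for group in active:
            for w in group:
                if not w.stalled:
                    return w
        # All active warps are waiting on memory: bring in a new CTA group.
        self.groups.rotate(-1)
        return None
```

As a usage example, with 8 warps spread over 4 CTAs and two active groups, the scheduler issues from CTAs 0 and 1 only; once all of their warps stall on memory, it rotates CTA 2 into the active set and issues its warps to hide the misses.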




Published in

ACM SIGPLAN Notices, Volume 48, Issue 4 (ASPLOS '13), April 2013, 540 pages.
ISSN: 0362-1340; EISSN: 1558-1160; DOI: 10.1145/2499368

Also appears in: ASPLOS '13: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, March 2013, 574 pages.
ISBN: 9781450318709; DOI: 10.1145/2451116

Copyright © 2013 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

