research-article

COMPASS: a programmable data prefetcher using idle GPU shaders

Published: 13 March 2010

Abstract

Over the last few years, the traditional fixed-function graphics accelerator has evolved into a programmable general-purpose graphics processing unit (GPU). These powerful computing cores are mainly used for accelerating graphics applications or enabling low-cost scientific computing. To further reduce cost and form factor, an emerging trend is to integrate the GPU, along with the memory controllers, onto the same die as the processor cores. However, in such a system-on-chip, the GPU occupies a substantial part of the silicon yet sits idle and contributes nothing to overall system performance when the system runs non-graphics workloads or applications that lack data-level parallelism. In this paper, we propose COMPASS, a compute shader-assisted data prefetching scheme, to leverage the GPU resource for improving single-threaded performance on an integrated system. By harnessing the GPU shader cores with very lightweight architectural support, COMPASS can emulate the functionality of a hardware-based prefetcher using the idle GPU and successfully improve the memory performance of single-threaded applications. Moreover, thanks to its flexibility and programmability, one can implement the best-performing prefetch scheme for each specific application, as demonstrated in this paper. With COMPASS, we envision that a future application vendor can provide a custom-designed COMPASS shader, bundled with its software and loaded at runtime, to optimize performance. Our simulation results show that COMPASS can improve the single-thread performance of memory-intensive applications by 68% on average.
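The abstract does not give shader code, so the following is only an illustrative sketch of what emulating a hardware prefetcher in software can look like. It models a simple per-load stride prefetcher in Python. The names `StridePrefetcher`, `on_miss`, and `degree` are hypothetical and do not come from the paper, and a real COMPASS shader would run on GPU shader cores rather than in Python.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Per-load state (hypothetical layout, not taken from the paper)."""
    last_addr: int
    stride: int = 0


class StridePrefetcher:
    """Software model of a stride prefetcher keyed by load PC.

    Assumption: the prefetcher is invoked once per cache-miss event and
    returns the addresses it would like to prefetch.
    """

    def __init__(self, degree=2):
        self.degree = degree   # how many strides ahead to prefetch
        self.table = {}        # load PC -> Entry

    def on_miss(self, pc, addr):
        entry = self.table.get(pc)
        if entry is None:
            # First miss from this load: record it, nothing to predict yet.
            self.table[pc] = Entry(last_addr=addr)
            return []
        stride = addr - entry.last_addr
        # Predict only when the same non-zero stride is seen twice in a row.
        confident = stride != 0 and stride == entry.stride
        entry.stride = stride
        entry.last_addr = addr
        if not confident:
            return []
        return [addr + stride * k for k in range(1, self.degree + 1)]
```

Because the policy is ordinary code, swapping in a different scheme (for example, a correlation-based one) means replacing `on_miss` rather than changing hardware, which is the kind of programmability the abstract argues for.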



Published in

ACM SIGPLAN Notices, Volume 45, Issue 3 (ASPLOS '10)
March 2010, 399 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1735971

ASPLOS XV: Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2010, 422 pages
ISBN: 9781605588391
DOI: 10.1145/1736020
General Chair: James C. Hoe
Program Chair: Vikram S. Adve

Copyright © 2010 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
