Abstract
The traditional fixed-function graphics accelerator has evolved into a programmable general-purpose graphics processing unit over the last few years. These powerful computing cores are mainly used for accelerating graphics applications or enabling low-cost scientific computing. To further reduce cost and form factor, an emerging trend is to integrate the GPU, along with the memory controllers, onto the same die as the processor cores. However, in such a system-on-chip, the GPU, while occupying a substantial part of the silicon, sits idle and contributes nothing to overall system performance when running non-graphics workloads or applications that lack data-level parallelism. In this paper, we propose COMPASS, a compute shader-assisted data prefetching scheme, to leverage the GPU resource for improving single-threaded performance on an integrated system. By harnessing the GPU shader cores with very lightweight architectural support, COMPASS can emulate the functionality of a hardware-based prefetcher using the idle GPU and successfully improve the memory performance of single-threaded applications. Moreover, thanks to its flexibility and programmability, one can implement the best-performing prefetch scheme for each specific application, as demonstrated in this paper. With COMPASS, we envision that a future application vendor can provide a custom-designed COMPASS shader, bundled with its software and loaded at runtime, to optimize performance. Our simulation results show that COMPASS can improve the single-threaded performance of memory-intensive applications by 68% on average.
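To make the idea of emulating a hardware prefetcher in software more concrete, the following is a minimal sketch of a per-instruction stride prefetcher, one of the simplest schemes that programmable prefetch logic could implement. It is not the COMPASS shader from the paper: the class names, the `on_miss` interface, the table organization, and the `degree` parameter are all assumptions made for illustration, and the sketch is written in Python rather than in shader code.

```python
from dataclasses import dataclass


@dataclass
class StrideEntry:
    """Hypothetical per-load-instruction state: last address and last stride."""
    last_addr: int
    stride: int = 0


class StridePrefetcher:
    """Illustrative stride prefetcher keyed by the program counter of a load."""

    def __init__(self, degree: int = 2):
        self.table = {}          # pc -> StrideEntry
        self.degree = degree     # how many addresses to prefetch ahead

    def on_miss(self, pc: int, addr: int) -> list:
        """Record a miss and return the list of addresses to prefetch."""
        entry = self.table.get(pc)
        if entry is None:
            # First miss from this load: nothing to predict yet.
            self.table[pc] = StrideEntry(last_addr=addr)
            return []

        stride = addr - entry.last_addr
        # Predict only when the same non-zero stride is seen twice in a row.
        confident = stride != 0 and stride == entry.stride
        entry.stride = stride
        entry.last_addr = addr

        if not confident:
            return []
        return [addr + stride * i for i in range(1, self.degree + 1)]


prefetcher = StridePrefetcher(degree=2)
first = prefetcher.on_miss(0x400, 1000)    # no history yet
second = prefetcher.on_miss(0x400, 1064)   # stride observed once
third = prefetcher.on_miss(0x400, 1128)    # stride confirmed
```

Because the prediction logic is ordinary code, a different scheme could replace the body of `on_miss` without changing the interface, which is the kind of per-application flexibility the abstract attributes to a programmable prefetcher.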
COMPASS: a programmable data prefetcher using idle GPU shaders
ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '10)