ABSTRACT
Difficulty of programming is one of the main impediments to the broad acceptance of multi-core systems that lack hardware support for transparent data transfer between local and global memories. A software cache is a robust way to give the user a transparent view of the memory architecture, but it can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that classifies memory accesses at compile time into two classes: high-locality and irregular. Our approach then steers each memory reference toward one of two cache structures, each optimized for its access pattern. These structures are designed so that high-level compiler optimizations can aggressively unroll loops, reorder cache references, and/or transform surrounding loops, practically eliminating the software-cache overhead in the innermost loop. Performance evaluation indicates that the optimized software-cache structures, combined with the proposed code optimizations, yield speedup factors of 3.5 to 8.4 over a traditional software-cache approach. As a result, we demonstrate that the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.
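The abstract does not describe the two cache structures in detail, so the following is only a minimal sketch of the general idea, not the authors' design. It assumes a loop of the form `sum += a[i] * b[idx[i]]`, where `a[i]` would be classified as high-locality and `b[idx[i]]` as irregular. All names (`fetch_line`, `irr_cache`, `irr_load`, `hybrid_dot`), the line size, and the direct-mapped organization are hypothetical, and a `memcpy` from a simulated global array stands in for a DMA transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_ELEMS 64   /* elements per cache line (assumed) */
#define IRR_SETS   128  /* lines in the irregular-access cache (assumed) */

/* Simulated global memory; real hardware would use DMA instead of memcpy. */
static double global_mem[1 << 16];
static unsigned long dma_count;   /* number of simulated line transfers */

static void fetch_line(double *dst, size_t line)
{
    memcpy(dst, &global_mem[line * LINE_ELEMS], sizeof(double) * LINE_ELEMS);
    dma_count++;
}

/* Irregular class: direct-mapped cache, tag checked on every reference. */
typedef struct {
    size_t tag[IRR_SETS];
    int    valid[IRR_SETS];
    double data[IRR_SETS][LINE_ELEMS];
} irr_cache;

static double irr_load(irr_cache *c, size_t addr)
{
    size_t line = addr / LINE_ELEMS;
    size_t set  = line % IRR_SETS;
    if (!c->valid[set] || c->tag[set] != line) {   /* miss: refill the line */
        fetch_line(c->data[set], line);
        c->tag[set]   = line;
        c->valid[set] = 1;
    }
    return c->data[set][addr % LINE_ELEMS];
}

/*
 * High-locality class: the loop is strip-mined by cache line, so the line
 * for a[] is fetched once per strip and the inner loop reads it from a
 * local buffer with no tag check. Only b[idx[i]] pays a lookup per access.
 * idx[] is assumed to already reside in local memory.
 */
static double hybrid_dot(irr_cache *c, size_t a_base, size_t b_base,
                         const size_t *idx, size_t n)
{
    double line_buf[LINE_ELEMS];
    double sum = 0.0;

    assert(a_base % LINE_ELEMS == 0);   /* sketch assumes line-aligned a[] */
    for (size_t i0 = 0; i0 < n; i0 += LINE_ELEMS) {
        size_t len = (n - i0 < LINE_ELEMS) ? (n - i0) : LINE_ELEMS;
        fetch_line(line_buf, (a_base + i0) / LINE_ELEMS);  /* hoisted lookup */
        for (size_t j = 0; j < len; j++)
            sum += line_buf[j] * irr_load(c, b_base + idx[i0 + j]);
    }
    return sum;
}
```

In this sketch the high-locality reference costs one transfer per `LINE_ELEMS` iterations and no per-access check, while the irregular reference keeps a per-access tag comparison. That split is one plausible reading of "practically eliminate the software cache overhead in the innermost loop"; the paper's actual structures and compiler transformations may differ.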
Index Terms
Hybrid access-specific software cache techniques for the cell BE architecture