Abstract
On multiprocessors with explicitly managed memory hierarchies (EMM), software is responsible for moving data in and out of fast local memories. This task can be complex and error-prone even for expert programmers. Before we can allow compilers to handle this complexity for us, we must identify abstractions that are general enough to let us write applications with reasonable effort, yet specific enough to exploit the vast on-chip memory bandwidth of EMM multiprocessors. To this end, we compare two programming models against hand-tuned codes on the STI Cell, paying attention to programmability and performance. The first programming model, Sequoia, abstracts the memory hierarchy as private address spaces, each corresponding to a parallel task. The second, Cellgen, is a new framework that provides OpenMP-like semantics and the abstraction of a shared address space divided into private and shared data. We compare three applications programmed using these models against their hand-optimized counterparts in terms of abstractions, programming complexity, and performance.
A comparison of programming models for multiprocessors with explicitly managed memory hierarchies
PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming