Abstract
While multicore hardware has become ubiquitous, explicitly parallel programming models and compiler techniques for exploiting parallelism on these systems have noticeably lagged behind. Stream programming is one model that has wide applicability in the multimedia, graphics, and signal processing domains. Streaming models execute as a set of independent actors that explicitly communicate data through channels. This paper presents a compiler technique for planning and orchestrating the execution of streaming applications on multicore platforms. An integrated unfolding and partitioning step based on integer linear programming is presented that unfolds data parallel actors as needed and maximally packs actors onto cores. Next, the actors are assigned to pipeline stages in such a way that all communication is maximally overlapped with computation on the cores. To facilitate experimentation, a generalized code generation template for mapping the software pipeline onto the Cell architecture is presented. For a range of streaming applications, a geometric mean speedup of 14.7x is achieved on a 16-core Cell platform compared to a single core.
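The two planning steps named in the abstract can be illustrated with a small sketch. This is not the paper's integer linear programming formulation: an exhaustive search stands in for the ILP solver so the example stays self-contained, and it only scales to a handful of actors. The function names (`unfold`, `partition`, `assign_stages`), the `(work, stateless)` actor encoding, and the `dma_gap` parameter are all hypothetical choices made for illustration. In particular, separating a cross-core producer and consumer by two stages is an assumption about how a transfer might be given room to overlap with computation.

```python
from itertools import product


def unfold(actors, factor):
    """Split each stateless (data-parallel) actor into `factor` copies.

    `actors` maps name -> (work estimate, is_stateless). This encoding is an
    assumption for the sketch; each copy receives an equal share of the work.
    """
    out = {}
    for name, (work, stateless) in actors.items():
        if stateless and factor > 1:
            for i in range(factor):
                out[f"{name}_{i}"] = work / factor
        else:
            out[name] = work
    return out


def partition(work, cores):
    """Find the actor-to-core assignment that minimizes the largest core load.

    Brute force over all assignments: a stand-in for an ILP solver, feasible
    only for very small graphs.
    """
    names = sorted(work)
    best_load, best_assign = float("inf"), None
    for choice in product(range(cores), repeat=len(names)):
        loads = [0.0] * cores
        for name, core in zip(names, choice):
            loads[core] += work[name]
        if max(loads) < best_load:
            best_load, best_assign = max(loads), dict(zip(names, choice))
    return best_load, best_assign


def assign_stages(order, edges, core, dma_gap=2):
    """Give each actor a pipeline stage, visiting actors in topological `order`.

    A consumer on the same core as its producer may share the producer's stage.
    A consumer on another core is pushed `dma_gap` stages later (assumed value)
    so the transfer can run while both cores compute other iterations.
    """
    stage = {name: 0 for name in order}
    for name in order:
        for src, dst in edges:
            if dst == name:
                gap = dma_gap if core[src] != core[dst] else 0
                stage[name] = max(stage[name], stage[src] + gap)
    return stage
```

For example, with `actors = {"A": (10, False), "B": (40, True), "C": (10, False)}` on two cores, calling `partition(unfold(actors, 1), 2)` leaves the heavy actor `B` as the bottleneck, while `partition(unfold(actors, 2), 2)` splits `B` across both cores and lowers the largest per-core load. That is the effect the abstract describes as unfolding data-parallel actors as needed.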
Orchestrating the execution of stream programs on multicore platforms
PLDI '08: Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation