DOI: 10.1145/1542452.1542466
research-article

Synergistic execution of stream programs on multicores with accelerators

Published: 19 June 2009

ABSTRACT

The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general-purpose multicore architectures. StreamIt graphs describe task, data, and pipeline parallelism, which can be exploited on accelerators such as Graphics Processing Units (GPUs) or the Cell BE that support abundant parallelism in hardware.

In this paper, we describe a novel method to orchestrate the execution of a StreamIt program on a multicore platform equipped with an accelerator. The proposed approach identifies, using profiling, the relative benefits of executing a task on the superscalar CPU cores and the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies for data transfers and the required buffer layout transformations associated with the partitioning, as an integrated Integer Linear Program (ILP) which can then be solved by an ILP solver. We also propose an efficient heuristic algorithm for the work partitioning between the CPU and the GPU, which provides solutions that are within 9.05% of the optimal solution on average across the benchmark suite. The partitioned tasks are then software pipelined to execute on the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between the CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with 8 CPU cores and a GeForce 8800 GTS 512 GPU show a geometric mean speedup of 6.84X, with a maximum of 51.96X, over single-threaded CPU execution across the StreamIt benchmarks. This is an 18.9% improvement over a partitioning strategy that maps only the filters that cannot be executed on the GPU -- the filters with state that is persistent across firings -- onto the CPU.
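To make the cost model concrete, the following is a hypothetical sketch of a greedy CPU/GPU work-partitioning heuristic. It is not the paper's ILP formulation or its heuristic; the function names, cost dictionaries, and the transfer-penalty model are all illustrative assumptions. It only shows the kind of trade-off involved: profiled per-filter execution costs on each device, a penalty for moving data between devices, and the constraint that stateful filters must stay on the CPU.

```python
# Hypothetical sketch only: a greedy partitioner over a linear pipeline of
# stream filters. The real system solves an integrated ILP (or a more
# sophisticated heuristic) over the full stream graph.

def partition(filters, cpu_cost, gpu_cost, transfer_cost, stateful):
    """Assign each filter to 'cpu' or 'gpu'.

    filters       : filter names in pipeline order
    cpu_cost[f]   : profiled steady-state cost of f on the CPU cores
    gpu_cost[f]   : profiled steady-state cost of f on the GPU SMs
    transfer_cost : penalty whenever adjacent filters land on different
                    devices (models DMA latency + buffer layout reshaping)
    stateful      : filters with persistent state, pinned to the CPU
    """
    assignment = {}
    prev = None
    for f in filters:
        if f in stateful:
            # Filters with state persistent across firings cannot run on the GPU.
            assignment[f] = 'cpu'
        else:
            # Device cost = execution cost + transfer penalty if the
            # producer of this filter ran on the other device.
            c = cpu_cost[f] + (transfer_cost if prev == 'gpu' else 0)
            g = gpu_cost[f] + (transfer_cost if prev == 'cpu' else 0)
            assignment[f] = 'cpu' if c <= g else 'gpu'
        prev = assignment[f]
    return assignment

# Toy example: B is stateful, so it is pinned to the CPU even though the
# GPU would execute it faster; C still moves back to the GPU because its
# GPU speedup outweighs the transfer penalty.
asg = partition(['A', 'B', 'C'],
                cpu_cost={'A': 10, 'B': 5, 'C': 20},
                gpu_cost={'A': 2, 'B': 1, 'C': 3},
                transfer_cost=4,
                stateful={'B'})
print(asg)
```

A greedy pass like this only looks one step back, which is exactly why the paper's integrated ILP (which sees all transfer edges at once) can do better; the reported heuristic closes most of that gap, coming within 9.05% of the ILP optimum on average.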

References

  1. NVIDIA CUDA Programming Guide. URL http://www.nvidia.com/cuda.
  2. OpenCL Overview. URL http://www.khronos.org/developers/library/overview/opencl_overview.pdf.
  3. StreamIt Home Page. URL http://www.cag.lcs.mit.edu/streamit/.
  4. S. S. Bhattacharyya and E. A. Lee. Looped Schedules for Dataflow Descriptions of Multirate Signal Processing Algorithms. Formal Methods in System Design, 5(3), 1994.
  5. Ian Buck et al. Brook for GPUs: Stream Computing on Graphics Hardware. ACM Trans. on Graphics, 23(3), 2004.
  6. J. A. Kahle et al. Introduction to the Cell Multiprocessor. IBM Journal of Research and Development, 49(4-5), 2005.
  7. Michael Bedford Taylor et al. The RAW Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Micro, 22(2), 2002.
  8. Michael I. Gordon et al. A Stream Compiler for Communication-Exposed Architectures. In ASPLOS-X: Proc. of the 10th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2002.
  9. Shane Ryoo et al. Program Optimization Space Pruning for a Multithreaded GPU. In CGO '08: Proc. of the Sixth Annual IEEE/ACM Intl. Symp. on Code Generation and Optimization, 2008.
  10. Shane Ryoo et al. Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA. In PPoPP '08: Proc. of the 13th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, 2008.
  11. G. R. Gao, R. Govindarajan, and P. Panangaden. Well-Behaved Dataflow Programs for DSP Computation. In ICASSP-92: IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Mar 1992.
  12. Michael I. Gordon, William Thies, and Saman Amarasinghe. Exploiting Coarse-grained Task, Data, and Pipeline Parallelism in Stream Programs. In ASPLOS-XII: Proc. of the 12th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2006.
  13. R. Govindarajan and Guang R. Gao. A Novel Framework for Multirate Scheduling in DSP Applications. In ASAP '93: Proc. of the 1993 Intl. Conf. on Application-Specific Array Processors, Oct 1993.
  14. R. Govindarajan, Guang R. Gao, and Palash Desai. Minimizing Memory Requirements in Rate-optimal Schedules. In ASAP '94: Proc. of the 1994 Intl. Conf. on Application-Specific Array Processors, Aug 1994.
  15. Junwei Hou and Wayne Wolf. Process Partitioning for Distributed Embedded Systems. In CODES '96: Proc. of the 4th Intl. Workshop on Hardware/Software Co-Design, 1996.
  16. G. Karypis and V. Kumar. Multilevel k-way Partitioning Scheme for Irregular Graphs. Journal of Parallel and Distributed Computing, 48, 1998.
  17. B. W. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. Bell System Tech. Journal, 49, Feb. 1970.
  18. Manjunath Kudlur and Scott Mahlke. Orchestrating the Execution of Stream Programs on Multicore Platforms. In PLDI '08: Proc. of the 2008 ACM SIGPLAN Conf. on Programming Language Design and Implementation, 2008.
  19. E. A. Lee and D. G. Messerschmitt. Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing. IEEE Trans. on Computers, 36(1), 1987.
  20. E. A. Lee and D. G. Messerschmitt. Synchronous Data Flow. Proc. of the IEEE, 75(9), Sept. 1987.
  21. B. R. Rau. Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops. In MICRO 27: Proc. of the 27th Annual Intl. Symp. on Microarchitecture, 1994.
  22. B. R. Rau, Michael S. Schlansker, and P. P. Tirumalai. Code Generation Schema for Modulo Scheduled Loops. In MICRO 25: Proc. of the 25th Annual Intl. Symp. on Microarchitecture, 1992.
  23. John Ruttenberg, Guang R. Gao, A. Stoutchinin, and W. Lichtenstein. Software Pipelining Showdown: Optimal vs. Heuristic Methods in a Production Compiler. In PLDI '96: Proc. of the ACM SIGPLAN 1996 Conf. on Programming Language Design and Implementation, 1996.
  24. Janis Sermulins, William Thies, Rodric Rabbah, and Saman Amarasinghe. Cache Aware Optimization of Stream Programs. In LCTES '05: Proc. of the 2005 ACM SIGPLAN/SIGBED Conf. on Languages, Compilers, and Tools for Embedded Systems, 2005.
  25. David Tarditi, Sidd Puri, and Jose Oglesby. Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses. In ASPLOS-XII: Proc. of the 12th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2006.
  26. William Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: A Language for Streaming Applications. In CC '02: Proc. of the 11th Intl. Conf. on Compiler Construction, 2002.
  27. Abhishek Udupa, R. Govindarajan, and Matthew J. Thazhuthaveetil. Software Pipelined Execution of Stream Programs on GPUs. In CGO '09: Proc. of the Seventh Annual IEEE/ACM Intl. Symp. on Code Generation and Optimization, 2009.
  28. Ti-Yen Yen and Wayne Wolf. Communication Synthesis for Distributed Embedded Systems. In ICCAD '95: Proc. of the 1995 IEEE/ACM Intl. Conf. on Computer-Aided Design, 1995.
  29. D. Zhang, Qiuyuan J. Li, Rodric Rabbah, and Saman Amarasinghe. A Lightweight Streaming Layer for Multicore Execution. SIGARCH Computer Architecture News, 36(2), 2008.

Published in

LCTES '09: Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
June 2009, 188 pages
ISBN: 9781605583563
DOI: 10.1145/1542452

ACM SIGPLAN Notices, Volume 44, Issue 7 (LCTES '09)
July 2009, 176 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1543136

Copyright © 2009 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

LCTES '09 Paper Acceptance Rate: 18 of 81 submissions, 22%
Overall Acceptance Rate: 116 of 438 submissions, 26%
