
Memory-Constrained Vectorization and Scheduling of Dataflow Graphs for Hybrid CPU-GPU Platforms

Published: 30 January 2018

Abstract

The increasing use of heterogeneous embedded systems with multi-core CPUs and Graphics Processing Units (GPUs) presents important challenges in effectively exploiting pipeline, task, and data-level parallelism to meet the throughput requirements of digital signal processing applications. Moreover, in the presence of system-level memory constraints, hand optimization of code to satisfy these requirements is inefficient and error-prone, and can therefore greatly slow down development or result in highly underutilized processing resources. In this article, we present vectorization and scheduling methods to effectively exploit multiple forms of parallelism for throughput optimization on hybrid CPU-GPU platforms, while conforming to system-level memory constraints. The methods operate on synchronous dataflow representations, which are widely used in the design of embedded systems for signal and information processing. We show that our novel methods can significantly improve system throughput compared to previous vectorization and scheduling approaches under the same memory constraints. In addition, we present a practical case study of applying our methods to significantly improve the throughput of an orthogonal frequency division multiplexing receiver system for wireless communications.
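To make the underlying model concrete: in a synchronous dataflow (SDF) graph, each actor produces and consumes fixed token counts per firing, and a valid periodic schedule fires each actor according to the graph's repetition vector, obtained by solving the balance equations. Vectorizing the graph by a factor J fires each actor in blocks of J times its base repetition count, which reduces per-firing overhead but multiplies edge buffer requirements — the trade-off the article's memory-constrained methods navigate. The sketch below is illustrative only (the graph, the flat single-appearance buffer bound, and all names are assumptions, not the article's actual algorithm):

```python
from fractions import Fraction
from math import lcm  # Python 3.9+ for multi-argument lcm

# Hypothetical SDF graph: edges as (src, dst, produce_rate, consume_rate).
edges = [("A", "B", 2, 3), ("B", "C", 1, 2)]

def repetitions(edges):
    """Smallest integer repetition vector solving q[src]*prod == q[dst]*cons."""
    # Propagate rational firing rates from an arbitrary root actor.
    rates = {edges[0][0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, dst, p, c in edges:
            if src in rates and dst not in rates:
                rates[dst] = rates[src] * p / c
                changed = True
            elif dst in rates and src not in rates:
                rates[src] = rates[dst] * c / p
                changed = True
    # Scale the rational solution to the smallest integer vector.
    scale = lcm(*(r.denominator for r in rates.values()))
    return {a: int(r * scale) for a, r in rates.items()}

def buffer_bounds(edges, q, J):
    """Per-edge buffer size for vectorization factor J, assuming a flat
    single-appearance schedule where each edge must hold every token
    produced in one vectorized iteration: J * q[src] * produce_rate."""
    return {(s, d): J * q[s] * p for s, d, p, c in edges}

q = repetitions(edges)
print(q)                             # {'A': 3, 'B': 2, 'C': 1}
print(buffer_bounds(edges, q, J=4))  # {('A', 'B'): 24, ('B', 'C'): 8}
```

Increasing J here scales every buffer linearly, which is why, under a fixed memory budget, the vectorization factor cannot be raised arbitrarily and must be chosen jointly with the schedule.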


