Abstract
The increasing use of heterogeneous embedded systems with multi-core CPUs and Graphics Processing Units (GPUs) presents important challenges in effectively exploiting pipeline, task, and data-level parallelism to meet the throughput requirements of digital signal processing applications. Moreover, in the presence of system-level memory constraints, hand-optimizing code to satisfy these requirements is inefficient and error-prone, and can therefore greatly slow down development or leave processing resources highly underutilized. In this article, we present vectorization and scheduling methods that effectively exploit multiple forms of parallelism for throughput optimization on hybrid CPU-GPU platforms while conforming to system-level memory constraints. The methods operate on synchronous dataflow representations, which are widely used in the design of embedded systems for signal and information processing. We show that our novel methods can significantly improve system throughput compared to previous vectorization and scheduling approaches under the same memory constraints. In addition, we present a practical case study in which our methods significantly improve the throughput of an orthogonal frequency division multiplexing receiver system for wireless communications.
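As background for the synchronous dataflow (SDF) model that the abstract refers to (this is standard SDF theory, not the paper's specific vectorization algorithm), the sketch below solves the SDF balance equations to obtain the repetition vector of a small graph. The actor names, the 64-token FFT rates, and the edge encoding `(src, dst, production_rate, consumption_rate)` are illustrative assumptions; the graph is assumed to be connected and consistent.

```python
from math import gcd
from functools import reduce
from fractions import Fraction

def repetition_vector(actors, edges):
    """Solve the SDF balance equations prod * q[src] == cons * q[dst]
    for every edge, returning the smallest positive integer solution q.
    edges: list of (src, dst, production_rate, consumption_rate)."""
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:  # propagate rates until every actor is assigned
        changed = False
        for src, dst, prod, cons in edges:
            if src in q and dst not in q:
                q[dst] = q[src] * prod / cons
                changed = True
            elif dst in q and src not in q:
                q[src] = q[dst] * cons / prod
                changed = True
    # scale the fractional solution to the smallest integer vector
    lcm = reduce(lambda a, b: a * b // gcd(a, b),
                 (f.denominator for f in q.values()))
    return {a: int(f * lcm) for a, f in q.items()}

# Hypothetical 3-actor chain: a source feeding a 64-point FFT and a sink.
q = repetition_vector(["src", "fft", "snk"],
                      [("src", "fft", 1, 64), ("fft", "snk", 64, 1)])
# q == {"src": 64, "fft": 1, "snk": 64}: one schedule iteration fires
# src 64 times, fft once, and snk 64 times.
```

Vectorization in the sense used here amounts to scaling every entry of `q` by a vectorization degree N, so each actor processes N blocks of data per invocation; this improves throughput (fewer invocations, better data parallelism on the GPU) at the cost of proportionally larger edge buffers, which is exactly the trade-off a memory-constrained method must manage.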
Memory-Constrained Vectorization and Scheduling of Dataflow Graphs for Hybrid CPU-GPU Platforms