Abstract
With advances in manycore and accelerator architectures, the high-performance and embedded computing spaces are rapidly converging. Emerging architectures feature different forms of parallelism. Polyhedral Process Networks (PPNs) are a proven model of choice for the automated generation of pipeline- and task-parallel programs from sequential source code; however, they do not address data parallelism. In this paper, we present a systematic approach for the identification and extraction of fine-grain data parallelism from the PPN specification. The approach is implemented in a tool, called kpn2gpu, which produces fine-grain data-parallel CUDA kernels for graphics processing units (GPUs). First experiments indicate that the generated applications have the potential to exploit the different forms of parallelism provided by the architecture, and that the kernels feature a highly regular structure that allows subsequent optimizations.
KPN2GPU: an approach for discovery and exploitation of fine-grain data parallelism in process networks