Function Call Re-Vectorization

Abstract

Programming languages such as C for CUDA, OpenCL, and ISPC have helped increase the programmability of SIMD accelerators and graphics processing units. However, these languages still lack the flexibility offered by low-level SIMD programming on explicit vectors. To close this expressiveness gap while preserving performance, this paper introduces the notion of Function Call Re-Vectorization (CREV). CREV allows the dimension of vectorization to be changed during the execution of a kernel, exposing that change as a nested parallel kernel call. CREV affords programmability close to that of dynamic parallelism, a feature that allows kernels to be invoked from inside other kernels, but at much lower cost. In this paper, we present a formal semantics of CREV and an implementation of it in the ISPC compiler. We have used CREV to implement several classic algorithms, including string matching, depth-first search, and Bellman-Ford, with minimal effort. These algorithms, once compiled by ISPC to Intel vector instructions, are as fast as state-of-the-art implementations, yet much simpler. Thus, CREV gives developers the elegance of dynamic parallelism and the performance of explicit SIMD programming.
Index Terms
Function Call Re-Vectorization