research-article

Function Call Re-Vectorization

Published: 26 January 2017

Abstract

Programming languages such as C for CUDA, OpenCL, and ISPC have increased the programmability of SIMD accelerators and graphics processing units. However, these languages still lack the flexibility offered by low-level SIMD programming on explicit vectors. To close this expressiveness gap while preserving performance, this paper introduces the notion of Function Call Re-Vectorization (CREV). CREV allows changing the dimension of vectorization during the execution of a kernel, exposing it as a nested parallel kernel call. CREV affords programmability close to that of dynamic parallelism, a feature that allows the invocation of kernels from inside kernels, but at much lower cost. In this paper, we present a formal semantics of CREV and an implementation of it in the ISPC compiler. We have used CREV to implement several classic algorithms, including string matching, depth-first search, and Bellman-Ford, with minimal effort. These algorithms, once compiled by ISPC to Intel-based vector instructions, are as fast as state-of-the-art implementations, yet much simpler. Thus, CREV gives developers the elegance of dynamic parallelism and the performance of explicit SIMD programming.
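The mechanism the abstract describes, switching mid-kernel from one-work-item-per-lane execution to all-lanes-cooperating-on-one-item execution, can be sketched as a scalar C emulation of SIMD lanes. This is only an illustration of the idea on the string-matching example: the paper's actual construct is an ISPC compiler extension, and the names below (`LANES`, `find_match`) are assumptions of this sketch, not the authors' code.

```c
#include <string.h>

#define LANES 4  /* emulated SIMD width (assumption of this sketch) */

/* Scalar emulation of CREV-style nested parallelism for string matching.
 * Outer phase: each lane tests one candidate start position.
 * When a lane's candidate survives the first-character test, execution is
 * "re-vectorized": all lanes cooperate to compare the pattern characters
 * of that single candidate, instead of one lane doing it alone. */
static int find_match(const char *text, const char *pat) {
    int n = (int)strlen(text), m = (int)strlen(pat);
    for (int base = 0; base + m <= n; base += LANES) {
        /* Outer vectorized phase: one candidate position per lane. */
        for (int lane = 0; lane < LANES && base + lane + m <= n; lane++) {
            int pos = base + lane;
            if (text[pos] != pat[0])
                continue;
            /* Re-vectorized phase: all LANES lanes now compare pattern
             * characters of this one candidate, LANES at a time. */
            int ok = 1;
            for (int off = 0; off < m; off += LANES)
                for (int l2 = 0; l2 < LANES && off + l2 < m; l2++)
                    if (text[pos + off + l2] != pat[off + l2])
                        ok = 0;
            if (ok)
                return pos;  /* first matching position */
        }
    }
    return -1;  /* no match */
}
```

In the real CREV setting the inner phase runs as actual vector instructions over the full SIMD width; the nested loops here only model which lane does which comparison.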



Published in

ACM SIGPLAN Notices, Volume 52, Issue 8 (PPoPP '17), August 2017, 442 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/3155284.

PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2017, 476 pages. ISBN: 9781450344937. DOI: 10.1145/3018743.

Copyright © 2017 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
