Abstract
Growing interest in graphics processing units has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers tremendous computational power; however, programming them is still challenging. In particular, developers must deal with memory and control-flow divergences. These phenomena stem from a condition that we call data divergence, which occurs whenever two processing elements (PEs) see the same variable name holding different values. This article introduces divergence analysis, a static analysis that discovers data divergences. This analysis, currently deployed in an industrial quality compiler, is useful in several ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers to manually improve their SIMD applications, and it also guides the automatic optimization of SIMD programs. We demonstrate this last point by introducing the notion of a divergence-aware register spiller. This spiller uses information from our analysis to either rematerialize or share common data between PEs. As a testimony of its effectiveness, we have tested it on a suite of 395 CUDA kernels from well-known benchmarks. The divergence-aware spiller produces GPU code that is 26.21% faster than the code produced by the register allocator used in the baseline compiler.
- Abel, N. E., Budnik, P. P., Kuck, D. J., Muraoka, Y., Northcote, R. S., and Wilhelmson, R. B. 1969. TRANQUIL: A language for an array processing computer. In Proceedings of AFIPS. ACM, 57--73. Google Scholar
Digital Library
- Aho, A. V., Lam, M. S., Sethi, R., and Ullman, J. D. 2006. Compilers: Principles, Techniques, and Tools 2nd Ed. Addison Wesley. Google Scholar
Digital Library
- Aiken, A. and Gay, D. 1998. Barrier inference. In Proceedings of POPL. ACM, 342--354. Google Scholar
Digital Library
- Appel, A. W. 1998. SSA is functional programming. SIGPLAN Not. 33, 4, 17--20. Google Scholar
Digital Library
- Baghsorkhi, S. S., Delahaye, M., Patel, S. J., Gropp, W. D., and Hwu, W.-M. W. 2010. An adaptive performance modeling tool for GPU architectures. In Proceedings of PPoPP. ACM, 105--114. Google Scholar
Digital Library
- Belady, L. A. 1966. A study of replacement algorithms for a virtual storage computer. IBM Syst. J. 5, 2, 78--101. Google Scholar
Digital Library
- Blelloch, G. and Chatterjee, S. 1990. Vcode: A data-parallel intermediate language. In Proceedings of FMPC. ACM, 471--480.Google Scholar
- Boudier, P. and Sellers, G. 2011. Memory system on Fusion APUs. In Proceedings of the AMD Fusion Developer Summit. AMD.Google Scholar
- Bougé, L. and Levaire, J.-L. 1992. Control structures for data-parallel SIMD languages: Semantics and implementation. Future Gen. Comput. Syst. 8, 4, 363--378. Google Scholar
Digital Library
- Bouknight, W., Denenberg, S. A., McIntyre, D. E., Randall, J. M., Sameh, A. H., and Slotnick, D. L. 1972. The Illiac IV system. IEEE 60, 4, 369--388.Google Scholar
Cross Ref
- Briggs, P., Cooper, K. D., and Torczon, L. 1992. Rematerialization. In Proceedings of PLDI. ACM, 311--321. Google Scholar
Digital Library
- Brockmann, K. and Wanka, R. 1997. Efficient oblivious parallel sorting on the MasPar MP-1. Vol. ICSS. 1, 200. Google Scholar
Digital Library
- Budimlic, Z., Cooper, K. D., Harvey, T. J., Kennedy, K., Oberg, T. S., and Reeves, S. W. 2002. Fast copy coalescing and live-range identification. In Proceedings of PLDI. ACM, 25--32. Google Scholar
Digital Library
- Carrillo, S., Siegel, J., and Li, X. 2009. A control-structure splitting optimization for GPGPU. In Proceedings of the 6th ACM Conference Computing Frontiers. ACM, 147--150. Google Scholar
Digital Library
- Cederman, D. and Tsigas, P. 2009. GPU-quicksort: A practical quicksort algorithm for graphics processors. J. Exp. Algorithmics 14, 1, 4--24. Google Scholar
Digital Library
- Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J. W., Lee, S.-H., and Skadron, K. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). IEEE, 44--54. Google Scholar
Digital Library
- Choi, J.-D., Cytron, R., and Ferrante, J. 1991. Automatic construction of sparse data flow evaluation graphs. In Proceedings of POPL. ACM, 55--66. Google Scholar
Digital Library
- Collange, S., Defour, D., and Zhang, Y. 2009. Dynamic detection of uniform and affine vectors in GPGPU computations. In Proceedings of the Parallel Processing Workshop (HPPC). Lecture Notes in computer science, vol. 6043, Springer, 46--55. Google Scholar
Digital Library
- Cousot, P. and Halbwachs, N. 1978. Automatic discovery of linear restraints among variables of a program. In Proceedings of POPL. ACM, 84--96. Google Scholar
Digital Library
- Coutinho, B., Sampaio, D., Pereira, F. M. Q., and Meira, W. 2011. Divergence analysis and optimizations. In Proceedings of International Conference on Parallel Architecture and Compilation Techniques (PACT). IEEE, 320--329. Google Scholar
Digital Library
- Coutinho, B., Sampaio, D., Pereira, F. M. Q., and Meira, W. 2013. Profiling divergences in GPU applications. Concur. Comput. Pract. Exp. 25, 6, 775--789.Google Scholar
Cross Ref
- Cunningham, D., Bordawekar, R., and Saraswat, V. 2011. GPU programming in a high level language: Compiling ×10 to CUDA. In Proceedings of the ACM SIGPLAN X10 Workshop. ACM, 8:1--8:10. Google Scholar
Digital Library
- Cytron, R., Ferrante, J., Rosen, B. K., Wegman, M. N., and Zadeck, F. K. 1991. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst. 13, 4, 451--490. Google Scholar
Digital Library
- Darema, F., George, D. A., Norton, V. A., and Pfister, G. F. 1988. A single-program-multiple-data computational model for EPEX/FORTRAN. Parallel Comput. 7, 1, 11--24.Google Scholar
Cross Ref
- Diamos, G., Kerr, A., Yalamanchili, S., and Clark, N. 2010. Ocelot, a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques (PACT). IEEE, 354--364. Google Scholar
Digital Library
- Dubach, C., Cheng, P., Rabbah, R., Bacon, D. F., and Fink, S. J. 2012. Compiling a high-level language for GPUS: (via language support for architectures and compilers). In Proceedings of ACM SIGPLAN PLDI. ACM, 1--12. Google Scholar
Digital Library
- Farrell, C. A. and Kieronska, D. H. 1996. Formal specification of parallel SIMD execution. Theo. Comp. Science 169, 1, 39--65. Google Scholar
Digital Library
- Ferrante, J., Ottenstein, K. J., and Warren, J. D. 1987. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3, 319--349. Google Scholar
Digital Library
- Flynn, M. 1972. Some computer organizations and their effectiveness. IEEE Trans. Comput. 21, 9, 948--960. Google Scholar
Digital Library
- Fung, W. W. L., Sham, I., Yuan, G., and Aamodt, T. M. 2007. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of IEEE/ACM MICRO. IEEE, 407--420. Google Scholar
Digital Library
- Garland, M. and Kirk, D. B. 2010. Understanding throughput-oriented architectures. Comm. ACM 53, 58--66. Google Scholar
Digital Library
- Gou, C. and Gaydadjiev, G. 2013. Addressing GPU on-chip shared memory bank conflicts using elastic pipeline. Int. J. Parallel Program. 41, 3, 400--429.Google Scholar
Cross Ref
- Grover, V., Joannes, B., Aarts, M., and Murphy, M. 2009. Variance analysis for translating CUDA code for execution by a general purpose processor. U.S. Patent 2009/0259997, Filed March 31, 2009, and issued October 15, 2009.Google Scholar
- Hack, S. and Goos, G. 2006. Optimal register allocation for SSA-form programs in polynomial time. Inf. Process. Lett. 98, 4, 150--155. Google Scholar
Digital Library
- Han, T. D. and Abdelrahman, T. S. 2011. Reducing branch divergence in GPU programs. In Proceedings of GPGPU-4. ACM, 3:1--3:8. Google Scholar
Digital Library
- Jang, B., Schaa, D., Mistry, P., and Kaeli, D. 2010. Static memory access pattern analysis on a massively parallel GPU. In Proceedings of PLDI. ACM.Google Scholar
- Karrenberg, R. and Hack, S. 2011. Whole-function vectorization. In Proceedings of the 9th IEEE/ACM CGO. IEEE, 141--150. Google Scholar
Digital Library
- Karrenberg, R. and Hack, S. 2012. Improving performance of OpenCL on CPUS. In Proceedings of CC. Springer, 1--20. Google Scholar
Digital Library
- Keryell, R., Materat, P., and Paris, N. 1991. POMP, or how to design a massively parallel machine with small developments. In Proceedings of Parallel Architectures and Languages Europe (PARLE). Lecture Note in Computer Science, vol. 505, Springer, 83--100.Google Scholar
Cross Ref
- Kung, S.-Y., Arun, K. S., Gal-Ezer, R. J., and Bhaskar Rao, D. V. 1982. Wavefront array processor: Language, architecture, and applications. IEEE Trans. Comput. 31, 1054--1066. Google Scholar
Digital Library
- Lashgar, A. and Baniasadi, A. 2011. Performance in GPU architectures: Potentials and distances. In Proceedings of the 9th Workshop on Duplicating, Deconstructing, and Debunking (WDDD). IEEE, 75--81.Google Scholar
- Lawrie, D. H., Layman, T., Baer, D., and Randal, J. M. 1975. GLYPNIR-a programming language for ILLIAC IV. Comm. ACM 18, 3, 157--164. Google Scholar
Digital Library
- Lee, S., Min, S.-J., and Eigenmann, R. 2009. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proceedings of ACM SIGPLAN PPoPP. ACM, 101--110. Google Scholar
Digital Library
- Lee, Y., Avizienis, R., Bishara, A., Xia, R., Lockhart, D., Batten, C., and Asanovic, K. 2011. Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators. In Proceedings of the 38th International Symposium on Computer Architecture (ISCA). ACM, 129--140. Google Scholar
Digital Library
- Lee, Y., Krashinsky, R., Grover, V., Keckler, S. W., and Asanovic, K. 2013. Convergence and scalarization for data-parallel architectures. In Proceedings of the IEEE/ACM CGO. ACM, 1--11. Google Scholar
Digital Library
- Leiba, R., Hack, S., and Wald, I. 2012. Extending a C-like language for portable SIMD programming. In Proceedings of ACM SIGPLAN PPOPP. ACM, 65--74. Google Scholar
Digital Library
- Meng, J., Tarjan, D., and Skadron, K. 2010. Dynamic warp subdivision for integrated branch and memory divergence tolerance. In Proceedings of the International Symposium on Computer Architecture (ISCA). ACM, 235--246. Google Scholar
Digital Library
- Miné, A. 2006. The octagon abstract domain. Higher Order Symbol. Comput. 19, 31--100. Google Scholar
Digital Library
- Mu, S., Zhang, X., Zhang, N., Lu, J., Deng, Y. S., and Zhang, S. 2010. IP routing processing with graphic processors. In Proceedings of DATE. IEEE, 93--98. Google Scholar
Digital Library
- Nickolls, J. and Dally, W. J. 2010. The GPU computing era. IEEE Micro 30, 2, 56--69. Google Scholar
Digital Library
- Nickolls, J. and Kirk, D. 2009. Graphics and Computing GPUs. Computer Organization and Design, (Patterson and Hennessy) 4th Ed. Elsevier, Chapter A, A.1--A.77.Google Scholar
- Nielson, F., Nielson, H. R., and Hankin, C. 2005. Principles of Program Analysis. Springer. Google Scholar
Digital Library
- Ottenstein, K. J., Ballance, R. A., and MacCabe, A. B. 1990. The program dependence Web: A representation supporting control-, data-, and demand-driven interpretation of imperative languages. In Proceedings of ACM SIGPLAN PLDI. ACM, 257--271. Google Scholar
Digital Library
- Pereira, F. M. Q. 2011. Dirmap's Blog. http://divmap.wordpress.com/.Google Scholar
- Perrot, R. H. 1979. A language for array and vector processors. ACM Trans. Program. Lang. Syst. 1, 177--195. Google Scholar
Digital Library
- Pharr, M. and Mark, W. R. 2012. ISPC: A SPMD compiler for high-performance CPU programming. In Proceedings of Innorative Parallel Computing (InPar).Google Scholar
- Poletto, M. and Sarkar, V. 1999. Linear scan register allocation. ACM Trans. Program. Lang. Syst. 21, 5, 895--913. Google Scholar
Digital Library
- Prabhu, T., Ramalingam, S., Might, M., and Hall, M. 2011. EigenCFA: Accelerating flow analysis with GPUs. In Proceedings of POPL. ACM. Google Scholar
Digital Library
- Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., and mei W. Hwu, W. 2008. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of ACM SIGPLAN PPoPP. ACM, 73--82. Google Scholar
Digital Library
- Saha, B., Zhou, X., Chen, H., Gao, Y., Yan, S., Rajagopalan, M., Fang, J., Zhang, P., Ronen, R., and Mendelson, A. 2009. Programming model for a heterogeneous ×86 platform. In Proceedings of ACM SIGPLAN PLDI. ACM, 431--440. Google Scholar
Digital Library
- Samadi, M., Hormati, A., Mehrara, M., and Mahlke, S. 2012. Adaptive input-aware compilation for graphics engines. In Proceedings of ACM SIGPLAN PLDI. ACM. Google Scholar
Digital Library
- Sampaio, D., Martins, R., Collange, S., and Pereira, F. M. Q. 2012a. Divergence analysis with affine constraints. In Proceedings of the 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 137--146. Google Scholar
Digital Library
- Sampaio, D. N., Gedeon, E., Pereira, F. M. Q., and Collange, S. 2012b. Spill code placement for simd machines. In Proceedings of the 16th Brazilian Symposium on Language Programmings (SBLP). Lecture Notes on Computer Science, vol. 7554, Springer, 12--26. Google Scholar
Digital Library
- Sandes, E. F. O. and de Melo, A. C. M. 2010. CUDALIGN: Using GPU to accelerate the comparison of megabase genomic sequences. In Proceedings of ACM SIGPLAN PPoPP. ACM, 137--146. Google Scholar
Digital Library
- Scholz, B., Zhang, C., and Cifuentes, C. 2008. User-input dependence analysis via graph reachability. Tech. rep., Sun, Inc. Google Scholar
Digital Library
- Stratton, J. A., Grover, V., Marathe, J., Aarts, B., Murphy, M., Hu, Z., and Hwu, W.-m. W. 2010. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs. In Proceedings of IEEE/ACM CGO. IEEE, 111--119. Google Scholar
Digital Library
- Stratton, J. A., Rodrigues, C., Sun, I.-J., Obeid, N., Chang, L.-W., Anssari, N., Liu, G. D., and mei W. Hwu, W. 2012. The parboil report. Tech. rep., University of Illinois at Urbana-Champaign.Google Scholar
- Tu, P. and Padua, D. 1995. Efficient building and placing of gating functions. In Proceedings of ACM SIGPLAN PLDI. ACM, 47--55. Google Scholar
Digital Library
- Weiser, M. 1981. Program slicing. In Proceedings of ICSE. IEEE, 439--449. Google Scholar
Digital Library
- Yang, Y., Xiang, P., Kong, J., and Zhou, H. 2010. A GPGPU compiler for memory optimization and parallelism management. In Proceedings of ACM SIGPLAN PLDI. ACM, 86--97. Google Scholar
Digital Library
- Zhang, E. Z., Jiang, Y., Guo, Z., and Shen, X. 2010. Streamlining GPU applications on the fly: Thread divergence elimination through runtime thread-data remapping. In Proceedings of the 24th ACM International Conference on Supercomputing (ICS). ACM, 115--126. Google Scholar
Digital Library
- Zhang, E. Z., Jiang, Y., Guo, Z., Tian, K., and Shen, X. 2011. On-the-fly elimination of dynamic irregularities for GPU computing. In Proceedings of ASPLOS. ACM, 369--380. Google Scholar
Digital Library
- Zhang, Y. and Owens, J. D. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the IEEE 17th International Symposium on High Performance Computer Architecture (HPCA). ACM, 382--393. Google Scholar
Digital Library
Index Terms
Divergence analysis
Recommendations
Pointer-Based Divergence Analysis for OpenCL 2.0 Programs
A modern GPU is designed with many large thread groups to achieve a high throughput and performance. Within these groups, the threads are grouped into fixed-size SIMD batches in which the same instruction is applied to vectors of data in a lockstep. This ...
Divergence Analysis and Optimizations
PACT '11: Proceedings of the 2011 International Conference on Parallel Architectures and Compilation TechniquesThe growing interest in GPU programming has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers a tremendous computational power, however, the model also brings ...
Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads
Highlights- We propose a new thread synchronization heuristic called Min-SP/PC.
- Min-SP/PC ...
AbstractSimultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same processing unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is one ...






Comments