Abstract
Array indirection poses several challenges for compilers that exploit single instruction, multiple data (SIMD) instructions: disjoint memory references, arbitrarily misaligned memory references, and dependence cycles in loops. Because of these challenges, existing SIMD compilers exclude loops containing array indirection from their candidates for SIMD vectorization. Addressing them is nonetheless unavoidable, since many important compute-intensive applications use array indirection extensively to reduce memory and computation requirements. In this work, we propose a method to generate efficient SIMD code for loops containing indirect memory references. We extract both inter- and intra-iteration parallelism while taking data reorganization overhead into account, and we place data reorganization code so that its overhead is amortized by the performance gain of SIMD vectorization. Experiments on four array indirection kernels extracted from real-world scientific applications show that the proposed method effectively generates SIMD code for irregular kernels, improving their performance by 91% on average over existing SIMD vectorization methods.
References
- R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann, 2002.
- R. Barik, J. Zhao, and V. Sarkar. Efficient selection of vector instructions using dynamic programming. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-43, pages 201--212, 2010.
- H. Chang and W. Sung. Efficient vectorization of SIMD programs with non-aligned and irregular data access hardware. In Proceedings of the 2008 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES '08, pages 167--176, 2008.
- R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang. Communication optimizations for irregular scientific computations on distributed memory architectures. J. Parallel Distrib. Comput., 22:462--478, Sep. 1994.
- K. Diefendorff, P. K. Dubey, R. Hochsprung, and H. Scales. AltiVec extension to PowerPC accelerates media processing. IEEE Micro, 20:85--95, Mar./Apr. 2000.
- A. E. Eichenberger, P. Wu, and K. O'Brien. Vectorization for SIMD architectures with alignment constraints. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, PLDI '04, pages 82--93, 2004.
- T. Grosser, H. Zheng, R. A, A. Simbürger, A. Größlinger, and L.-N. Pouchet. Polly - Polyhedral optimization in LLVM. In First International Workshop on Polyhedral Compilation Techniques (IMPACT '11), 2011.
- M. Gschwind, H. P. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki. Synergistic processing in Cell's multicore architecture. IEEE Micro, 26:10--24, Mar. 2006.
- J. L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34:1--17, Sep. 2006.
- A. Krall and S. Lelait. Compilation techniques for multimedia processors. Int. J. Parallel Program., 28:347--361, Aug. 2000.
- S. Larsen and S. Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI '00, pages 145--156, 2000.
- S. Larsen, R. Rabbah, and S. Amarasinghe. Exploiting vector parallelism in software pipelined loops. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-38, pages 119--129, 2005.
- C. Lattner. Macroscopic Data Structure Analysis and Optimization. PhD thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL, May 2005. [online] http://llvm.cs.uiuc.edu.
- C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization, CGO '04, Palo Alto, California, Mar. 2004.
- R. Leupers. Code selection for media processors with SIMD instructions. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE '00, pages 4--8, 2000.
- D. Naishlos, M. Biberstein, S. Ben-David, and A. Zaks. Vectorizing for a SIMdD DSP architecture. In Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES '03, pages 2--11, 2003.
- I. Pryanishnikov, A. Krall, and N. Horspool. Pointer alignment analysis for processors with SIMD instructions. In Proceedings of the 5th Workshop on Media and Streaming Processors, pages 50--57, 2003.
- G. Ren, P. Wu, and D. Padua. A preliminary study on the vectorization of multimedia applications for multimedia extensions. In Languages and Compilers for Parallel Computing, volume 2958 of Lecture Notes in Computer Science, pages 420--435, 2004.
- G. Ren, P. Wu, and D. Padua. Optimizing data permutations for SIMD devices. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '06, pages 118--131, 2006.
- I. Rosen, D. Nuzman, and A. Zaks. Loop-aware SLP in GCC. In Proceedings of the GCC Developers' Summit, pages 131--142, 2007.
- J. Shalf, S. Dosanjh, and J. Morrison. Exascale computing technology challenges. In Proceedings of the International Meeting on High Performance Computing for Computational Science, volume 6449 of Lecture Notes in Computer Science, pages 1--25, 2011.
- N. Sreraman and R. Govindarajan. A vectorizing compiler for multimedia extensions. Int. J. Parallel Program., 28:363--400, Aug. 2000.
- R. Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146--160, 1972.
- D. Walls. How to use the restrict qualifier in C. Sun Microsystems, Sun Developer Network (SDN), March 2006. [online] http://developers.sun.com/.
- P. Wu, A. E. Eichenberger, A. Wang, and P. Zhao. An integrated simdization framework using virtual vectors. In Proceedings of the 19th Annual International Conference on Supercomputing, ICS '05, pages 169--178, 2005.