ABSTRACT
The widespread presence of SIMD devices in today's microprocessors has made compiler techniques for these devices tremendously important. One of the most important and difficult issues that must be addressed by these techniques is the generation of the data permutation instructions needed for non-contiguous and misaligned memory references. These instructions are expensive and, therefore, it is of crucial importance to minimize their number to improve performance and, in many cases, enable speedups over scalar code.

Although it is often difficult to optimize an isolated data reorganization operation, a collection of related data permutations can often be manipulated to reduce the number of operations. This paper presents a strategy to optimize all forms of data permutations. The strategy is organized into three steps. First, all data permutations in the source program are converted into a generic representation. These permutations can originate from vector accesses to non-contiguous and misaligned memory locations or result from compiler transformations. Second, an optimization algorithm is applied to reduce the number of data permutations in a basic block. By propagating permutations across statements and merging consecutive permutations whenever possible, the algorithm can significantly reduce the number of data permutations. Finally, a code generation algorithm translates generic permutation operations into native permutation instructions for the target platform. Experiments were conducted on various kinds of applications. The results show that up to 77% of the permutation instructions are eliminated and, as a result, the average performance improvement is 48% on VMX and 68% on SSE2. For several applications, near perfect speedups have been achieved on both platforms.
Optimizing data permutations for SIMD devices
Proceedings of the 2006 PLDI Conference