Abstract
Most implementations of the Single Instruction Multiple Data (SIMD) model available today require that data elements be packed in vector registers. Operations on disjoint vector elements are not supported directly and require explicit data reorganization. Computations on non-contiguous and especially interleaved data appear in important applications, which can benefit greatly from SIMD instructions once the data is reorganized properly. Vectorizing such computations efficiently is therefore a challenge for both programmers and vectorizing compilers. We demonstrate an automatic compilation scheme that supports effective vectorization in the presence of interleaved data with constant strides that are powers of 2, facilitating data reorganization. We show how our vectorization scheme applies to dominant SIMD architectures, and present experimental results on a wide range of key kernels, with speedups in execution time of up to 3.7 for interleaving levels (strides) as high as 8.
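To make the idea of interleaved data with a power-of-2 stride concrete, the following is a minimal sketch in C, not the paper's implementation. It assumes a stride-2 layout (real and imaginary parts stored alternately) and a hypothetical vector length VF of 4. The helpers extract_even and extract_odd are illustrative names that model, in scalar code, the kind of data reorganization a compiler would have to emit after two contiguous vector loads.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4 /* hypothetical vector length (elements per register) */

/* Scalar reference: squared magnitude of n complex values stored
 * interleaved as re, im, re, im, ... (stride 2). */
void sqmag_scalar(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float re = in[2 * i];
        float im = in[2 * i + 1];
        out[i] = re * re + im * im;
    }
}

/* Hypothetical helper: gather the even-indexed elements of two
 * VF-element "vectors" a and b into one VF-element result. */
static void extract_even(const float *a, const float *b, float *dst) {
    for (size_t k = 0; k < VF / 2; k++) {
        dst[k] = a[2 * k];
        dst[VF / 2 + k] = b[2 * k];
    }
}

/* Hypothetical helper: same as above for the odd-indexed elements. */
static void extract_odd(const float *a, const float *b, float *dst) {
    for (size_t k = 0; k < VF / 2; k++) {
        dst[k] = a[2 * k + 1];
        dst[VF / 2 + k] = b[2 * k + 1];
    }
}

/* Blocked version that mirrors a vectorized loop: two contiguous
 * "loads" of VF elements, a reorganization step that separates the
 * interleaved streams, then element-wise arithmetic on packed data.
 * Assumes n is a multiple of VF (no remainder loop in this sketch). */
void sqmag_blocked(const float *in, float *out, size_t n) {
    assert(n % VF == 0);
    for (size_t i = 0; i < n; i += VF) {
        const float *v0 = in + 2 * i;      /* first contiguous load  */
        const float *v1 = in + 2 * i + VF; /* second contiguous load */
        float re[VF], im[VF];
        extract_even(v0, v1, re); /* packed real parts      */
        extract_odd(v0, v1, im);  /* packed imaginary parts */
        for (size_t k = 0; k < VF; k++) {
            out[i + k] = re[k] * re[k] + im[k] * im[k];
        }
    }
}
```

For a stride of 2, two loads and two extract steps yield two fully packed vectors. Larger power-of-2 strides could plausibly be handled by applying the same even/odd separation repeatedly, although the exact sequence a given compiler or target would use is not specified here.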
Auto-vectorization of interleaved data for SIMD
PLDI '06: Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation