Auto-vectorization of interleaved data for SIMD

Published: 11 June 2006
Abstract

Most implementations of the Single Instruction Multiple Data (SIMD) model available today require that data elements be packed in vector registers. Operations on disjoint vector elements are not supported directly and require explicit data reorganization. Computations on non-contiguous and especially interleaved data appear in important applications, which can benefit greatly from SIMD instructions once the data is reorganized properly. Vectorizing such computations efficiently is therefore an ambitious challenge for both programmers and vectorizing compilers. We demonstrate an automatic compilation scheme that supports effective vectorization in the presence of interleaved data with constant strides that are powers of 2, facilitating data reorganization. We show how our vectorization scheme applies to dominant SIMD architectures, and present experimental results on a wide range of key kernels, showing speedups in execution time of up to a factor of 3.7 for interleaving levels (strides) as high as 8.
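To make the notion of interleaved data concrete, the following C sketch shows a stride-2 kernel (complex values stored as alternating real and imaginary parts) written two ways: as a plain scalar loop, and restructured into the de-interleave, compute, re-interleave shape that a vectorizer would need to produce. This is an illustration only and not the scheme described in the paper; the function names, the kernel, and the block size `VF` are assumptions chosen for the example.

```c
#include <stddef.h>

/* Hypothetical vector length in elements; real targets differ. */
#define VF 4

/* Scalar reference: a[] holds interleaved (re, im) pairs, i.e. stride 2.
 * Each complex element is multiplied by the constant (sr + i*si). */
void scale_complex_scalar(const float *a, float *out, size_t n,
                          float sr, float si)
{
    for (size_t i = 0; i < n; i++) {
        float re = a[2 * i];
        float im = a[2 * i + 1];
        out[2 * i]     = re * sr - im * si;
        out[2 * i + 1] = re * si + im * sr;
    }
}

/* Same computation, restructured in blocks of VF elements:
 *   1. de-interleave: gather even (re) and odd (im) elements,
 *   2. compute on the now-contiguous blocks,
 *   3. re-interleave the results on store.
 * The inner k-loops stand in for single vector operations. */
void scale_complex_blocked(const float *a, float *out, size_t n,
                           float sr, float si)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF) {
        float re[VF], im[VF], ore[VF], oim[VF];

        for (size_t k = 0; k < VF; k++) {      /* extract even / odd */
            re[k] = a[2 * (i + k)];
            im[k] = a[2 * (i + k) + 1];
        }
        for (size_t k = 0; k < VF; k++) {      /* element-wise compute */
            ore[k] = re[k] * sr - im[k] * si;
            oim[k] = re[k] * si + im[k] * sr;
        }
        for (size_t k = 0; k < VF; k++) {      /* interleave on store */
            out[2 * (i + k)]     = ore[k];
            out[2 * (i + k) + 1] = oim[k];
        }
    }
    /* Scalar epilogue for the remaining n % VF elements. */
    for (; i < n; i++) {
        float re = a[2 * i];
        float im = a[2 * i + 1];
        out[2 * i]     = re * sr - im * si;
        out[2 * i + 1] = re * si + im * sr;
    }
}
```

Both functions produce the same output. The blocked form simply makes the extra reorganization steps visible, which is the overhead a compiler must keep small for interleaved accesses to be worth vectorizing.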




Published in

ACM SIGPLAN Notices, Volume 41, Issue 6
Proceedings of the 2006 PLDI Conference
June 2006
426 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1133255

PLDI '06: Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2006
438 pages
ISBN: 1595933204
DOI: 10.1145/1133981

Copyright © 2006 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
