Abstract
Traditional vectorization techniques build a dependence graph with distance and direction information to determine whether a loop is vectorizable. Since vectorization reorders the execution of instructions across iterations, instructions involved in a strongly connected component (SCC) are in general deemed not vectorizable unless the SCC can be eliminated using techniques such as scalar expansion or privatization. Traditional vectorization techniques are therefore limited in their ability to efficiently handle loops with dynamic cross-iteration dependencies or with complex control flow interwoven within the dependence cycles. When the potential dependencies occur only rarely at runtime, the end result is underutilization of the SIMD hardware. In this paper, we propose the FlexVec architecture, which combines new vector instructions with novel code generation techniques to dynamically adjust the vector length for loop statements affected by cross-iteration dependencies that manifest at runtime. We have designed and implemented FlexVec's new ISA as an extension to the recently released AVX-512 ISA. We have evaluated the performance improvements enabled by FlexVec vectorization on 11 C/C++ SPEC 2006 benchmarks and 7 real applications, with AVX-512 vectorization as the baseline. We show that FlexVec vectorization produces a geomean speedup of 9% for SPEC 2006 and a geomean speedup of 11% for the 7 real applications.
FlexVec: auto-vectorization for irregular loops. In PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation.