
FlexVec: auto-vectorization for irregular loops

Published: 02 June 2016

Abstract

Traditional vectorization techniques build a dependence graph with distance and direction information to determine whether a loop is vectorizable. Since vectorization reorders the execution of instructions across iterations, instructions involved in a strongly connected component (SCC) are in general deemed not vectorizable unless the SCC can be eliminated using techniques such as scalar expansion or privatization. Traditional vectorization techniques are therefore limited in their ability to efficiently handle loops with dynamic cross-iteration dependencies or with complex control flow interwoven within the dependence cycles. When the potential dependencies occur only rarely at runtime, the end result is underutilization of the SIMD hardware. In this paper, we propose the FlexVec architecture, which combines new vector instructions with novel code generation techniques to dynamically adjust the vector length for loop statements affected by cross-iteration dependencies that occur at runtime. We have designed and implemented FlexVec's new ISA as extensions to the recently released AVX-512 ISA. We have evaluated the performance improvements enabled by FlexVec vectorization on 11 C/C++ SPEC 2006 benchmarks and 7 real applications, with AVX-512 vectorization as the baseline. We show that the FlexVec vectorization technique produces a geomean speedup of 9% for SPEC 2006 and a geomean speedup of 11% for the 7 real applications.


Published in

ACM SIGPLAN Notices, Volume 51, Issue 6 (PLDI '16), June 2016, 726 pages. ISSN 0362-1340, EISSN 1558-1160. DOI: 10.1145/2980983. Editor: Andy Gill.

PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2016, 726 pages. ISBN 9781450342612. DOI: 10.1145/2908080. General Chair: Chandra Krintz. Program Chair: Emery Berger.

Copyright © 2016 ACM. Publisher: Association for Computing Machinery, New York, NY, United States.
