research-article
Open Access
Artifacts Evaluated & Functional

goSLP: globally optimized superword level parallelism framework

Published: 24 October 2018

Abstract

Modern microprocessors are equipped with single instruction multiple data (SIMD) or vector instruction sets which allow compilers to exploit superword level parallelism (SLP), a type of fine-grained parallelism. Current SLP auto-vectorization techniques use heuristics to discover vectorization opportunities in high-level language code. These heuristics are fragile, local and typically only present one vectorization strategy that is either accepted or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization framework which solves the statement packing problem in a pairwise optimal manner. Using an integer linear programming (ILP) solver, goSLP searches the entire space of statement packing opportunities for a whole function at a time, while limiting total compilation time to a few minutes. Furthermore, goSLP optimally solves the vector permutation selection problem using dynamic programming. We implemented goSLP in the LLVM compiler infrastructure, achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp and 4.07% on NAS benchmarks compared to LLVM’s existing SLP auto-vectorizer.
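
The contrast the abstract draws between local heuristics and a global search can be illustrated with a toy example. The sketch below is not goSLP's formulation: it assumes a hypothetical table of candidate statement pairs with made-up cost savings, and it substitutes exhaustive enumeration for an ILP solver. The names `best_packing`, `greedy_packing`, and `CANDIDATES` are illustrative only. The single constraint modeled is that each statement may join at most one pack.

```python
from itertools import combinations


def best_packing(candidates):
    """Exhaustively choose non-overlapping pairs with the largest total saving.

    candidates maps a (stmt_a, stmt_b) tuple to a hypothetical saving.
    An ILP would model the same choice with one 0/1 variable per pair.
    """
    pairs = list(candidates)
    best_saving, best_choice = 0, ()
    for r in range(1, len(pairs) + 1):
        for choice in combinations(pairs, r):
            used = [stmt for pair in choice for stmt in pair]
            if len(used) != len(set(used)):
                continue  # some statement would sit in two packs
            saving = sum(candidates[pair] for pair in choice)
            if saving > best_saving:
                best_saving, best_choice = saving, choice
    return best_saving, set(best_choice)


def greedy_packing(candidates):
    """Local heuristic: repeatedly take the most profitable pair that still fits."""
    used, chosen, total = set(), set(), 0
    for pair, saving in sorted(candidates.items(), key=lambda kv: -kv[1]):
        if saving > 0 and used.isdisjoint(pair):
            used.update(pair)
            chosen.add(pair)
            total += saving
    return total, chosen


# Hypothetical savings for packing adjacent statements s1..s4.
CANDIDATES = {("s1", "s2"): 3, ("s2", "s3"): 4, ("s3", "s4"): 3}

if __name__ == "__main__":
    print("global:", best_packing(CANDIDATES))
    print("greedy:", greedy_packing(CANDIDATES))
```

On this input the greedy pass commits to the single most profitable pair, which blocks both remaining pairs, while the global search finds the two disjoint pairs with a higher combined saving. Exhaustive enumeration grows exponentially with the number of candidates, which is one reason a solver-based formulation is attractive at whole-function scale.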

Supplemental Material

a110-mendis.webm
