Abstract
Modern microprocessors are equipped with single instruction multiple data (SIMD) or vector instruction sets which allow compilers to exploit superword level parallelism (SLP), a type of fine-grained parallelism. Current SLP auto-vectorization techniques use heuristics to discover vectorization opportunities in high-level language code. These heuristics are fragile, local and typically only present one vectorization strategy that is either accepted or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization framework which solves the statement packing problem in a pairwise optimal manner. Using an integer linear programming (ILP) solver, goSLP searches the entire space of statement packing opportunities for a whole function at a time, while limiting total compilation time to a few minutes. Furthermore, goSLP optimally solves the vector permutation selection problem using dynamic programming. We implemented goSLP in the LLVM compiler infrastructure, achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp and 4.07% on NAS benchmarks compared to LLVM’s existing SLP auto-vectorizer.
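To make the contrast between a local heuristic and a whole-function search concrete, here is a minimal sketch in Python. It is not goSLP's formulation: the abstract states that goSLP uses an ILP solver, whereas this toy simply enumerates every set of candidate pairs by brute force. The statement names (`s1` to `s4`), the `SAVINGS` table, and all function names are hypothetical, and the cost numbers are invented for illustration. Dependencies and packing or unpacking costs are not modeled.

```python
from itertools import combinations

# Hypothetical savings (scalar cost minus vector cost) for packing two
# statements into one vector instruction. These numbers are made up.
SAVINGS = {
    ("s1", "s2"): 3,
    ("s2", "s3"): 4,
    ("s3", "s4"): 3,
}


def greedy_packing(savings):
    """Local heuristic: take the most profitable pair first, then any
    remaining pair that does not reuse an already packed statement."""
    used, chosen = set(), []
    for pair, gain in sorted(savings.items(), key=lambda kv: -kv[1]):
        if gain > 0 and used.isdisjoint(pair):
            chosen.append(pair)
            used.update(pair)
    return chosen


def exhaustive_packing(savings):
    """Global search: try every subset of candidate pairs and keep the
    most profitable one in which no statement appears twice."""
    pairs = list(savings)
    best, best_gain = [], 0
    for k in range(1, len(pairs) + 1):
        for subset in combinations(pairs, k):
            stmts = [s for pair in subset for s in pair]
            if len(stmts) != len(set(stmts)):
                continue  # a statement may belong to at most one pack
            gain = sum(savings[p] for p in subset)
            if gain > best_gain:
                best, best_gain = list(subset), gain
    return best


def total_gain(packing, savings):
    """Sum the savings of the chosen pairs."""
    return sum(savings[p] for p in packing)
```

On this input the greedy choice commits to `("s2", "s3")` and blocks both neighbouring pairs, while the exhaustive search finds the two disjoint pairs with a higher combined saving. Brute-force enumeration grows exponentially with the number of candidate pairs, which is one reason a real implementation would hand the problem to a solver instead.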