Abstract
Parallelizing programs and distributing their workloads across multiple threads can be a challenging task. In addition to multi-threading, harnessing the vector units in CPUs is highly desirable. However, employing vector units to speed up programs can be quite tedious: a developer either relies solely on the auto-vectorization capabilities of the compiler or applies vector intrinsics manually, which is extremely error-prone, difficult to maintain, and not portable at all.
Based on whole-function vectorization, a method to replace control flow with data flow, we propose auto-vectorization techniques for image processing DSLs in the context of source-to-source compilation. The approach does not require the input to be available in SSA form. We formulate constraints under which the vectorization analysis and code transformations may be greatly simplified for image processing DSLs. As part of our methodology, we present the control flow to data flow transformation as a source-to-source translation. In addition, we propose a method to efficiently analyze algorithms with mixed bit-width data types to determine the optimal SIMD width, independently of the target instruction set. The techniques are integrated into an open source DSL framework, and the resulting vectorization capabilities are compared to those of a variety of existing state-of-the-art C/C++ compilers. A geometric mean speedup of up to 3.14 over non-vectorized execution is observed for benchmarks taken from ISPC and image processing.
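The idea of replacing control flow with data flow can be illustrated with a minimal C sketch. It is not taken from the paper: the kernel, the function names `threshold_branch` and `threshold_select`, and the mask-based blend are hypothetical and only show the general if-conversion pattern. The branch is removed by computing both outcomes and selecting between them with a mask, which is the form that maps naturally onto SIMD blend operations.

```c
#include <stdio.h>

/* Hypothetical per-pixel kernel with control flow (illustration only). */
static int threshold_branch(int pixel, int limit) {
    if (pixel > limit)
        return 255;        /* "then" path */
    else
        return pixel / 2;  /* "else" path */
}

/* The same kernel after if-conversion: both paths are evaluated and a
 * mask selects the result, so no branch remains in the function body.
 * Assumes two's complement integers, so that -1 has all bits set. */
static int threshold_select(int pixel, int limit) {
    int mask = -(pixel > limit);  /* all ones if the condition holds, else 0 */
    int then_val = 255;
    int else_val = pixel / 2;
    return (then_val & mask) | (else_val & ~mask);
}
```

Calling both functions with the same arguments yields the same result for every input. The branch-free variant executes the same instruction sequence regardless of the pixel value, so several pixels can in principle be processed in lockstep, one per vector lane.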
Auto-vectorization for image processing DSLs
LCTES 2017: Proceedings of the 18th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems