Abstract
Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memory access patterns in the data stream. This paper describes data transformations that allow us to vectorize loops targeting massively multithreaded data parallel architectures. We present a mathematical model that captures loop-based memory access patterns and computes the most appropriate data transformations in order to enable vectorization. Our experimental results show that the proposed data transformations can significantly increase the number of loops that can be vectorized and enhance the data-level parallelism of applications. Our results also show that the overhead associated with our data transformations can be easily amortized as the size of the input data set increases. For the set of high performance benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4X) by applying vectorization using our data transformation approach.
Data transformations enabling loop vectorization on multithreaded data parallel architectures. In PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.