ABSTRACT
SIMD (Single Instruction, Multiple Data) engines are an essential part of processors across computing markets, from servers to the embedded domain. Although SIMD-enabled architectures can boost the performance of many application domains by exploiting data-level parallelism, it is very challenging for both compilers and programmers to identify and transform the parts of a program that will benefit from a particular SIMD engine. This paper focuses on the problem of SIMDization for the growing application domain of streaming. Streaming applications are well suited to a wide range of targets, including shared- and distributed-memory multi-core systems, tiled architectures, and single-core systems. Since these architectures, in most cases, also provide SIMD acceleration units, it is highly beneficial to generate SIMD code from streaming programs. Specifically, we introduce MacroSS, which performs macro-SIMDization on high-level streaming graphs. Macro-SIMDization uses high-level information, such as the execution rates of actors and the communication patterns between them, to transform the graph structure, vectorize the actors of a streaming program, and generate intermediate code. We also propose low-overhead architectural modifications that accelerate the shuffling of data elements between the scalar and vectorized parts of a streaming program. Our experiments show that MacroSS generates code that, on average, outperforms scalar code compiled with current state-of-the-art auto-vectorizing compilers by 54%. Using the low-overhead data shuffling hardware, performance improves by an additional 8% with less than 1% area overhead.
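To make the idea of vectorizing an actor concrete, the following is a minimal sketch and not the MacroSS algorithm itself. It assumes a hypothetical stateless actor that pops two items and pushes one per firing, and a hypothetical SIMD width of four lanes. Plain Python lists stand in for vector registers, and the names `scalar_actor`, `vector_actor`, and `SIMD_WIDTH` are illustrative only. Four consecutive firings are merged into one vectorized firing, with lane k computing firing k; the strided reads model the kind of data shuffling needed when a scalar producer feeds a vectorized consumer.

```python
SIMD_WIDTH = 4  # assumed number of vector lanes (hypothetical)


def scalar_actor(tape, pos):
    """One firing of a hypothetical actor: pop 2 items, push 1."""
    a = tape[pos]
    b = tape[pos + 1]
    return a * b + 1


def run_scalar(tape, firings):
    """Fire the scalar actor `firings` times over the input tape."""
    return [scalar_actor(tape, 2 * i) for i in range(firings)]


def vector_actor(tape, pos):
    """SIMD_WIDTH firings at once; each list models one vector register."""
    # Strided gather: lane k receives the operands of firing k.
    a = [tape[pos + 2 * k] for k in range(SIMD_WIDTH)]
    b = [tape[pos + 2 * k + 1] for k in range(SIMD_WIDTH)]
    # One "vector multiply-add" replaces SIMD_WIDTH scalar ones.
    return [x * y + 1 for x, y in zip(a, b)]


def run_vectorized(tape, firings):
    """Fire the vectorized actor; the repetition count must be a
    multiple of SIMD_WIDTH, so a schedule would be scaled to fit."""
    assert firings % SIMD_WIDTH == 0
    out = []
    for block in range(firings // SIMD_WIDTH):
        # Each vectorized firing consumes 2 * SIMD_WIDTH input items.
        out.extend(vector_actor(tape, block * 2 * SIMD_WIDTH))
    return out


tape = list(range(16))
assert run_scalar(tape, 8) == run_vectorized(tape, 8)
```

Both schedules produce the same output stream; the vectorized version fires a quarter as often, at the cost of rearranging its inputs into lanes.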
MacroSS: macro-SIMDization of streaming applications (ASPLOS '10)