DOI: 10.1145/1736020.1736053 (ASPLOS conference proceedings, research article)

MacroSS: macro-SIMDization of streaming applications

Published: 13 March 2010

ABSTRACT

SIMD (Single Instruction, Multiple Data) engines are an essential part of the processors in various computing markets, from servers to the embedded domain. Although SIMD-enabled architectures can boost the performance of many application domains by exploiting data-level parallelism, it is very challenging for both compilers and programmers to identify and transform the parts of a program that will benefit from a particular SIMD engine. This paper focuses on the problem of SIMDization for the growing application domain of streaming. Streaming applications are an ideal fit for a range of target architectures, including shared/distributed memory multi-core systems, tiled architectures, and single-core systems. Since these architectures, in most cases, also provide SIMD acceleration units, it is highly beneficial to generate SIMD code from streaming programs. Specifically, we introduce MacroSS, which performs macro-SIMDization on high-level streaming graphs. Macro-SIMDization uses high-level information, such as the execution rates of actors and the communication patterns between them, to transform the graph structure, vectorize the actors of a streaming program, and generate intermediate code. We also propose low-overhead architectural modifications that accelerate the shuffling of data elements between the scalar and vectorized parts of a streaming program. Our experiments show that MacroSS generates code that, on average, outperforms scalar code compiled with current state-of-the-art auto-vectorizing compilers by 54%. Using the low-overhead data shuffling hardware, performance improves by an additional 8% with less than 1% area overhead.
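
As a rough illustration of what vectorizing an actor means, the following minimal Python sketch may help. It is not MacroSS's implementation: it assumes a hypothetical stateless actor that pops one item and pushes one item per firing, and it models a SIMD width of 4 with plain lists. The names `SIMD_WIDTH`, `scale_offset`, and `scale_offset_vec` are invented for this example. The vectorized actor merges four consecutive firings into one and applies each operation lane-wise, so it should produce the same output stream as the scalar version.

```python
SIMD_WIDTH = 4  # assumed vector width for illustration; real hardware varies


def scale_offset(x):
    """Hypothetical stateless scalar actor: pops 1 item, pushes 1 item."""
    return 2.0 * x + 1.0


def run_scalar(stream):
    """Fire the scalar actor once per input item."""
    return [scale_offset(x) for x in stream]


def scale_offset_vec(lanes):
    """Vectorized actor: one firing processes SIMD_WIDTH items lane-wise."""
    assert len(lanes) == SIMD_WIDTH
    doubled = [2.0 * v for v in lanes]  # models a single SIMD multiply
    return [v + 1.0 for v in doubled]   # models a single SIMD add


def run_vectorized(stream):
    """Fire the vectorized actor on full groups; leftover items run scalar."""
    out = []
    full = len(stream) - len(stream) % SIMD_WIDTH
    for i in range(0, full, SIMD_WIDTH):
        out.extend(scale_offset_vec(stream[i:i + SIMD_WIDTH]))
    out.extend(scale_offset(x) for x in stream[full:])
    return out


data = [float(i) for i in range(10)]
scalar_out = run_scalar(data)
vector_out = run_vectorized(data)
```

For the same input, the vectorized actor here fires roughly a quarter as often as the scalar one. Stateful actors, rate mismatches between neighboring actors, and the shuffling of data between scalar and vector parts of the graph are deliberately left out of this sketch.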
