Abstract
Recent hardware trends with GPUs and the increasing vector lengths of SSE-like ISA extensions for multicore CPUs imply that effective exploitation of SIMD parallelism is critical for achieving high performance on emerging and future architectures. The vast majority of existing applications were developed without attention to the effective vectorizability of their code. While developers of production compilers such as GNU gcc, Intel icc, PGI pgcc, and IBM xlc have invested considerable effort and made significant advances in enhancing automatic vectorization capabilities, these compilers still cannot effectively vectorize many existing scientific and engineering codes. It is therefore of considerable interest to analyze existing applications to assess the inherent latent potential for SIMD parallelism, exploitable through further compiler advances and/or via manual code changes.
In this paper we develop an approach to infer a program's SIMD parallelization potential by analyzing the dynamic data-dependence graph derived from a sequential execution trace. By considering only the data dependences actually observed at run time for the trace, and by relaxing the execution order of operations to allow any dependence-preserving reordering, we can detect potential SIMD parallelism that may otherwise be missed by more conservative compile-time analyses. We show that for several benchmarks our tool discovers regions of code within computationally intensive loops that exhibit high potential for SIMD parallelism but are not vectorized by state-of-the-art compilers. We present several case studies of the use of the tool, both in identifying opportunities to enhance the transformation capabilities of vectorizing compilers, and in pointing to code regions to modify manually in order to enable auto-vectorization and performance improvement by existing compilers.
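The core idea described above — recording observed memory accesses from a sequential run and checking whether loop iterations carry dependences that would forbid SIMD packing — can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual tool: the trace format (`iteration`, `op`, `reads`, `writes`) and the function name `build_ddg` are assumptions for illustration only.

```python
# Minimal sketch (assumed trace format, not the paper's tool): build the
# cross-iteration edges of a dynamic data-dependence graph from a recorded
# execution trace. A loop whose observed trace has no cross-iteration
# dependences can in principle be reordered so that SIMD lanes process
# iterations in parallel.
from collections import defaultdict

def build_ddg(trace):
    """trace: list of (iteration, op, reads, writes); reads/writes are
    sets of memory addresses. Returns cross-iteration dependence edges
    (src_iteration, dst_iteration)."""
    last_writer = {}             # address -> iteration that last wrote it
    readers = defaultdict(set)   # address -> iterations reading since last write
    edges = set()
    for it, op, reads, writes in trace:
        for addr in reads:       # flow (RAW) dependences
            if addr in last_writer and last_writer[addr] != it:
                edges.add((last_writer[addr], it))
            readers[addr].add(it)
        for addr in writes:      # anti (WAR) and output (WAW) dependences
            for r in readers[addr]:
                if r != it:
                    edges.add((r, it))
            if addr in last_writer and last_writer[addr] != it:
                edges.add((last_writer[addr], it))
            last_writer[addr] = it
            readers[addr] = set()
    return edges

# Independent iterations (b[i] = a[i] + c): no cross-iteration edges.
independent = [(i, "load/add/store", {100 + i}, {200 + i}) for i in range(8)]
assert build_ddg(independent) == set()

# A recurrence (a[i] = a[i-1] + c) chains every iteration to the previous one.
recurrence = [(i, "load/add/store", {200 + i - 1}, {200 + i}) for i in range(1, 8)]
assert (1, 2) in build_ddg(recurrence)
```

Because the analysis uses only the dependences observed for one input, it is an optimistic upper bound on vectorizability, in contrast to the conservative must-hold reasoning a static compiler performs.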
Dynamic trace-based analysis of vectorization potential of applications. PLDI '12: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation.