research-article

Dynamic trace-based analysis of vectorization potential of applications

Published: 11 June 2012

Abstract

Recent hardware trends with GPUs and the increasing vector lengths of SSE-like ISA extensions for multicore CPUs imply that effective exploitation of SIMD parallelism is critical for achieving high performance on emerging and future architectures. The vast majority of existing applications were developed without attention to the effective vectorizability of the code. While developers of production compilers such as GNU gcc, Intel icc, PGI pgcc, and IBM xlc have invested considerable effort and made significant advances in enhancing automatic vectorization capabilities, these compilers still cannot effectively vectorize many existing scientific and engineering codes. It is therefore of considerable interest to analyze existing applications to assess the inherent latent potential for SIMD parallelism, exploitable through further compiler advances and/or via manual code changes.

In this paper we develop an approach to infer a program's SIMD parallelization potential by analyzing the dynamic data-dependence graph derived from a sequential execution trace. By considering only the data dependences actually observed at run time, and by relaxing the execution order of operations to allow any dependence-preserving reordering, we can detect potential SIMD parallelism that may otherwise be missed by more conservative compile-time analyses. We show that for several benchmarks our tool discovers regions of code within computationally intensive loops that exhibit high potential for SIMD parallelism but are not vectorized by state-of-the-art compilers. We present several case studies of the use of the tool, both in identifying opportunities to enhance the transformation capabilities of vectorizing compilers and in pointing to code regions that can be manually modified to enable auto-vectorization and performance improvement by existing compilers.
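The core idea of the approach can be illustrated with a small sketch. This is not the authors' tool, just a minimal, hypothetical model: each traced operation is scheduled at the earliest step permitted by the flow dependences observed in the trace (anti- and output dependences are ignored here, since dependence-preserving reordering can eliminate them by renaming). Operations that land on the same level are mutually independent and thus candidates for SIMD execution.

```python
# Minimal sketch (assumed model, not the paper's implementation): infer latent
# SIMD parallelism from a dynamic trace by scheduling each operation at the
# earliest step allowed by its observed flow dependences.

def schedule_levels(trace):
    """trace: list of (op_id, reads, writes) tuples, in execution order.
    Returns {op_id: level}, the earliest dependence-preserving step."""
    last_write = {}  # memory location -> op that last wrote it
    level = {}
    for op, reads, writes in trace:
        # An op must follow the producer of every value it reads (flow deps).
        preds = [last_write[loc] for loc in reads if loc in last_write]
        level[op] = 1 + max((level[p] for p in preds), default=-1)
        for loc in writes:
            last_write[loc] = op
    return level

# Example: a[i] = b[i] + c[i] over 4 iterations -- no cross-iteration
# dependences, so all four adds land on level 0 and could run as one
# width-4 SIMD operation.
trace = [(i, (f"b{i}", f"c{i}"), (f"a{i}",)) for i in range(4)]
print(schedule_levels(trace))  # {0: 0, 1: 0, 2: 0, 3: 0}

# A sequential reduction (s += a[i]) instead chains through s, giving
# levels 0, 1, 2, 3 -- no SIMD potential without reassociation.
```

A static analysis must prove independence for all possible inputs; the trace-based view only requires it for the observed execution, which is what lets it expose potential parallelism that conservative compile-time analysis rejects.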



Published in

ACM SIGPLAN Notices, Volume 47, Issue 6 (PLDI '12), June 2012, 534 pages
ISSN: 0362-1340, EISSN: 1558-1160, DOI: 10.1145/2345156

PLDI '12: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2012, 572 pages
ISBN: 9781450312059, DOI: 10.1145/2254064

Copyright © 2012 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

