Research Article (Open Access)

An empirical study of the effect of source-level loop transformations on compiler stability

Published: 24 October 2018

Abstract

Modern compiler optimization is a complex process that offers no guarantee of delivering the fastest, most efficient target code. For this reason, compilers struggle to produce stable performance across versions of code that carry out the same computation and differ only in the order of operations. This instability makes compilers much less effective as program optimization tools and often forces programmers to carry out a brute-force search when tuning for performance. In this paper, we analyze the stability of the compilation process and the performance headroom of three widely used general-purpose compilers: GCC, ICC, and Clang. For the study, we extracted over 1,000 for loop nests from well-known benchmarks, libraries, and real applications; then, we applied sequences of source-level loop transformations to these loop nests to create numerous semantically equivalent mutations; finally, we analyzed the impact of the transformations on code quality in terms of locality, dynamic instruction count, and vectorization. Our results show that, by applying source-to-source transformations and searching for the best vectorization setting, the percentage of loops sped up by at least 1.15x is 46.7% for GCC, 35.7% for ICC, and 46.5% for Clang, and on average the potential for performance improvement is estimated to be at least 23.7% for GCC, 18.1% for ICC, and 26.4% for Clang. Our stability analysis shows that, under our experimental setup, the average coefficient of variation of the execution time across all mutations is 18.2% for GCC, 19.5% for ICC, and 16.9% for Clang, and the highest coefficient of variation for a single loop nest reaches 118.9% for GCC, 124.3% for ICC, and 110.5% for Clang. We conclude that the evaluated compilers need further improvement before they can claim stable behavior.

Skip Supplemental Material Section

Supplemental Material

a126-gong.webm

