Abstract
Modern compiler optimization is a complex process that offers no guarantee of delivering the fastest, most efficient target code. As a result, compilers struggle to deliver stable performance across versions of code that carry out the same computation and differ only in the order of operations. This instability makes compilers much less effective as program optimization tools and often forces programmers into a brute-force search when tuning for performance. In this paper, we analyze the stability of the compilation process and the performance headroom of three widely used general-purpose compilers: GCC, ICC, and Clang. For the study, we extracted over 1,000 for-loop nests from well-known benchmarks, libraries, and real applications; we then applied sequences of source-level loop transformations to these loop nests to create numerous semantically equivalent mutations; finally, we analyzed the impact of the transformations on code quality in terms of locality, dynamic instruction count, and vectorization. Our results show that, by applying source-to-source transformations and searching for the best vectorization setting, the percentage of loops sped up by at least 1.15x is 46.7% for GCC, 35.7% for ICC, and 46.5% for Clang; on average, the potential for performance improvement is estimated to be at least 23.7% for GCC, 18.1% for ICC, and 26.4% for Clang. Our stability analysis shows that, under our experimental setup, the average coefficient of variation of the execution time across all mutations is 18.2% for GCC, 19.5% for ICC, and 16.9% for Clang, and the highest coefficient of variation for a single loop nest reaches 118.9% for GCC, 124.3% for ICC, and 110.5% for Clang. We conclude that the evaluated compilers need further improvement before they can claim stable behavior.
An empirical study of the effect of source-level loop transformations on compiler stability