Abstract
It is often impossible to obtain a one-size-fits-all solution for high-performance algorithms when considering different choices for data distributions, parallelism, transformations, and blocking. The best solution to these choices is often tightly coupled to the architecture, problem size, input data, and available system resources. In some cases, completely different algorithms may provide the best performance. Current compiler and programming-language techniques can change some of these parameters, but today there is no simple way for the programmer to express, or for the compiler to choose, different algorithms to handle different parts of the data. Existing solutions normally handle only coarse-grained, library-level selections or hand-coded cutoffs between base cases and recursive cases.
We present PetaBricks, a new implicitly parallel language and compiler in which having multiple implementations of multiple algorithms to solve a problem is the natural way of programming. We make algorithmic choice a first-class construct of the language, and choices are exposed in a way that also allows our compiler to tune at a finer granularity. The PetaBricks compiler autotunes programs by making both fine-grained and algorithmic choices. Choices also include different automatic parallelization techniques, data distributions, algorithmic parameters, transformations, and blocking.
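To make the idea of tunable algorithmic choice concrete, the following is a minimal plain C++ sketch (not PetaBricks syntax) of the kind of choice the autotuner explores for a sort: two candidate algorithms plus a switch-over point. The constant SORT_CUTOFF and the particular pair of algorithms are illustrative assumptions; in PetaBricks the compiler and autotuner would select both the algorithms and the cutoff rather than having them hand-coded.

```cpp
// Hybrid sort exposing the kind of choice PetaBricks tunes automatically:
// which algorithm to use, and where to switch between them.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t SORT_CUTOFF = 64;  // hypothetical tuned parameter

// Candidate algorithm 1: insertion sort, good for small ranges.
void insertionSort(std::vector<double>& a, std::size_t lo, std::size_t hi) {
    for (std::size_t i = lo + 1; i < hi; ++i) {
        double key = a[i];
        std::size_t j = i;
        while (j > lo && a[j - 1] > key) { a[j] = a[j - 1]; --j; }
        a[j] = key;
    }
}

// Candidate algorithm 2: recursive merge sort above the cutoff.
void hybridSort(std::vector<double>& a, std::size_t lo, std::size_t hi) {
    if (hi - lo <= SORT_CUTOFF) {   // algorithmic choice point
        insertionSort(a, lo, hi);
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    hybridSort(a, lo, mid);         // in PetaBricks these recursive calls
    hybridSort(a, mid, hi);         // could also be parallelized automatically
    std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}

int main() {
    std::vector<double> v = {5, 3, 8, 1, 9, 2};
    hybridSort(v, 0, v.size());
}
```

Here the cutoff is fixed at compile time; the point of PetaBricks is that such parameters, and the choice of algorithms themselves, become tunable per architecture and input size instead of being baked into the source.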
Additionally, we introduce novel techniques to autotune algorithms for different convergence criteria. When choosing between various direct and iterative methods, the PetaBricks compiler can tune a program so that it delivers near-optimal efficiency for any desired level of accuracy. The compiler can also apply different convergence criteria to the various components within a single algorithm, giving the user accuracy choice alongside algorithmic choice.
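The toy C++ sketch below (not one of the paper's benchmarks) illustrates what accuracy-driven choice means: the same quantity is computed either by a "direct" library routine or by an iterative scheme whose stopping tolerance is the requested accuracy. The crossover threshold EPS_DIRECT is a hypothetical tuned parameter; in PetaBricks the autotuner would choose the method and tolerance for each component.

```cpp
// Accuracy-driven choice between a direct and an iterative method for x*x = c.
#include <cmath>
#include <cstdio>

// Direct choice: call the library routine, full double precision.
double sqrtDirect(double c) { return std::sqrt(c); }

// Iterative choice: Newton's method, stops once the requested accuracy is met.
double sqrtIterative(double c, double eps) {
    double x = c > 1.0 ? c : 1.0;        // initial guess
    while (std::fabs(x * x - c) > eps)
        x = 0.5 * (x + c / x);           // Newton update for x^2 - c = 0
    return x;
}

// Hypothetical tuned cutoff: below this accuracy the direct method wins.
constexpr double EPS_DIRECT = 1e-12;

double mySqrt(double c, double eps) {
    return (eps <= EPS_DIRECT) ? sqrtDirect(c) : sqrtIterative(c, eps);
}

int main() {
    std::printf("%.10f\n", mySqrt(2.0, 1e-6));   // loose accuracy: iterative
    std::printf("%.15f\n", mySqrt(2.0, 1e-15));  // tight accuracy: direct
}
```

The analogy to the paper's setting is that, for a given accuracy target, the fastest method may change (e.g., iterative solvers for loose tolerances, direct solvers for tight ones), and the tuner can make that choice per component rather than globally.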