Architecture-Adaptive Code Variant Tuning

Abstract
Code variants represent alternative implementations of a computation, and are common in high-performance libraries and applications to facilitate selecting the most appropriate implementation for a specific execution context (target architecture and input dataset). Automating code variant selection typically relies on machine learning to construct a model during an offline learning phase that can be quickly queried at runtime once the execution context is known. In this paper, we define a new approach called architecture-adaptive code variant tuning, where the variant selection model is learned on a set of source architectures, and then used to predict variants on a new target architecture without having to repeat the training process. We pose this as a multi-task learning problem, where each source architecture corresponds to a task; we use device features in the construction of the variant selection model. This work explores the effectiveness of multi-task learning and the impact of different strategies for device feature selection. We evaluate our approach on a set of benchmarks and a collection of six NVIDIA GPU architectures from three distinct generations. We achieve performance results that are mostly comparable to the previous approach of tuning for a single GPU architecture without having to repeat the learning phase.
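The multi-task framing described above can be illustrated with a minimal sketch. This is not the paper's actual model: it assumes a hypothetical joint feature space in which each source GPU's device features are concatenated with per-input features, and uses a simple 1-nearest-neighbor lookup so that a variant can be predicted for an unseen target GPU without retraining. All device names, feature values, and variant labels are invented for illustration.

```python
# Hedged sketch of architecture-adaptive variant selection as multi-task
# learning: each source GPU is a "task", and its training rows pair device
# features with input features and the best-performing variant. A single
# nearest-neighbor model over the joint feature space can then answer
# queries for a target GPU that was never benchmarked.

def distance(a, b):
    # Squared Euclidean distance in the joint feature space.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(rows):
    # rows: list of (device_features, input_features, best_variant).
    # "Training" here just stores joint feature vectors with their labels.
    return [(list(d) + list(i), v) for d, i, v in rows]

def predict(model, device_features, input_features):
    # Pick the variant whose training context is closest to the query.
    query = list(device_features) + list(input_features)
    return min(model, key=lambda entry: distance(entry[0], query))[1]

# Source architectures: (sm_count, normalized_mem_bandwidth) -- hypothetical.
training = [
    ((14, 0.5), (0.9,), "variant_shared_mem"),    # "GPU A", dense input
    ((14, 0.5), (0.1,), "variant_global_mem"),    # "GPU A", sparse input
    ((24, 0.9), (0.9,), "variant_warp_shuffle"),  # "GPU B", dense input
    ((24, 0.9), (0.1,), "variant_global_mem"),    # "GPU B", sparse input
]

model = train(training)
# Unseen target GPU whose device features resemble "GPU B", dense input:
choice = predict(model, (22, 0.85), (0.9,))
print(choice)  # variant_warp_shuffle
```

In the real system, the device features would come from hardware queries (e.g. SM count, memory bandwidth) and the learned model would be far richer than nearest-neighbor; the sketch only shows why putting device features into the feature vector lets one model generalize across architectures.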
Published in ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems.