research-article | Public Access
Architecture-Adaptive Code Variant Tuning

Published: 25 March 2016

Abstract

Code variants represent alternative implementations of a computation, and are common in high-performance libraries and applications to facilitate selecting the most appropriate implementation for a specific execution context (target architecture and input dataset). Automating code variant selection typically relies on machine learning to construct a model during an offline learning phase that can be quickly queried at runtime once the execution context is known. In this paper, we define a new approach called architecture-adaptive code variant tuning, where the variant selection model is learned on a set of source architectures, and then used to predict variants on a new target architecture without having to repeat the training process. We pose this as a multi-task learning problem, where each source architecture corresponds to a task; we use device features in the construction of the variant selection model. This work explores the effectiveness of multi-task learning and the impact of different strategies for device feature selection. We evaluate our approach on a set of benchmarks and a collection of six NVIDIA GPU architectures from three distinct generations. We achieve performance results that are mostly comparable to the previous approach of tuning for a single GPU architecture without having to repeat the learning phase.
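The approach described above can be illustrated with a small sketch. This is a hypothetical toy model, not the paper's actual algorithm: training samples from several source GPUs (the tasks) are pooled, each sample's input features concatenated with the device features of the GPU it was measured on, and a simple 1-nearest-neighbor classifier stands in for the learned variant selection model. A new target architecture is handled by querying with its own device features, without retraining. The feature names and variant labels are invented for illustration.

```python
import math

def predict_variant(train, target_device, input_feats):
    """Pick the best code variant for a new target architecture.

    train: list of (device_feats, input_feats, best_variant) tuples,
           collected offline on the source architectures.
    target_device: device features of the (unseen) target GPU.
    input_feats: features of the input dataset at runtime.
    """
    query = list(target_device) + list(input_feats)
    def dist(sample):
        dev, inp, _ = sample
        # Each training point lives in the joint (device, input) feature space.
        return math.dist(query, list(dev) + list(inp))
    return min(train, key=dist)[2]

# Toy training data from two source GPUs.
# Device features: (SM count, memory bandwidth GB/s); input feature: (size,).
train = [
    ((16, 200), (1e3,), "variant_A"),
    ((16, 200), (1e6,), "variant_B"),
    ((56, 700), (1e3,), "variant_A"),
    ((56, 700), (1e6,), "variant_C"),
]

# A new target GPU whose device features resemble the second source GPU:
# no retraining is needed, only a query with its device features.
print(predict_variant(train, (52, 650), (1e6,)))  # -> variant_C
```

In practice the device features would come from hardware queries or microbenchmarks, and the nearest-neighbor model would be replaced by the multi-task learner; the point of the sketch is only that device features let one model generalize across architectures.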




Published in

ACM SIGPLAN Notices, Volume 51, Issue 4 (ASPLOS '16), April 2016, 774 pages.
ISSN: 0362-1340; EISSN: 1558-1160; DOI: 10.1145/2954679
Editor: Andy Gill

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, March 2016, 824 pages.
ISBN: 9781450340915; DOI: 10.1145/2872362
General Chair: Tom Conte; Program Chair: Yuanyuan Zhou

Copyright © 2016 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
