Abstract
Exploitation of parallel architectures has become critical to scalable machine learning (ML). Since a wide range of ML algorithms employ linear algebraic operators, GPUs with BLAS libraries are a natural choice for such exploitation. Two approaches are commonly pursued: (i) developing GPU-accelerated implementations of complete ML algorithms; and (ii) developing GPU kernels for primitive linear algebraic operators, such as matrix-vector multiplication, which are then composed into ML algorithms. This paper extends the latter approach by developing fused kernels for combinations of primitive operators that commonly occur together in popular ML algorithms. We identify the generic pattern of computation alpha * X^T (v * (X * y)) + beta * z and its various instantiations. We develop a fused kernel to optimize this computation on GPUs, with specialized techniques to handle both sparse and dense matrices. This approach not only reduces the cost of data loads through improved temporal locality but also enables further optimizations such as coarsening and hierarchical aggregation of partial results. We also present an analytical model that considers input data characteristics and available GPU resources to estimate near-optimal settings for kernel launch parameters. The proposed approach provides speedups ranging from 2x to 67x for different instances of the generic pattern compared to launching multiple operator-level kernels using GPU-accelerated libraries. We conclude by demonstrating the effectiveness of the approach in improving end-to-end performance on an entire ML algorithm.
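To make the generic pattern concrete, the following is a minimal NumPy sketch (not the paper's CUDA implementation) that contrasts the unfused evaluation, which materializes an intermediate vector per primitive operator just as separate GPU kernel launches would, with a fused single pass over the rows of X. The shapes, random data, and the assumption that v * denotes element-wise multiplication are illustrative choices, not taken from the paper.

```python
import numpy as np

# Illustrative sizes: X is an n-by-m data matrix (assumed shapes).
rng = np.random.default_rng(0)
n, m = 1000, 20
X = rng.standard_normal((n, m))
y = rng.standard_normal(m)
v = rng.standard_normal(n)   # element-wise scaling vector (assumption)
z = rng.standard_normal(m)
alpha, beta = 2.0, 0.5

# Unfused: each primitive operator runs separately and writes
# its intermediate result to memory, analogous to launching one
# operator-level kernel per step.
t = X @ y                          # matrix-vector multiply
t = v * t                          # element-wise scaling
r_unfused = alpha * (X.T @ t) + beta * z

# Fused: row i of X contributes alpha * v[i] * (x_i . y) * x_i
# to the result, so each row is loaded once and reused for both
# the forward (X * y) and transposed (X^T * ...) products.
r_fused = beta * z
for i in range(n):
    r_fused = r_fused + alpha * v[i] * (X[i] @ y) * X[i]

assert np.allclose(r_unfused, r_fused)
```

The fusion opportunity shown here is exactly the temporal-locality argument from the abstract: the unfused version reads X twice (once for X * y, once for X^T * t), while the fused loop reads each row once.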
On optimizing machine learning workloads via kernel fusion
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming