Abstract
Offloading computations to multiple GPUs is not an easy task. It requires decomposing data, distributing computations, and handling communication manually. Drop-in GPU libraries have made it easy to offload computations to multiple GPUs by hiding this complexity inside library calls. However, such encapsulation prevents data reuse between successive kernel invocations, resulting in redundant communication. This limitation exists in multi-GPU libraries like CUBLASXT. In this paper, we introduce SemCache++, a semantics-aware GPU cache that automatically manages communication between the CPU and multiple GPUs and optimizes that communication by using caching to eliminate redundant transfers. SemCache++ is used to build the first multi-GPU drop-in replacement library that (a) uses virtual memory to automatically manage and optimize multi-GPU communication and (b) requires no program rewriting or annotations. Our caching technique is efficient: it uses a two-level caching directory to track matrices and sub-matrices. Experimental results show that our system can eliminate redundant communication and deliver significant performance improvements over multi-GPU libraries like CUBLASXT.
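To make the caching idea concrete, the following is a minimal sketch of a two-level caching directory such as the one the abstract describes: a first level keyed by matrix, a second level keyed by sub-matrix block, with each block tracking which devices hold a valid copy. All names here (`CacheDirectory`, `offload`, `cpu_write`, device labels) are illustrative assumptions, not SemCache++'s actual API; the real system tracks memory regions via virtual memory rather than explicit calls.

```python
class SubMatrixEntry:
    """Level-2 entry: coherence state for one sub-matrix block."""
    def __init__(self):
        self.valid_on = set()  # devices with an up-to-date copy, e.g. "cpu", "gpu0"

class CacheDirectory:
    """Level-1 directory: matrix id -> (block id -> SubMatrixEntry)."""
    def __init__(self):
        self.matrices = {}

    def _entry(self, matrix_id, block_id):
        blocks = self.matrices.setdefault(matrix_id, {})
        return blocks.setdefault(block_id, SubMatrixEntry())

    def offload(self, matrix_id, block_id, gpu):
        """Return True if a host-to-device transfer is needed,
        False if the cached copy on that GPU can be reused."""
        entry = self._entry(matrix_id, block_id)
        if gpu in entry.valid_on:
            return False           # cache hit: redundant transfer eliminated
        entry.valid_on.add(gpu)    # record the copy created by the transfer
        entry.valid_on.add("cpu")  # source copy remains valid
        return True

    def cpu_write(self, matrix_id, block_id):
        """A CPU write invalidates all GPU copies of the block."""
        self._entry(matrix_id, block_id).valid_on = {"cpu"}

d = CacheDirectory()
assert d.offload("A", (0, 0), "gpu0") is True    # first use: transfer
assert d.offload("A", (0, 0), "gpu0") is False   # reuse on same GPU: no transfer
assert d.offload("A", (0, 0), "gpu1") is True    # different GPU: transfer needed
d.cpu_write("A", (0, 0))
assert d.offload("A", (0, 0), "gpu0") is True    # invalidated copy: transfer again
```

The sub-matrix granularity matters for multi-GPU use: a library that tiles a matrix across GPUs can reuse individual blocks per device instead of invalidating or re-sending the whole matrix.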