Abstract
Graphics processing units (GPUs) are increasingly used for high-performance computing, and programming frameworks for general-purpose computing on GPUs (GPGPU), such as CUDA and OpenCL, are maturing. Driving this trend is the recent proliferation of mobile devices such as smartphones and wearable computers. These devices increasingly run computationally intensive applications that involve some form of environmental recognition, such as augmented reality (AR) or voice recognition. However, devices with low computational power cannot satisfy such demanding computing requirements. The CPU load of these devices could be reduced by offloading computation onto GPUs in the cloud. This paper presents GPUrpc, a remote procedure call (RPC) extension to Gdev, which is a rich set of runtime libraries and device drivers for achieving first-class GPU resource management. GPUrpc allows developers to use CUDA for GPGPU development work. Existing research uses RPCs based on the CUDA application programming interfaces (APIs); hence, all CUDA API calls require communication. To reduce communication overhead, we use an RPC based on a lower-level API than the CUDA API and reduce the number of API calls that require communication. Our evaluation, conducted on Linux with NVIDIA GPUs, shows that the basic performance of our prototype implementation is reliable in comparison with the existing method. An evaluation using the Rodinia benchmark suite, which is designed for research in heterogeneous parallel computing, shows that GPUrpc is effective for applications such as image processing and data mining. GPUrpc can also reduce power consumption to approximately 1/6 that of CPU processing when performing 512 × 512 matrix multiplication.
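The communication argument in the abstract can be illustrated with a toy model. The sketch below is not GPUrpc, Gdev, or the CUDA API: the call names, the workload, and the choice of which calls are served locally are all hypothetical, chosen only to show how forwarding fewer calls lowers the number of network round trips.

```python
# Toy model only: call names and the local/remote split are hypothetical,
# not taken from GPUrpc, Gdev, or the CUDA API.

class CountingTransport:
    """Stands in for the network; counts one round trip per forwarded call."""

    def __init__(self):
        self.round_trips = 0

    def call(self, name, *args):
        self.round_trips += 1
        return f"remote:{name}"


class ForwardEverythingStub:
    """Client stub that forwards every API call to the server."""

    def __init__(self, transport):
        self.transport = transport

    def invoke(self, name, *args):
        return self.transport.call(name, *args)


class SelectiveStub:
    """Client stub that forwards only calls that must reach the remote GPU."""

    def __init__(self, transport, local_calls):
        self.transport = transport
        self.local_calls = set(local_calls)

    def invoke(self, name, *args):
        if name in self.local_calls:
            # Served on the client without touching the network.
            return f"local:{name}"
        return self.transport.call(name, *args)


# Hypothetical sequence of calls made by one offloaded computation.
WORKLOAD = ["init", "query_config", "alloc", "copy_in", "launch", "copy_out", "free"]


def count_round_trips(make_stub):
    """Run the workload through a stub and return the round trips it caused."""
    transport = CountingTransport()
    stub = make_stub(transport)
    for name in WORKLOAD:
        stub.invoke(name)
    return transport.round_trips


forward_all = count_round_trips(ForwardEverythingStub)
selective = count_round_trips(lambda t: SelectiveStub(t, {"init", "query_config"}))
print(forward_all, selective)
```

In this model the forward-everything stub pays one round trip per call, while the selective stub pays only for the calls outside its assumed local set. Which calls can actually be handled without communication depends on the real API layer and is not specified here.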
Index Terms
GPUrpc: Exploring Transparent Access to Remote GPUs