Abstract
Despite the popularity of GPUs in high-performance and scientific computing, and despite increasingly general-purpose hardware capabilities, the use of GPUs in network servers or distributed systems poses significant challenges.
GPUnet is a native GPU networking layer that provides a socket abstraction and high-level networking APIs for GPU programs. We use GPUnet to streamline the development of high-performance, distributed applications like in-GPU-memory MapReduce and a new class of low-latency, high-throughput GPU-native network services such as a face verification server.
GPUnet: Networking Abstractions for GPU Programs