
GPUnet: Networking Abstractions for GPU Programs

Published: 17 September 2016

Abstract

Despite the popularity of GPUs in high-performance and scientific computing, and despite increasingly general-purpose hardware capabilities, the use of GPUs in network servers or distributed systems poses significant challenges.

GPUnet is a native GPU networking layer that provides a socket abstraction and high-level networking APIs for GPU programs. We use GPUnet to streamline the development of high-performance, distributed applications like in-GPU-memory MapReduce and a new class of low-latency, high-throughput GPU-native network services such as a face verification server.
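To picture the socket abstraction the abstract describes, here is a minimal sketch of a GPU-native echo server. This is illustrative only, not GPUnet's actual API: the function names (`gpu_socket`, `gpu_accept`, `gpu_recv`, `gpu_send`, `gpu_close`) and the per-threadblock collective calling convention are assumptions made for this sketch.

```cuda
// Hypothetical sketch of a GPU-native echo server written against a
// GPUnet-style socket API. Every identifier prefixed gpu_ is invented
// for illustration and is not GPUnet's real interface. Socket calls
// are assumed to be collective: all threads of a threadblock invoke
// them together and cooperate on each operation.
__global__ void echo_server(int listen_port)
{
    __shared__ int srv, conn;
    __shared__ char buf[4096];

    // Open a listening socket directly from GPU code (hypothetical API).
    srv = gpu_socket(listen_port);

    while (true) {
        conn = gpu_accept(srv);   // block until a client connects

        // Receive and echo until the peer closes the connection.
        int n;
        while ((n = gpu_recv(conn, buf, sizeof(buf))) > 0)
            gpu_send(conn, buf, n);

        gpu_close(conn);
    }
}
```

The point of the abstraction is that the entire request loop runs on the GPU: data can move between the NIC and GPU memory (e.g., via RDMA) without staging through CPU memory or bouncing control back to a host thread for every I/O operation.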



• Published in

  ACM Transactions on Computer Systems, Volume 34, Issue 3
  September 2016, 103 pages
  ISSN: 0734-2071
  EISSN: 1557-7333
  DOI: 10.1145/2966277

  Copyright © 2016 ACM

  Publisher

  Association for Computing Machinery, New York, NY, United States

  Publication History

  • Received: 1 January 2016
  • Accepted: 1 June 2016
  • Published: 17 September 2016


  Qualifiers

  • Refereed research article
