Abstract
The large variety of compute-heavy and data-driven applications accelerates the need for a distributed I/O solution that enables cost-effective scaling of resources between networked hosts. For example, in a cluster system, different machines may have different devices available at different times, but moving workloads to remote units over the network is often costly and introduces large overheads compared to accessing local resources. To facilitate I/O disaggregation and device sharing among hosts connected using Peripheral Component Interconnect Express (PCIe) non-transparent bridges, we present SmartIO. NVMe drives, GPUs, network adapters, or any other standard PCIe device may be borrowed and accessed directly by remote machines, as if the devices were local to them. We provide capabilities beyond existing disaggregation solutions by combining traditional I/O with distributed shared-memory functionality, allowing devices to become part of the same global address space as cluster applications. Software is entirely removed from the data path, and a device can be shared simultaneously among application processes running on remote hosts. Our experimental results show that I/O devices can be shared with remote hosts while achieving native PCIe performance. Thus, compared to existing device distribution mechanisms, SmartIO provides more efficient, low-cost resource sharing, increasing the overall system performance.
Index Terms
SmartIO: Zero-overhead Device Sharing through PCIe Networking