ABSTRACT
For the past two decades, the communication channel between the NIC and CPU has largely remained the same: issuing memory requests across a slow PCIe peripheral interconnect. Today, with application service times and network fabric delays measuring hundreds of nanoseconds, the NIC--CPU interface can account for most of the overhead when programming modern warehouse-scale computers.
In this paper, we tackle this issue head-on by proposing a design for a fast path between the NIC and CPU, called Lightning NIC (L-NIC). L-NIC deviates from the established norms of offloading computation onto the NIC (which inflates latency) or using centralized dispatcher cores for packet scheduling (which limits throughput). Instead, it adds a fast path from the network to the CPU core by writing and reading packets directly to and from the CPU register file. This approach minimizes network IO latency, providing significant performance improvements over traditional NIC--CPU interfaces.
The Case for a Network Fast Path to the CPU




