research-article

Congestion Control for Large-Scale RDMA Deployments

Published: 17 August 2015

    Abstract

    Modern datacenter applications demand high throughput (40 Gbps) and ultra-low latency (< 10 μs per hop) from the network, with low CPU overhead. Standard TCP/IP stacks cannot meet these requirements, but Remote Direct Memory Access (RDMA) can. On IP-routed datacenter networks, RDMA is deployed using the RoCEv2 protocol, which relies on Priority-based Flow Control (PFC) to enable a drop-free network. However, PFC can lead to poor application performance due to problems like head-of-line blocking and unfairness. To alleviate these problems, we introduce DCQCN, an end-to-end congestion control scheme for RoCEv2. To optimize DCQCN performance, we build a fluid model and provide guidelines for tuning switch buffer thresholds and other protocol parameters. Using a 3-tier Clos network testbed, we show that DCQCN dramatically improves throughput and fairness of RoCEv2 RDMA traffic. DCQCN is implemented in Mellanox NICs, and is being deployed in Microsoft's datacenters.

    Supplementary Material

    WEBM File (p523-zhu.webm)

      Published In

      ACM SIGCOMM Computer Communication Review, Volume 45, Issue 4
      SIGCOMM'15
      October 2015
      659 pages
      ISSN: 0146-4833
      DOI: 10.1145/2829988
      • SIGCOMM '15: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication
        August 2015
        684 pages
        ISBN: 9781450335423
        DOI: 10.1145/2785956
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 17 August 2015
      Published in SIGCOMM-CCR Volume 45, Issue 4

      Author Tags

      1. ECN
      2. PFC
      3. RDMA
      4. congestion control
      5. datacenter transport

      Qualifiers

      • Research-article

      Cited By

      • (2024) Symmetrical Data Recovery: FPGA-Based Multi-Dimensional Elastic Recovery Acceleration for Multiple Block Failures in Ceph Systems. Symmetry 16:6 (672). DOI: 10.3390/sym16060672. Online publication date: 30-May-2024
      • (2024) Congestion Control Mechanism Based on Backpressure Feedback in Data Center Networks. Future Internet 16:4 (131). DOI: 10.3390/fi16040131. Online publication date: 15-Apr-2024
      • (2024) Accurate and fast congestion feedback in MEC-enabled RDMA datacenters. Journal of Cloud Computing: Advances, Systems and Applications 13:1. DOI: 10.1186/s13677-024-00642-8. Online publication date: 23-Mar-2024
      • (2024) Configuring and Coordinating End-to-end QoS for Emerging Storage Infrastructure. ACM Transactions on Modeling and Performance Evaluation of Computing Systems 9:1 (1-32). DOI: 10.1145/3631606. Online publication date: 15-Jan-2024
      • (2024) Unison: A Parallel-Efficient and User-Transparent Network Simulation Kernel. Proceedings of the Nineteenth European Conference on Computer Systems (115-131). DOI: 10.1145/3627703.3629574. Online publication date: 22-Apr-2024
      • (2024) PolarDB-MP: A Multi-Primary Cloud-Native Database via Disaggregated Shared Memory. Companion of the 2024 International Conference on Management of Data (295-308). DOI: 10.1145/3626246.3653377. Online publication date: 9-Jun-2024
      • (2024) Alarm: An Adaptive Routing Algorithm Based on One-Way Delay for Infiniband. IEEE Transactions on Network Science and Engineering 11:4 (3653-3666). DOI: 10.1109/TNSE.2024.3382295. Online publication date: Jul-2024
      • (2024) FlowStar: Fast Convergence Per-Flow State Accurate Congestion Control for InfiniBand. IEEE/ACM Transactions on Networking 32:3 (2662-2674). DOI: 10.1109/TNET.2024.3363658. Online publication date: Jun-2024
      • (2024) PACC: A Proactive CNP Generation Scheme for Datacenter Networks. IEEE/ACM Transactions on Networking 32:3 (2586-2599). DOI: 10.1109/TNET.2024.3361771. Online publication date: Jun-2024
      • (2024) AStore: Uniformed Adaptive Learned Index and Cache for RDMA-enabled Key-Value Store. IEEE Transactions on Knowledge and Data Engineering (1-18). DOI: 10.1109/TKDE.2024.3355100. Online publication date: 2024
