research-article

Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis

Published: 17 August 2015
  • Abstract

    Can we get network latency between any two servers at any time in large-scale data center networks? The collected latency data can then be used to address a series of challenges: telling whether an application-perceived latency issue is caused by the network, defining and tracking network service level agreements (SLAs), and performing automatic network troubleshooting. We have developed the Pingmesh system for large-scale data center network latency measurement and analysis to answer the above question affirmatively. Pingmesh has been running in Microsoft data centers for more than four years, and it collects tens of terabytes of latency data per day. Pingmesh is widely used not only by network software developers and engineers, but also by application and service developers and operators.

    Supplementary Material

    WEBM File (p139-guo.webm)


    Information

    Published In

    ACM SIGCOMM Computer Communication Review, Volume 45, Issue 4
    SIGCOMM'15
    October 2015
    659 pages
    ISSN:0146-4833
    DOI:10.1145/2829988
    • SIGCOMM '15: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication
      August 2015
      684 pages
      ISBN:9781450335423
      DOI:10.1145/2785956
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 August 2015
    Published in SIGCOMM-CCR Volume 45, Issue 4


    Author Tags

    1. data center networking
    2. network troubleshooting
    3. silent packet drops

    Qualifiers

    • Research-article

