A scalable, commodity data center network architecture

Published: 17 August 2008
    Abstract

    Today's data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Non-uniform bandwidth among data center nodes complicates application design and limits overall system performance.
    In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.
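    The scale claimed above can be made concrete with a little arithmetic. The following is a minimal sketch, not the authors' code: it assumes the standard three-tier k-ary fat-tree construction built from identical k-port switches (k pods, k/2 edge and k/2 aggregation switches per pod, (k/2)^2 core switches), which the abstract itself does not spell out. The function names `fat_tree_size` and `usable_fraction` are hypothetical and chosen only for illustration.

```python
def fat_tree_size(k: int) -> dict:
    """Element counts for a three-tier fat-tree of identical k-port switches.

    Assumes the standard k-ary fat-tree layout (an assumption, not taken
    from the abstract): k pods, each with k/2 edge and k/2 aggregation
    switches, plus (k/2)^2 core switches.
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 2")
    half = k // 2
    edge = k * half            # k pods, k/2 edge switches per pod
    aggregation = k * half     # k pods, k/2 aggregation switches per pod
    core = half * half         # (k/2)^2 core switches
    hosts = k * half * half    # each edge switch serves k/2 hosts: k^3/4 total
    return {
        "pods": k,
        "edge": edge,
        "aggregation": aggregation,
        "core": core,
        "switches": edge + aggregation + core,  # 5k^2/4 in total
        "hosts": hosts,
    }


def usable_fraction(oversubscription: float) -> float:
    """Fraction of edge bandwidth usable at a given oversubscription ratio.

    A ratio of 1.0 means full bisection bandwidth; 2.0 (that is, 2:1)
    corresponds to the 50% figure quoted in the abstract.
    """
    if oversubscription < 1.0:
        raise ValueError("oversubscription ratio must be >= 1.0")
    return 1.0 / oversubscription


if __name__ == "__main__":
    print(fat_tree_size(48))
    print(usable_fraction(2.0))
```

    Under these assumptions, 48-port switches yield 27,648 hosts from 2,880 switches, which is consistent with the "tens of thousands of elements" the abstract targets.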

        Published In

        ACM SIGCOMM Computer Communication Review, Volume 38, Issue 4
        October 2008
        436 pages
        ISSN: 0146-4833
        DOI: 10.1145/1402946

        SIGCOMM '08: Proceedings of the ACM SIGCOMM 2008 conference on Data communication
        August 2008
        452 pages
        ISBN: 9781605581750
        DOI: 10.1145/1402958

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. data center topology
        2. equal-cost routing

        Qualifiers

        • Research-article
