Abstract
NUMA (non-uniform memory access) servers are commonly used in high-performance computing and datacenters. Within each server, a processor-interconnect (e.g., Intel QPI, AMD HyperTransport) is used to communicate between the different sockets or nodes. In this work, we explore the impact of the processor-interconnect on overall performance -- in particular, the performance unfairness caused by processor-interconnect arbitration. It is well known that locally-fair arbitration does not guarantee globally-fair bandwidth sharing as closer nodes receive more bandwidth in a multi-hop network. However, this work demonstrates that the opposite can occur in a commodity NUMA server where remote nodes receive higher bandwidth (and perform better). We analyze this problem and identify that this occurs because of external concentration used in router micro-architectures for processor-interconnects without globally-aware arbitration. While accessing remote memory can occur in any NUMA system, performance unfairness (or performance variation) is more critical in cloud computing and virtual machines with shared resources. We demonstrate how this unfairness creates significant performance variation when a workload is executed on the Xen virtualization platform. We then provide analysis using synthetic workloads to better understand the source of unfairness and eliminate the impact of other shared resources, including the shared last-level cache and main memory. To provide fairness, we propose a novel, history-based arbitration that tracks the history of arbitration grants made in the previous history window. A weighted arbitration is done based on the history to provide global fairness. Through simulations, we show our proposed history-based arbitration can provide global fairness and minimize the processor-interconnect performance unfairness at low cost.
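The idea of arbitrating from a history window can be illustrated with a small sketch. The abstract does not specify the hardware mechanism, so everything below is an assumption made for illustration: the class name `HistoryArbiter`, the `window` parameter (counted in grants), and the policy itself, which grants the requester with the fewest grants over the previous and current windows and breaks ties round-robin. It is a toy model of history-weighted arbitration, not the design evaluated in the paper.

```python
from collections import Counter


class HistoryArbiter:
    """Toy arbiter (hypothetical): favors requesters with fewer recent grants."""

    def __init__(self, num_inputs, window):
        self.num_inputs = num_inputs
        self.window = window        # grants per history window (assumed parameter)
        self.prev = Counter()       # grant counts from the previous window
        self.curr = Counter()       # grant counts in the window being filled
        self.grants_in_window = 0
        self.rr_ptr = 0             # round-robin pointer, used only to break ties

    def arbitrate(self, requests):
        """Return the winning input id among `requests`, or None if empty."""
        reqs = set(requests)
        if not reqs:
            return None
        # Visit inputs starting at the round-robin pointer so that ties
        # in the history count fall back to plain round-robin order.
        order = [(self.rr_ptr + i) % self.num_inputs
                 for i in range(self.num_inputs)]
        candidates = [i for i in order if i in reqs]
        # min() keeps the first minimal element, preserving the tie-break order.
        winner = min(candidates, key=lambda i: self.prev[i] + self.curr[i])
        self.curr[winner] += 1
        self.rr_ptr = (winner + 1) % self.num_inputs
        self.grants_in_window += 1
        if self.grants_in_window == self.window:
            # Window is full: it becomes the history for the next window.
            self.prev, self.curr = self.curr, Counter()
            self.grants_in_window = 0
        return winner


if __name__ == "__main__":
    arb = HistoryArbiter(num_inputs=3, window=4)
    for _ in range(4):
        arb.arbitrate([0])          # input 0 monopolizes one full window
    # Inputs 1 and 2 are now favored because input 0 dominates the history.
    print([arb.arbitrate([0, 1, 2]) for _ in range(4)])  # → [1, 2, 1, 2]
```

In this sketch an input that dominated the previous window is deprioritized until the others catch up, whereas under uniform load the arbiter degenerates to round-robin. A hardware arbiter would presumably use small saturating counters rather than unbounded counts; that detail is likewise an assumption.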
History-Based Arbitration for Fairness in Processor-Interconnect of NUMA Servers
ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems