Abstract
Virtual-machine-based approaches to workload consolidation, as seen in IaaS clouds and datacenter platforms, have long had to contend with performance degradation caused by synchronization primitives inside guest environments. These primitives are vulnerable to virtual CPU (vCPU) preemptions by the host scheduler, which can introduce delays orders of magnitude longer than those the primitives were designed for. While a significant amount of work has focused on spinlocks as a source of these performance issues, spinlocks are not the only synchronization mechanism susceptible to scheduling issues in a virtualized environment. In this paper we address the virtualized performance issues introduced by TLB shootdown operations. Our profiling study, based on the PARSEC benchmark suite, shows that under certain workloads up to 64% of a VM's CPU time can be spent on TLB shootdown operations. To address this problem, we present a paravirtual TLB shootdown scheme named Shoot4U. Shoot4U completely eliminates TLB shootdown preemptions by invalidating guest TLB entries from the VMM, allowing guest TLB shootdown operations to complete without waiting for remote vCPUs to be scheduled. Our performance evaluation using the PARSEC benchmark suite demonstrates that Shoot4U can reduce benchmark runtime by up to 85% compared to an unmodified Linux kernel, and by up to 44% compared to a state-of-the-art paravirtual TLB shootdown scheme.
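The core contrast above can be illustrated with a toy latency model. This is a minimal sketch, not Shoot4U's implementation. It assumes that the initiator of an IPI-based shootdown must wait for every target vCPU to acknowledge, that a preempted vCPU can only acknowledge after the host scheduler runs it again, and that a VMM-assisted flush costs one fixed-price hypercall regardless of target state. The names `VCpu`, `ipi_shootdown_latency`, and `vmm_assisted_shootdown_latency`, as well as all cost values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VCpu:
    vcpu_id: int
    running: bool
    # Hypothetical: microseconds until the host scheduler runs this vCPU again.
    resched_delay_us: float = 0.0


def ipi_shootdown_latency(targets, handler_us=2.0):
    """IPI-style shootdown: the initiator waits until every target vCPU has
    run its flush handler and acknowledged. Handlers run in parallel, so the
    latency is the slowest target. A preempted target cannot acknowledge
    until it is rescheduled."""
    worst = 0.0
    for v in targets:
        wait = 0.0 if v.running else v.resched_delay_us
        worst = max(worst, wait + handler_us)
    return worst


def vmm_assisted_shootdown_latency(targets, hypercall_us=5.0):
    """VMM-assisted shootdown (assumed model): the initiator issues a single
    hypercall and the VMM invalidates the targets' guest TLB entries itself,
    so completion does not depend on whether any target is scheduled."""
    return hypercall_us if targets else 0.0


if __name__ == "__main__":
    # One of three targets is preempted for a hypothetical 3 ms.
    targets = [VCpu(1, True), VCpu(2, False, 3000.0), VCpu(3, True)]
    print("IPI-style (us):", ipi_shootdown_latency(targets))
    print("VMM-assisted (us):", vmm_assisted_shootdown_latency(targets))
```

In this model a single preempted target dominates the IPI-style latency, while the VMM-assisted latency stays at the hypercall cost. That is the effect the abstract describes: shootdown completion no longer waits on remote vCPUs being scheduled.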
Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs
VEE '16: Proceedings of the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments