Abstract
Latency-critical applications suffer from both average performance degradation and reduced completion-time predictability when collocated with batch tasks. Such variation forces the system to overprovision resources to ensure Quality of Service (QoS) for latency-critical tasks, degrading overall system throughput. We explore the causes of this variation and exploit the opportunity to mitigate it directly, improving both QoS and utilization at the same time. We develop, implement, and evaluate Dirigent, a lightweight performance-management runtime system that accurately controls the QoS of latency-critical applications at fine time scales by leveraging existing architectural mechanisms. We evaluate Dirigent on a real machine and show that it is significantly more effective than configurations representative of prior schemes.
Dirigent: Enforcing QoS for Latency-Critical Tasks on Shared Multicore Systems
ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems