ABSTRACT
The traffic load sent to key-value (KV) stores varies over timescales ranging from hours down to a few microseconds. Long-term variations present an opportunity to save power during periods of low or medium utilization. Several techniques exist to save power in servers, including feedback-based controllers that right-size the number of allocated CPU cores, dynamic voltage and frequency scaling (DVFS), and C-state (idle-state) mechanisms. In this paper, we demonstrate that existing power-saving techniques are not effective for KV stores, because the high rate of traffic, even under low load, prevents the system from entering low-power states for extended periods of time. To achieve power savings, we must unbalance the load among the CPU cores so that some of them can enter low-power states during periods of low load. We accomplish this by introducing the notion of in-application CPU scheduling. Instead of relying on the kernel to schedule threads, we pin threads to cores to bypass the kernel CPU scheduler and then perform the scheduling within the KV store application. Our design, Peafowl, is a KV store with an in-application CPU scheduler that monitors the load to learn the workload characteristics and then reduces the number of active CPU cores when the load drops, leading to notable power savings during periods of low or medium utilization. Our experiments demonstrate that Peafowl uses up to 40-54% less power than state-of-the-art approaches such as Rubik and μDPM.
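As a rough illustration of the idea of unbalancing load onto fewer cores, the following minimal Python sketch picks a number of active cores from an observed request rate and routes requests only to those cores, so the remaining cores receive no work and can idle. It is not Peafowl's actual scheduler: the function names, the per-core capacity figure, and the `target_util` headroom parameter are all hypothetical, and thread pinning itself is omitted.

```python
import math


def active_cores_needed(load_rps, per_core_capacity_rps, total_cores, target_util=0.5):
    """Return the smallest core count that keeps each active core at or
    below target_util of its (assumed) capacity. All parameters are
    illustrative assumptions, not values from the paper."""
    if load_rps <= 0:
        return 1  # always keep one core awake to accept new requests
    needed = math.ceil(load_rps / (per_core_capacity_rps * target_util))
    return max(1, min(total_cores, needed))


def dispatch(request_id, active):
    """Map a request onto one of the first `active` cores only.
    Cores with an index >= active get no requests and may enter idle states."""
    return request_id % active


if __name__ == "__main__":
    cores = active_cores_needed(load_rps=100_000, per_core_capacity_rps=50_000, total_cores=8)
    print("active cores:", cores)
    print("core for request 10:", dispatch(10, cores))
```

The design point the sketch tries to capture is that concentrating requests on a subset of cores leaves the others without work for long stretches, whereas spreading the same load evenly would interrupt every core frequently.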
Peafowl: in-application CPU scheduling to reduce power consumption of in-memory key-value stores




