DOI: 10.1145/3419111.3421298
research-article

Peafowl: in-application CPU scheduling to reduce power consumption of in-memory key-value stores

Online: 12 October 2020

ABSTRACT

The traffic load sent to key-value (KV) stores varies over timescales ranging from hours down to a few microseconds. Long-term variations present an opportunity to save power during periods of low or medium utilization. Several techniques exist to save power in servers, including feedback-based controllers that right-size the number of allocated CPU cores, dynamic voltage and frequency scaling (DVFS), and c-state (idle-state) mechanisms. In this paper, we demonstrate that existing power-saving techniques are not effective for KV stores, because the high rate of traffic even under low load prevents the system from entering low-power states for extended periods of time. To achieve power savings, we must unbalance the load among the CPU cores so that some of them can enter low-power states during periods of low load. We accomplish this by introducing the notion of in-application CPU scheduling. Instead of relying on the kernel to schedule threads, we pin threads to bypass the kernel CPU scheduler and then perform the scheduling within the KV store application. Our design, Peafowl, is a KV store that features an in-application CPU scheduler that monitors the load to learn the workload characteristics and then scales down the number of active CPU cores when the load drops, leading to notable power savings during periods of low or medium utilization. Our experiments demonstrate that Peafowl uses up to 40-54% less power than state-of-the-art approaches such as Rubik and μDPM.
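The abstract does not spell out Peafowl's scheduling policy, so the following is only a minimal sketch of the general idea of unbalancing load: estimate how many cores the observed request rate needs, then route all requests to that subset so the remaining cores can stay idle. The function names, the `per_core_capacity` and `target_utilization` parameters, and the modulo-based routing are hypothetical illustrations, not the paper's method.

```python
import math


def active_cores_needed(request_rate, per_core_capacity, total_cores,
                        target_utilization=0.7):
    """Estimate how many cores should stay active for the observed load.

    Hypothetical policy: every parameter here is an illustrative
    assumption, not a value or formula taken from the paper.
    """
    if per_core_capacity <= 0:
        raise ValueError("per_core_capacity must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    needed = math.ceil(request_rate / (per_core_capacity * target_utilization))
    # Always keep at least one core awake; never exceed the machine size.
    return max(1, min(total_cores, needed))


def assign_core(request_id, active_cores):
    """Route a request to one of the first `active_cores` cores only.

    Cores with an index >= active_cores receive no work, so they can
    remain in a low-power idle state. On Linux, worker threads could be
    pinned to fixed cores with os.sched_setaffinity; that step is left
    out so the sketch runs on any platform.
    """
    return request_id % active_cores


if __name__ == "__main__":
    cores = active_cores_needed(request_rate=100_000,
                                per_core_capacity=50_000,
                                total_cores=8)
    used = sorted({assign_core(i, cores) for i in range(1000)})
    print(f"active cores: {cores}, cores receiving requests: {used}")
```

With a balanced assignment, every one of the eight cores would see a steady trickle of requests and none could sleep for long; concentrating the same load on a few cores is what leaves the others free to enter low-power states.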

  • Published in

    SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing
    October 2020
    535 pages
    ISBN: 9781450381376
    DOI: 10.1145/3419111

    Copyright © 2020 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Online: 12 October 2020
    • Published: 12 October 2020

    Qualifiers

    • research-article

    Acceptance Rates

    SoCC '20 Paper Acceptance Rate 35 of 143 submissions, 24%
    Overall Acceptance Rate 247 of 1,182 submissions, 21%
