Reining in Long Tails in Warehouse-Scale Computers with Quick Voltage Boosting Using Adrenaline

Published: 27 March 2017
Abstract

Reducing the long tail of the query latency distribution in modern warehouse-scale computers is critical for improving the performance and quality of service (QoS) of workloads such as Web Search and Memcached. Traditional turbo boost raises a processor’s voltage and frequency over a coarse-grained sliding window, boosting every query processed during that window. Because this technique cannot pinpoint the tail queries, its tail reduction benefit is limited. In this work, we propose Adrenaline, an approach that leverages finer-granularity (tens of nanoseconds) voltage boosting to rein in tail latency with query-level precision. Two key insights underlie this work. First, emerging finer-granularity voltage/frequency boosting is an enabling mechanism for intelligently allocating the power budget so that only the queries contributing to the tail latency are boosted. Second, per-query characteristics can be used to design indicators that proactively pinpoint these queries and trigger boosting accordingly. Based on these insights, Adrenaline pinpoints and boosts the queries that are likely to lengthen the tail of the latency distribution and that can reap more benefit from the voltage/frequency boost. We demonstrate the effectiveness of our methodology by evaluating it under various workload configurations. Given a fixed boosting power budget, we achieve up to a 2.50× tail latency improvement for Memcached and up to 3.03× for Web Search over coarse-grained dynamic voltage and frequency scaling (DVFS). When optimizing for energy reduction, Adrenaline achieves up to a 1.81× improvement for Memcached and up to 1.99× for Web Search over coarse-grained DVFS. With carefully chosen boost thresholds, Adrenaline further improves the tail latency reduction to 4.82× over coarse-grained DVFS.
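
The idea of boosting only the queries that threaten the tail can be illustrated with a toy model. The sketch below is not the paper's implementation. It assumes, purely for illustration, that a query runs at nominal frequency until it has been in service for `threshold_us` microseconds and that the rest of its work then runs `speedup` times faster. The `Query` class, the function names, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    kind: str        # hypothetical per-query label, used only for readability
    work_us: float   # service time at nominal frequency, in microseconds


def service_time(q: Query, threshold_us: float, speedup: float) -> float:
    """Service time when boosting starts after threshold_us of nominal execution."""
    if q.work_us <= threshold_us:
        return q.work_us                      # short query: never boosted
    remaining = q.work_us - threshold_us      # work still left when boosting starts
    return threshold_us + remaining / speedup


def boosted_time(q: Query, threshold_us: float, speedup: float) -> float:
    """Time the core spends at the boosted voltage/frequency for this query."""
    return max(0.0, q.work_us - threshold_us) / speedup


def percentile(values, p: int) -> float:
    """Nearest-rank percentile, p in 1..100, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100     # ceil(p * n / 100)
    return ordered[max(0, rank - 1)]


if __name__ == "__main__":
    # Made-up workload: 95 short queries and 5 long ones that form the tail.
    queries = [Query("short", 100.0)] * 95 + [Query("long", 1000.0)] * 5

    nominal = [q.work_us for q in queries]
    boosted = [service_time(q, 200.0, 2.0) for q in queries]

    print(percentile(nominal, 99))   # 1000.0
    print(percentile(boosted, 99))   # 600.0

    # Share of busy time spent boosted: a rough proxy for boost budget use.
    busy = sum(boosted)
    spent = sum(boosted_time(q, 200.0, 2.0) for q in queries)
    print(spent / busy)
```

In this toy workload only the five long queries ever cross the threshold, so the boost budget goes entirely to the queries that determine the 99th percentile. A window-based scheme would spend part of the same budget on short queries that do not affect the tail.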

