ReQoS: reactive static/dynamic compilation for QoS in warehouse scale computers

Published: 16 March 2013

Abstract

As multicore processors with expanding core counts continue to dominate the server market, the overall utilization of the class of datacenters known as warehouse scale computers (WSCs) depends heavily on colocation of multiple workloads on each server to take advantage of the computational power provided by modern processors. However, many of the applications running in WSCs, such as websearch, are user-facing and have quality of service (QoS) requirements. When multiple applications are co-located on a multicore machine, contention for shared memory resources threatens application QoS as severe cross-core performance interference may occur. WSC operators are left with two options: either disregard QoS to maximize WSC utilization, or disallow the co-location of high-priority user-facing applications with other applications, resulting in low machine utilization and millions of dollars wasted.

This paper presents ReQoS, a static/dynamic compilation approach that enables low-priority applications to adaptively manipulate their own contentiousness to ensure the QoS of high-priority co-runners. ReQoS is composed of a profile-guided compilation technique that identifies and inserts markers in contentious code regions of low-priority applications, and a lightweight runtime that monitors the QoS of high-priority applications and reactively reduces the pressure low-priority applications place on the memory subsystem when cross-core interference is detected. In this work, we show that ReQoS can accurately diagnose contention and significantly reduce performance interference to ensure application QoS. Applying ReQoS to SPEC2006 and SmashBench workloads on real multicore machines, we are able to improve machine utilization by more than 70% in many cases, and more than 50% on average, while enforcing a 90% QoS threshold. We are also able to improve the energy efficiency of modern multicore machines by 47% on average over a policy of disallowing co-locations.
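The reactive runtime described above can be pictured as a simple control loop: periodically measure how close the high-priority application is to its solo-run performance, and raise or lower the throttling applied to the low-priority co-runner at its compiler-inserted markers. The sketch below is illustrative only, assuming the paper's 90% QoS threshold; `read_ipc`, the fixed 0.1 adjustment step, and the simulated counter value are hypothetical stand-ins, not the paper's implementation (which uses hardware performance monitoring on real machines).

```python
QOS_TARGET = 0.90  # required fraction of solo-run performance (the paper's 90% threshold)

def read_ipc(pid):
    """Placeholder for a hardware performance-counter read (e.g., instructions
    retired per cycle). Simulated here so the sketch runs standalone."""
    return 1.8

def reactive_throttle(hp_pid, solo_ipc, throttle_level=0.0):
    """One iteration of the reactive loop: compare the high-priority task's
    current IPC against its solo baseline, then adjust how aggressively the
    low-priority co-runner is slowed at its compiler-inserted markers."""
    qos = read_ipc(hp_pid) / solo_ipc
    if qos < QOS_TARGET:
        # QoS violated: back off the low-priority application harder.
        throttle_level = min(1.0, throttle_level + 0.1)
    else:
        # QoS met: relax throttling to reclaim machine utilization.
        throttle_level = max(0.0, throttle_level - 0.1)
    return qos, throttle_level
```

In a real deployment this loop would run continuously in the lightweight runtime, with the throttle level consumed by the markers compiled into the low-priority binary.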

