research-article

Improving the Performance of Heterogeneous Data Centers through Redundancy

Published: 15 June 2021

Abstract

We analyze the performance of redundancy in a system with multiple job types and multiple server types. We assume the job dispatcher is unaware of the servers' capacities, and we study under which circumstances redundancy improves performance. With redundancy, an arriving job dispatches copies to all its compatible servers and departs as soon as one of its copies completes service. As a benchmark, we take the non-redundant system in which an arriving job is routed to a single, randomly selected compatible server. Service times are generally distributed and all copies of a job are identical, i.e., have the same service requirement. In our first main result, we characterize the necessary and sufficient stability condition of the redundancy system. This condition coincides with that of a system where each job type dispatches copies only to its least-loaded servers, and those copies need to be fully served. In our second result, we compare the stability region of the system under redundancy with that of the system without redundancy. We show that if the servers' capacities are sufficiently heterogeneous, the stability region under redundancy can be much larger than that without redundancy. We apply the general solution to particular classes of systems, including redundancy-d and nested models, to derive simple conditions on the degree of heterogeneity required for redundancy to improve stability. As such, our result is the first to show that redundancy can improve the stability, and hence the performance, of a system when copies are non-i.i.d.
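
The two dispatching rules compared in the abstract can be illustrated with a minimal sketch. It assumes a single job in an otherwise empty system, so queueing and the scheduling discipline play no role; it is not the paper's stability analysis. The function names and the numeric capacities are hypothetical, chosen only for illustration.

```python
def redundancy_completion_time(size, capacities):
    """Completion time of a lone job under redundancy.

    Identical copies (same service requirement `size`) start on every
    compatible server; the job departs when the first copy finishes,
    which happens on the fastest compatible server.
    """
    return min(size / mu for mu in capacities)


def expected_random_routing_time(size, capacities):
    """Expected completion time of a lone job without redundancy.

    The job is routed to one compatible server chosen uniformly at
    random, so we average size / mu over the compatible servers.
    """
    return sum(size / mu for mu in capacities) / len(capacities)


# Hypothetical example: one job of size 2.0, two compatible servers
# with heterogeneous capacities 1.0 and 4.0.
capacities = [1.0, 4.0]
t_red = redundancy_completion_time(2.0, capacities)
t_rand = expected_random_routing_time(2.0, capacities)
print(t_red, t_rand)
```

In this toy setting the dispatcher never needs to know the capacities: redundancy finds the fastest server automatically, which hints at why heterogeneity favours it. Under load, the extra copies also consume capacity on the slower servers, and that trade-off is what the stability results quantify.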
