Abstract
We analyze the performance of redundancy in a system with multiple job types and multiple server types. We assume the job dispatcher is unaware of the servers' capacities, and we study under which circumstances redundancy improves performance. Under redundancy, an arriving job dispatches copies to all of its compatible servers and departs as soon as one of its copies completes service. As a benchmark, we take the non-redundant system in which an arriving job is routed to a single randomly selected compatible server. Service times are generally distributed and all copies of a job are identical, i.e., they have the same service requirement. In our first main result, we characterize the necessary and sufficient stability condition of the redundancy system. This condition coincides with that of a system in which each job type dispatches copies only to its least-loaded servers, and those copies need to be fully served. In our second result, we compare the stability region of the system under redundancy with that of the system without redundancy. We show that if the servers' capacities are sufficiently heterogeneous, the stability region under redundancy can be much larger than that without redundancy. We apply the general solution to particular classes of systems, including redundancy-d and nested models, to derive simple conditions on the degree of heterogeneity required for redundancy to improve stability. As such, our result is the first to show that redundancy can improve the stability, and hence the performance, of a system when copies are non-i.i.d.
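To make the benchmark concrete, the following is a minimal sketch of the non-redundant system only, not the paper's redundancy condition. It assumes that "randomly selected" means uniformly at random among compatible servers, that arrivals of each job type occur at a fixed rate, and that each server is work-conserving, so a server is stable when its offered load is below its capacity. The function names, the dictionary layout, and the numerical values are hypothetical and chosen purely for illustration.

```python
from typing import Dict, List


def benchmark_server_loads(
    arrival_rates: Dict[str, float],
    compatible: Dict[str, List[str]],
    capacities: Dict[str, float],
    mean_size: float = 1.0,
) -> Dict[str, float]:
    """Per-server load when each job is routed to one compatible server.

    Assumption: the server is picked uniformly at random, so a job type
    with arrival rate lam and k compatible servers sends rate lam / k
    to each of them.
    """
    work_rate = {server: 0.0 for server in capacities}
    for job_type, lam in arrival_rates.items():
        servers = compatible[job_type]
        for server in servers:
            work_rate[server] += lam * mean_size / len(servers)
    # Load = work arriving per unit time divided by the server's capacity.
    return {s: work_rate[s] / capacities[s] for s in capacities}


def benchmark_is_stable(
    arrival_rates: Dict[str, float],
    compatible: Dict[str, List[str]],
    capacities: Dict[str, float],
    mean_size: float = 1.0,
) -> bool:
    """The benchmark is stable only if every server has load below 1."""
    loads = benchmark_server_loads(arrival_rates, compatible, capacities, mean_size)
    return all(rho < 1.0 for rho in loads.values())


# Hypothetical example: one job type, two servers with capacities 1 and 4.
compatible = {"a": ["slow", "fast"]}
capacities = {"slow": 1.0, "fast": 4.0}

light = benchmark_server_loads({"a": 1.5}, compatible, capacities)
heavy_ok = benchmark_is_stable({"a": 2.5}, compatible, capacities)
```

In this hypothetical setting, an arrival rate of 2.5 is well below the total capacity of 5, yet the benchmark is unstable because the slow server receives half of the traffic. This is the kind of capacity heterogeneity under which the abstract states that redundancy can enlarge the stability region.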
Improving the Performance of Heterogeneous Data Centers through Redundancy
SIGMETRICS '21: Abstract Proceedings of the 2021 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems
We analyze the performance of redundancy in a multi-type job and multi-type server system where PS is implemented. We characterize the stability condition, which coincides with that of a system where each job type only dispatches copies into its least-...