Abstract
We consider a system with N parallel servers where each incoming job is immediately replicated to, say, d servers. Each of the N servers has its own queue and follows an FCFS discipline. As soon as the first replica of a job completes, the remaining replicas are abandoned. We investigate the achievable stability region for a quite general workload model with different job types and heterogeneous servers, reflecting job-server affinity relations which may arise from data locality issues and soft compatibility constraints. Under the assumption that job types are known beforehand, we show that for New-Better-than-Used (NBU) distributed speed variations, no replication $(d=1)$ gives a strictly larger stability region than replication $(d>1)$. Strikingly, this does not depend on the underlying distribution of the intrinsic job sizes, but observing the job types is essential for this statement to hold. When job types are not observable, we show that for New-Worse-than-Used (NWU) distributed speed variations, full replication $(d=N)$ gives a larger stability region than no replication $(d=1)$.
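The NBU/NWU contrast above can be given some rough intuition with a much simpler setting than the one analysed in the paper. The sketch below is an illustration under stated assumptions, not the paper's model: it assumes that the replicas of a job have i.i.d. Weibull service times, that all d replicas start at the same moment, and that there is no queueing. A Weibull distribution with shape above 1 is NBU, shape 1 is exponential, and shape below 1 is NWU. The function names are hypothetical. The quantity computed is the total server time used by d replicas until the first one finishes, relative to the mean service time of a single copy.

```python
import math


def weibull_mean(shape, scale=1.0):
    """Mean of a Weibull(shape, scale) random variable."""
    return scale * math.gamma(1.0 + 1.0 / shape)


def min_of_d_weibull_mean(shape, d, scale=1.0):
    """Mean of the minimum of d i.i.d. Weibull(shape, scale) variables.

    The minimum is again Weibull with the same shape and with
    scale multiplied by d ** (-1 / shape).
    """
    return weibull_mean(shape, scale * d ** (-1.0 / shape))


def server_time_ratio(shape, d):
    """Total server time of d simultaneous replicas, cancelled at the first
    completion, divided by the mean service time without replication.

    Assumption (illustrative only): i.i.d. replicas, no queueing.
    """
    return d * min_of_d_weibull_mean(shape, d) / weibull_mean(shape)


if __name__ == "__main__":
    for shape, label in [(2.0, "NBU"), (1.0, "exponential"), (0.5, "NWU")]:
        print(label, round(server_time_ratio(shape, d=2), 3))
```

In this toy setting the ratio simplifies to d ** (1 - 1/shape), so with d = 2 it is above 1 for shape 2 (replication uses more server capacity per job), exactly 1 for the exponential case, and below 1 for shape 0.5 (replication uses less). This only hints at the direction of the results stated in the abstract; the stability regions themselves depend on the job types, server heterogeneity, and queueing dynamics that the sketch leaves out.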
Achievable Stability in Redundancy Systems. SIGMETRICS '21: Abstract Proceedings of the 2021 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems.