Abstract
We consider multiple parallel Markov decision processes (MDPs) coupled by global constraints, where the time-varying objective and constraint functions can only be observed after the decision is made. Special attention is given to how well the decision maker can perform in T slots, starting from any state, compared to the best feasible randomized stationary policy in hindsight. We develop a new distributed online algorithm where each MDP makes its own decision each slot after observing a multiplier computed from past information. While this scenario is significantly more challenging than the classical online learning context, the algorithm is shown to achieve a tight O(√T) bound on both regret and constraint violations simultaneously. To obtain such a bound, we combine several new ingredients, including ergodicity and mixing time bounds for weakly coupled MDPs, a new regret analysis for online constrained optimization, a drift analysis for queue processes, and a perturbation analysis based on Farkas' Lemma.
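The core mechanism the abstract describes — each MDP acting on a shared multiplier built from past constraint feedback — can be illustrated with a minimal toy sketch. This is not the paper's algorithm: it drops the Markov state dynamics entirely, assumes a single global constraint, and uses a random cost model; the names (`V`, `Q`, `f_prev`, `g_prev`) and the greedy rule are illustrative assumptions, shown only to convey how a virtual-queue multiplier couples otherwise independent decisions when costs are revealed after each decision.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000          # horizon (number of slots)
K = 3             # number of parallel decision processes
A = 4             # actions available to each process
V = np.sqrt(T)    # penalty weight trading off regret vs. violation

Q = 0.0                       # virtual queue (multiplier) for one global constraint
f_prev = np.zeros((K, A))     # last observed per-process objective costs
g_prev = np.zeros((K, A))     # last observed per-process constraint costs

total_cost = 0.0
total_violation = 0.0
for t in range(T):
    # Each process sees only the common multiplier Q and acts greedily
    # on the PREVIOUS slot's costs: the true costs for slot t are
    # revealed only after the decisions are made.
    actions = np.argmin(V * f_prev + Q * g_prev, axis=1)

    # Adversary reveals this slot's objective and constraint costs.
    f_t = rng.random((K, A))
    g_t = rng.random((K, A)) - 0.5   # roughly zero-mean: constraint is satisfiable

    total_cost += f_t[np.arange(K), actions].sum()
    slot_violation = g_t[np.arange(K), actions].sum()
    total_violation += slot_violation

    # Virtual-queue (dual) update from the realized constraint costs;
    # Q grows when the global constraint is violated, pushing future
    # decisions toward feasibility.
    Q = max(Q + slot_violation, 0.0)

    f_prev, g_prev = f_t, g_t

print(total_cost / T, total_violation / T)
```

The design point being sketched: no process needs to know the others' costs or states — the scalar `Q`, computed purely from past global feedback, is the only coordination signal, which is what makes the algorithm distributed.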
Online Learning in Weakly Coupled Markov Decision Processes: A Convergence Time Study
SIGMETRICS '18: Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems