Research Article · Public Access

Online Learning in Weakly Coupled Markov Decision Processes: A Convergence Time Study

Published: 03 April 2018

Abstract

We consider multiple parallel Markov decision processes (MDPs) coupled by global constraints, where the time-varying objective and constraint functions can only be observed after a decision is made. Special attention is given to how well the decision maker can perform over T slots, starting from any state, compared to the best feasible randomized stationary policy in hindsight. We develop a new distributed online algorithm in which each MDP makes its own decision each slot after observing a multiplier computed from past information. Although this scenario is significantly more challenging than the classical online learning setting, the algorithm is shown to achieve a tight O(√T) regret and O(√T) constraint violation simultaneously. To obtain these bounds, we combine several new ingredients: ergodicity and mixing-time bounds for weakly coupled MDPs, a new regret analysis for online constrained optimization, a drift analysis for queue processes, and a perturbation analysis based on Farkas' Lemma.
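The multiplier mechanism described in the abstract is in the spirit of virtual-queue (drift-plus-penalty) methods for constrained online decisions. The following is a minimal stateless sketch of that idea only — it is not the paper's algorithm, and the two-action toy, the cost/constraint functions, and the constant V are all illustrative assumptions:

```python
# Toy virtual-queue sketch: each slot, the decision minimizes
# V*cost + Q*constraint using only a multiplier Q computed from past
# information; Q is then updated with the observed constraint value.
# All names and numbers here are illustrative, not from the paper.

V = 10.0          # penalty weight trading off cost vs. constraint pressure
ACTIONS = [0, 1]  # stateless two-action toy (the paper's components are MDPs)

def cost(a):
    # action 1 is cheaper
    return 1.0 - a

def constraint(a):
    # positive values consume budget; the target is average <= 0
    return a - 0.5

Q = 0.0           # virtual queue acting as the multiplier
total_g = 0.0
T = 1000
for _ in range(T):
    # drift-plus-penalty choice using only the past multiplier Q
    a = min(ACTIONS, key=lambda x: V * cost(x) + Q * constraint(x))
    g = constraint(a)        # constraint value observed after the decision
    total_g += g
    Q = max(Q + g, 0.0)      # queue update; Q tracks cumulative violation

avg_violation = total_g / T
```

In this deterministic toy the queue stays bounded by roughly V and the time-average constraint violation decays like O(1/T); the paper's contribution is establishing O(√T) regret and violation bounds in the far harder setting where each component is an MDP with its own state dynamics and the functions vary over time.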

