
Dynamic Regret Minimization for Control of Non-stationary Linear Dynamical Systems

Published: 28 February 2022

Abstract

We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon $T$ with fixed and known cost matrices $Q, R$, but unknown and non-stationary dynamics $A_t, B_t$. The sequence of dynamics matrices can be arbitrary, but with a total variation, $V_T$, assumed to be $o(T)$ and unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially sub-optimal controllers is available for all $t$, we present an algorithm that achieves the optimal dynamic regret of $O(V_T^{2/5} T^{3/5})$. With piecewise constant dynamics, our algorithm achieves the optimal regret of $O(\sqrt{ST})$ where $S$ is the number of switches. The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual Multi-armed Bandit problems. We also argue that non-adaptive forgetting (e.g., restarting or using sliding window learning with a static window size) may not be regret optimal for the LQR problem, even when the window size is optimally tuned with the knowledge of $V_T$. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter to be estimated is non-stationary. Our analysis also highlights that the key motif driving the regret is that the LQR problem is in spirit a bandit problem with linear feedback and locally quadratic cost. This motif is more universal than the LQR problem itself, and therefore we believe our results should find wider application.
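As a rough illustration of why OLS estimation under non-stationary dynamics needs care, the toy sketch below simulates a scalar system $x_{t+1} = a_t x_t + b u_t + w_t$ whose parameter $a_t$ switches once, and compares OLS fits on each regime with a fit pooled across the switch. It is not the algorithm from the paper: the scalar model, the fixed feedback gain `k`, the exploration noise, and all numerical values are assumptions chosen only for illustration.

```python
import random


def simulate(a_seq, b=1.0, k=0.3, noise_std=0.5, seed=0):
    """Roll out x_{t+1} = a_t*x_t + b*u_t + w_t under u_t = -k*x_t + exploration.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    x, data = 0.0, []
    for a_t in a_seq:
        u = -k * x + rng.gauss(0.0, 1.0)  # stabilizing feedback plus exploration noise
        x_next = a_t * x + b * u + rng.gauss(0.0, noise_std)
        data.append((x, u, x_next))
        x = x_next
    return data


def ols(data):
    """Least-squares fit of (a, b) in x_next ~ a*x + b*u via the 2x2 normal equations."""
    sxx = sum(x * x for x, u, y in data)
    sxu = sum(x * u for x, u, y in data)
    suu = sum(u * u for x, u, y in data)
    sxy = sum(x * y for x, u, y in data)
    suy = sum(u * y for x, u, y in data)
    det = sxx * suu - sxu * sxu
    a_hat = (suu * sxy - sxu * suy) / det
    b_hat = (sxx * suy - sxu * sxy) / det
    return a_hat, b_hat


# One switch in the dynamics: a_t = 0.5 for 2000 steps, then a_t = 0.9 for 2000 steps.
n = 2000
data = simulate([0.5] * n + [0.9] * n)

a_first, b_first = ols(data[:n])    # window entirely inside the first regime
a_second, b_second = ols(data[n:])  # window entirely inside the second regime
a_pooled, b_pooled = ols(data)      # window that straddles the switch

print("first regime  a_hat =", round(a_first, 3))
print("second regime a_hat =", round(a_second, 3))
print("pooled        a_hat =", round(a_pooled, 3))
```

In this run the per-regime fits recover values close to the true $a_t$, while the pooled fit lands between the two regimes and matches neither. This hints at why the amount of history used for estimation has to adapt to when the dynamics actually change.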

