Abstract
We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon $T$ with fixed and known cost matrices $Q, R$, but unknown and non-stationary dynamics $A_t, B_t$. The sequence of dynamics matrices can be arbitrary, but its total variation, $V_T$, is assumed to be $o(T)$ and unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially sub-optimal, controllers is available for all $t$, we present an algorithm that achieves the optimal dynamic regret of $O(V_T^{2/5} T^{3/5})$. With piecewise constant dynamics, our algorithm achieves the optimal regret of $O(\sqrt{ST})$, where $S$ is the number of switches. The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual multi-armed bandit problems. We also argue that non-adaptive forgetting (e.g., restarting or using sliding window learning with a static window size) may not be regret optimal for the LQR problem, even when the window size is optimally tuned with the knowledge of $V_T$. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter to be estimated is non-stationary. Our analysis also highlights that the key motif driving the regret is that the LQR problem is in spirit a bandit problem with linear feedback and locally quadratic cost. This motif is more universal than the LQR problem itself, and therefore we believe our results should find wider application.
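To make the role of the OLS estimator concrete, the following is a minimal sketch of least-squares estimation of the dynamics matrices from a single trajectory. It is not the paper's algorithm: it assumes stationary dynamics $x_{t+1} = A x_t + B u_t + w_t$, i.i.d. Gaussian inputs and noise, and a small example system, all chosen here purely for illustration. The function names `simulate` and `ols_estimate` are hypothetical.

```python
import numpy as np

def simulate(A, B, T, noise_std=0.1, seed=0):
    """Roll out x_{t+1} = A x_t + B u_t + w_t with i.i.d. Gaussian inputs and noise.

    Assumption for illustration: the dynamics (A, B) stay fixed over the rollout.
    """
    rng = np.random.default_rng(seed)
    n, d = B.shape
    xs = np.zeros((T + 1, n))
    us = rng.normal(size=(T, d))
    for t in range(T):
        w = noise_std * rng.normal(size=n)
        xs[t + 1] = A @ xs[t] + B @ us[t] + w
    return xs, us

def ols_estimate(xs, us):
    """Least-squares estimate of (A, B) from one trajectory."""
    Z = np.hstack([xs[:-1], us])   # regressors z_t = (x_t, u_t)
    Y = xs[1:]                     # targets x_{t+1}
    # Solve Y ~ Z @ Theta, where Theta stacks A^T on top of B^T.
    Theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    n = xs.shape[1]
    return Theta[:n].T, Theta[n:].T

# Hypothetical stable 2-state, 1-input system.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
xs, us = simulate(A, B, T=2000)
A_hat, B_hat = ols_estimate(xs, us)
```

When the dynamics drift over time, the same regression mixes samples generated under different $(A_t, B_t)$, which is the source of the estimator bias that the abstract identifies as the main technical challenge.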
Dynamic Regret Minimization for Control of Non-stationary Linear Dynamical Systems
SIGMETRICS/PERFORMANCE '22: Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems