Abstract
Stochastic approximation (SA) and stochastic gradient descent (SGD) algorithms are workhorses of modern machine learning. Their constant-stepsize variants are preferred in practice due to their fast convergence. However, constant-stepsize SA algorithms do not converge to the optimal solution; instead, the iterates settle into a stationary distribution, which in general cannot be characterized analytically. In this work, we study the asymptotic behavior of the appropriately scaled stationary distribution in the limit as the constant stepsize goes to zero. Specifically, we consider the following three settings: (1) an SGD algorithm with a smooth and strongly convex objective, (2) a linear SA algorithm involving a Hurwitz matrix, and (3) a nonlinear SA algorithm involving a contractive operator. When the iterate is scaled by 1/α, where α is the constant stepsize, we show that the limiting scaled stationary distribution is a solution of an implicit equation. Under a uniqueness assumption on this equation (which can be removed in certain settings), we further characterize the limiting distribution as a Gaussian whose covariance matrix is the unique solution of a suitable Lyapunov equation. For SA algorithms beyond these cases, our numerical experiments suggest that, unlike central limit theorem type results: (1) the scaling factor need not be 1/α, and (2) the limiting distribution need not be Gaussian. Based on this numerical study, we propose a heuristic formula for determining the right scaling factor, and draw an insightful connection to the Euler–Maruyama discretization scheme for approximating stochastic differential equations.
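The stationary-distribution behavior described above can be checked numerically in the simplest one-dimensional case. The sketch below is an illustration of the phenomenon, not code from the paper: constant-stepsize SGD on the quadratic f(x) = (μ/2)x² with additive Gaussian gradient noise is an AR(1) chain, so its stationary variance v solves the scalar Lyapunov-type equation v = (1 − αμ)²v + α²σ², giving v = ασ²/(2μ − αμ²), which vanishes linearly in α. All parameter values here (μ, σ, α) are arbitrary choices for illustration.

```python
import numpy as np

# Constant-stepsize SGD on f(x) = (mu/2) x^2 with additive Gaussian
# gradient noise: x_{k+1} = x_k - alpha * (mu * x_k + w_k), w_k ~ N(0, sigma^2).
# The iterates form an AR(1) chain, x_{k+1} = (1 - alpha*mu) x_k - alpha*w_k,
# whose stationary variance v solves v = (1 - alpha*mu)^2 v + alpha^2 sigma^2,
# i.e. v = alpha * sigma^2 / (2*mu - alpha*mu^2).

rng = np.random.default_rng(0)
mu, sigma, alpha = 1.0, 1.0, 0.01
n_steps, burn_in = 200_000, 10_000

x = 0.0
samples = []
for k in range(n_steps):
    grad_noise = rng.normal(0.0, sigma)
    x -= alpha * (mu * x + grad_noise)   # noisy gradient step
    if k >= burn_in:                     # discard transient, keep stationary part
        samples.append(x)

empirical_var = float(np.var(samples))
predicted_var = alpha * sigma**2 / (2 * mu - alpha * mu**2)
print(empirical_var, predicted_var)  # both close to alpha/2 = 0.005
```

The empirical variance of the tail of the chain matches the closed-form solution of the scalar Lyapunov equation, and rerunning with smaller α shows the variance shrinking proportionally to α, consistent with the vanishing-stepsize limit studied in the paper.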
Stationary Behavior of Constant Stepsize SGD Type Algorithms: An Asymptotic Characterization