Research Article (Public Access)

Stationary Behavior of Constant Stepsize SGD Type Algorithms: An Asymptotic Characterization

Published: 28 February 2022

Abstract

Stochastic approximation (SA) and stochastic gradient descent (SGD) algorithms are workhorses of modern machine learning. Their constant stepsize variants are preferred in practice due to their fast convergence. However, constant stepsize SA algorithms do not converge to the optimal solution; instead, the iterates form a Markov chain with a stationary distribution, which in general cannot be characterized analytically. In this work, we study the asymptotic behavior of the appropriately scaled stationary distribution in the limit as the constant stepsize goes to zero. Specifically, we consider the following three settings: (1) the SGD algorithm with a smooth and strongly convex objective, (2) the linear SA algorithm involving a Hurwitz matrix, and (3) the nonlinear SA algorithm involving a contractive operator. When the iterate is scaled by 1/√α, where α is the constant stepsize, we show that the limiting scaled stationary distribution is a solution of an implicit equation. Under a uniqueness assumption on this equation (which can be removed in certain settings), we further characterize the limiting distribution as a Gaussian whose covariance matrix is the unique solution of a suitable Lyapunov equation. For SA algorithms beyond these cases, our numerical experiments suggest that, unlike central limit theorem type results, (1) the scaling factor need not be 1/√α, and (2) the limiting distribution need not be Gaussian. Based on the numerical study, we propose a heuristic formula for determining the right scaling factor, and draw an insightful connection to the Euler-Maruyama discretization scheme for approximating stochastic differential equations.
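The scaling claim in setting (1) can be checked numerically in the simplest case. The sketch below (an illustration, not the paper's experiment) runs constant-stepsize SGD on the scalar quadratic f(x) = x²/2 with additive Gaussian gradient noise, so the update is x ← x − α(x + w). For this linear recursion the stationary variance of x is ασ²/(2 − α), hence the iterate scaled by 1/√α has variance σ²/(2 − α) → σ²/2 as α → 0, which is exactly the solution of the Lyapunov equation 2Σ = σ² for this problem.

```python
import random

def scaled_stationary_samples(alpha, sigma=1.0, burn_in=10_000, n=200_000, seed=0):
    """Sample the scaled iterate x_k / sqrt(alpha) for constant-stepsize SGD on
    f(x) = x^2 / 2 with gradient noise w_k ~ N(0, sigma^2):
        x_{k+1} = x_k - alpha * (x_k + w_k).
    A burn-in phase lets the Markov chain approach its stationary distribution."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(burn_in):
        x = x - alpha * (x + rng.gauss(0.0, sigma))
    samples = []
    for _ in range(n):
        x = x - alpha * (x + rng.gauss(0.0, sigma))
        samples.append(x / alpha ** 0.5)
    return samples

alpha, sigma = 0.01, 1.0
s = scaled_stationary_samples(alpha, sigma)
mean = sum(s) / len(s)
var = sum((v - mean) ** 2 for v in s) / len(s)
# Exact scaled stationary variance: sigma^2 / (2 - alpha), close to the
# Lyapunov-equation limit sigma^2 / 2 for small alpha.
exact = sigma ** 2 / (2 - alpha)
```

The empirical variance of the scaled samples should match σ²/(2 − α) ≈ 0.5025 up to Monte Carlo error, consistent with a Gaussian limit of variance σ²/2 as α → 0.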

