ABSTRACT
Sparsity regularized loss minimization problems play an important role in various fields, including machine learning, data mining, and modern statistics. The proximal gradient descent method and the coordinate descent method are the most popular approaches to solving such problems. Although existing methods can achieve implicit model identification, also known as support set identification, within a finite number of iterations, they still suffer from large computational costs and memory burdens in high-dimensional scenarios. The reason is that the support set identification in these methods is implicit and thus cannot explicitly expose the low-complexity structure in practice; that is, they cannot discard the useless coefficients of inactive features to achieve algorithmic acceleration via dimension reduction. To address this challenge, we propose a novel accelerated doubly stochastic gradient descent (ADSGD) method for sparsity regularized loss minimization problems, which reduces the number of block iterations by eliminating inactive coefficients during optimization and thus achieves faster explicit model identification and improved algorithmic efficiency. Theoretically, we first prove that ADSGD achieves a linear convergence rate with lower overall computational complexity. More importantly, we prove that ADSGD achieves a linear rate of explicit model identification. Numerically, experiments on benchmark datasets confirm the efficiency of the proposed method.
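To make the description above concrete, the following is a minimal sketch, not the authors' ADSGD implementation, of a doubly stochastic proximal update for a lasso-type objective min_w (1/2n)||Xw - y||_2^2 + lam*||w||_1: each iteration samples one example and one coordinate block, applies a soft-thresholding step on that block, and, after a burn-in, stops updating coordinates that have settled at zero as a crude stand-in for the paper's explicit screening of inactive coefficients. Function names, step sizes, and the pruning heuristic are illustrative assumptions.

import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1 (elementwise soft-thresholding).
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def doubly_stochastic_lasso(X, y, lam=0.1, lr=0.01, block_size=10,
                            epochs=50, burn_in=5, seed=0):
    # Hypothetical doubly stochastic proximal gradient loop for
    # min_w (1/2n)*||X w - y||_2^2 + lam*||w||_1.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    active = np.arange(d)  # coordinates still being updated
    for epoch in range(epochs):
        for _ in range(n):
            i = rng.integers(n)                    # sample one example ...
            blk = rng.choice(active,               # ... and one coordinate block
                             size=min(block_size, active.size),
                             replace=False)
            resid = X[i] @ w - y[i]                # scalar residual of example i
            grad_blk = resid * X[i, blk]           # partial stochastic gradient
            w[blk] = soft_threshold(w[blk] - lr * grad_blk, lr * lam)
        if epoch >= burn_in:
            # Crude stand-in for explicit model identification:
            # freeze coordinates that have remained at exactly zero.
            nonzero = np.flatnonzero(w != 0.0)
            if nonzero.size:
                active = nonzero
    return w

The double sampling keeps the per-iteration cost proportional to the block size rather than to the full dimension, and shrinking the active set is what turns implicit support identification into an explicit dimension reduction, which is the source of the acceleration claimed in the abstract.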