
Securing Distributed Gradient Descent in High Dimensional Statistical Learning

Published: 26 March 2019

Abstract

We consider unreliable distributed learning systems wherein the training data is kept confidential by external workers, and the learner has to interact closely with those workers to train a model. In particular, we assume that there exists a system adversary that can adaptively compromise some workers; the compromised workers deviate from their local designed specifications by sending out arbitrarily malicious messages.

We assume that in each communication round, up to q out of the m workers suffer Byzantine faults. Each worker keeps a local sample of size n, and the total sample size is N = nm. We propose a secured variant of the gradient descent method that can tolerate up to a constant fraction of Byzantine workers, i.e., q/m = O(1). Moreover, we show the statistical estimation error of the iterates converges in O(log N) rounds to O(√(q/N) + √(d/N)), where d is the model dimension. As long as q = O(d), our proposed algorithm achieves the optimal error rate O(√(d/N)). Our results are obtained under some technical assumptions. Specifically, we assume strongly-convex population risk. Nevertheless, the empirical risk (sample version) is allowed to be non-convex.

The core of our method is to robustly aggregate the gradients computed by the workers based on the filtering procedure proposed by Steinhardt et al. On the technical front, deviating from the existing literature on robustly estimating a finite-dimensional mean vector, we establish a uniform concentration of the sample covariance matrix of gradients, and show that the aggregated gradient, as a function of the model parameter, converges uniformly to the true gradient function. To obtain a near-optimal uniform concentration bound, we develop a new matrix concentration inequality, which might be of independent interest.
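To make the overall structure concrete, the following is a minimal Python sketch of the kind of pipeline the abstract describes: plain distributed gradient descent in which the coordinator replaces the naive average of the m reported gradients with a robust aggregate computed by a simplified spectral-filtering step in the spirit of Steinhardt et al. This is not the paper's exact algorithm; the function names (spectral_filter_mean, byzantine_tolerant_gd) and the parameters (sigma_bound, step_size, rounds) are illustrative assumptions, not quantities specified in the paper.

    import numpy as np

    def spectral_filter_mean(grads, sigma_bound, max_iter=20):
        # grads: (m, d) array, one reported gradient per worker.
        m = grads.shape[0]
        w = np.ones(m) / m
        for _ in range(max_iter):
            mu = w @ grads                                 # weighted mean of the reports
            centered = grads - mu
            cov = (centered * w[:, None]).T @ centered     # weighted sample covariance
            eigvals, eigvecs = np.linalg.eigh(cov)
            lam, v = eigvals[-1], eigvecs[:, -1]           # top eigenvalue / eigenvector
            if lam <= sigma_bound:                         # covariance already well-behaved
                break
            scores = (centered @ v) ** 2                   # outlier scores along top direction
            w = w * (1.0 - scores / scores.max())          # down-weight the most extreme reports
            w = w / w.sum()
        return w @ grads

    def byzantine_tolerant_gd(local_grad_fns, theta0, step_size=0.1,
                              rounds=50, sigma_bound=1.0):
        # Standard distributed GD, except the plain average of worker gradients
        # is replaced by the robust aggregate above in every round.
        theta = np.asarray(theta0, dtype=float).copy()
        for _ in range(rounds):
            grads = np.stack([g(theta) for g in local_grad_fns])  # one row per worker
            theta = theta - step_size * spectral_filter_mean(grads, sigma_bound)
        return theta

In a synthetic experiment, local_grad_fns would contain one gradient oracle per worker, with up to q of them returning arbitrary vectors; the only point of the sketch is that robust aggregation, rather than averaging, is what tolerates those faulty reports.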

References

  1. Radosław Adamczak, Alexander Litvak, Alain Pajor, and Nicole Tomczak-Jaegermann. 2010. Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles. Journal of the American Mathematical Society, Vol. 23, 2 (2010), 535--561.
  2. Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. 2018. Byzantine Stochastic Gradient Descent. arXiv preprint arXiv:1803.08917 (2018).
  3. Dimitri P. Bertsekas. 2015. Convex Optimization Algorithms. Athena Scientific, Belmont, MA.
  4. Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. 2017. Byzantine-Tolerant Machine Learning. arXiv preprint arXiv:1703.02757 (2017).
  5. Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, Vol. 3, 1 (2011), 1--122.
  6. Stephen Boyd and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
  7. Moses Charikar, Jacob Steinhardt, and Gregory Valiant. 2017. Learning from Untrusted Data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2017). ACM, New York, NY, USA, 47--60.
  8. Yudong Chen, Lili Su, and Jiaming Xu. 2017. Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent. Proc. ACM Meas. Anal. Comput. Syst., Vol. 1, 2, Article 44 (Dec. 2017), 25 pages.
  9. I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. 2016. Robust Estimators in High Dimensions without the Computational Intractability. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). 655--664.
  10. Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. 2017. Being Robust (in High Dimensions) Can Be Practical. CoRR, Vol. abs/1703.00893 (2017). arXiv:1703.00893 http://arxiv.org/abs/1703.00893
  11. Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. 2018. Sever: A Robust Meta-Algorithm for Stochastic Optimization. arXiv preprint arXiv:1803.02815 (2018).
  12. John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. 2014. Privacy Aware Learning. J. ACM, Vol. 61, 6, Article 38 (Dec. 2014), 57 pages.
  13. Jiashi Feng, Huan Xu, and Shie Mannor. 2014. Distributed Robust Learning. arXiv preprint arXiv:1409.5937 (2014).
  14. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics.
  15. Peter J. Huber. 2011. Robust statistics. In International Encyclopedia of Statistical Science. Springer, 1248--1251.
  16. Adam Klivans, Pravesh K. Kothari, and Raghu Meka. 2018. Efficient Algorithms for Outlier-Robust Regression. arXiv preprint arXiv:1803.03241 (2018).
  17. Jakub Konečný, Brendan McMahan, and Daniel Ramage. 2015. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575 (2015).
  18. Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated Learning: Strategies for Improving Communication Efficiency. In NIPS Workshop on Private Multi-Party Machine Learning. https://arxiv.org/abs/1610.05492
  19. Kevin A. Lai, Anup B. Rao, and Santosh Vempala. 2016. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on. IEEE, 665--674.
  20. Nancy A. Lynch. 1996. Distributed Algorithms. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
  21. Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. 2017. Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion and Blind Deconvolution. arXiv preprint arXiv:1711.10467 (2017).
  22. Brendan McMahan and Daniel Ramage. 2017. Federated Learning: Collaborative Machine Learning without Centralized Training Data. https://research.googleblog.com/2017/04/federated-learning-collaborative.html (April 2017). Accessed: 2017-04-06.
  23. Song Mei, Yu Bai, and Andrea Montanari. 2016. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534 (2016).
  24. Sahand Negahban and Martin J. Wainwright. 2011. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, Vol. 13, 1 (2011), 1665--1697.
  25. Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. 2018. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485 (2018).
  26. Maxim Raginsky and Bruce Hajek. (n.d.). ECE 543: Statistical Learning Theory. Lecture notes.
  27. Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
  28. Alex Smola and S. V. N. Vishwanathan. 2008. Introduction to Machine Learning. Cambridge University, UK, Vol. 32 (2008), 34.
  29. Jacob Steinhardt, Moses Charikar, and Gregory Valiant. 2018. Resilience: A Criterion for Learning in the Presence of Arbitrary Outliers. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018) (Leibniz International Proceedings in Informatics (LIPIcs)), Anna R. Karlin (Ed.), Vol. 94. Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 45:1--45:21.
  30. Lili Su. 2017. Defending distributed systems against adversarial attacks: Consensus, consensus-based learning, and statistical learning. Ph.D. Dissertation. University of Illinois at Urbana-Champaign.
  31. Lili Su and Nitin H. Vaidya. 2016. Fault-Tolerant Multi-Agent Optimization: Optimal Iterative Distributed Algorithms. In Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing (PODC '16). ACM, New York, NY, USA, 425--434.
  32. T. Tao. 2012. Topics in Random Matrix Theory. American Mathematical Society, Providence, RI, USA.
  33. Roman Vershynin. 2010. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 (2010).
  34. Roman Vershynin. 2012. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, Vol. 25, 3 (2012), 655--686.
  35. Roman Vershynin. 2018. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.
  36. Martin Wainwright. 2015. Basic tail and concentration bounds. URL: https://www.stat.berkeley.edu/.../Chap2_TailBounds_Jan22_2015.pdf (visited on 12/31/2017) (2015).
  37. Yihong Wu. 2017. Lecture Notes on Information-theoretic Methods For High-dimensional Statistics. (April 2017). http://www.stat.yale.edu/~yw562/teaching/it-stats.pdf
  38. Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. 2018a. Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates. arXiv preprint arXiv:1803.01498 (2018).
  39. Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. 2018b. Defending Against Saddle Point Attack in Byzantine-Robust Distributed Learning. arXiv preprint arXiv:1806.05358 (2018).
  40. Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. 2013. Communication-Efficient Algorithms for Statistical Optimization. Journal of Machine Learning Research, Vol. 14 (2013), 3321--3363. http://jmlr.org/papers/v14/zhang13b.html

