
Fundamental Limits of Approximate Gradient Coding

Published: 17 December 2019

Abstract

In the distributed gradient coding problem, it has been established that, to exactly recover the gradient in the presence of $s$ slow machines (stragglers), the minimum computation load (number of stored data partitions) of each worker must be at least $s+1$, i.e., linear in $s$, which incurs a large overhead when $s$ is large [17]. In this paper, we focus on approximate gradient coding, which aims to recover the gradient with bounded error $\epsilon$. Theoretically, our main contributions are three-fold: (i) we analyze the structure of optimal gradient codes and derive the information-theoretic lower bound on the minimum computation load $d$: $d \geq O(\log(n)/\log(n/s))$ for $\epsilon = 0$ and $d \geq O(\log(1/\epsilon)/\log(n/s))$ for $\epsilon > 0$, where $n$ is the number of workers, $d$ is the computation load, and $\epsilon$ is the error in the gradient computation; (ii) we design two approximate gradient coding schemes, based on a random edge removal process, that exactly match these lower bounds; (iii) we implement our schemes and demonstrate their advantage over the current fastest gradient coding strategies. The proposed schemes provide an order-wise improvement over the state of the art in terms of computation load, and are also optimal in terms of both computation load and latency.
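The paper's optimal constructions are built on a random edge removal process; as a rough illustration of the approximate-recovery idea only (closer in spirit to the sparse random-graph codes of [2] than to the authors' exact schemes), the Python sketch below assigns each of n workers d randomly chosen data partitions and decodes with a simple rescaled average over the non-straggling workers. All names and parameters here (simulate, n, s, dim, the naive decoder) are illustrative assumptions, not taken from the paper.

    import numpy as np

    # Illustrative sketch, NOT the authors' construction: each of n workers
    # stores d partitions chosen uniformly at random and returns the sum of
    # their partial gradients; the master rescales the average of the sums
    # received from the n - s surviving workers. Since each partition lands
    # in a worker's set with probability d/n, the rescaled average is an
    # unbiased estimate of the full gradient, and its error shrinks as the
    # per-worker load d grows.

    rng = np.random.default_rng(0)
    n, s = 100, 20          # workers, stragglers (illustrative values)
    dim = 10                # gradient dimension

    def simulate(d, trials=200):
        errors = []
        for _ in range(trials):
            partials = rng.normal(size=(n, dim))   # per-partition gradients
            full = partials.sum(axis=0)            # exact full gradient
            # random assignment: worker i stores d distinct partitions
            assign = np.array([rng.choice(n, size=d, replace=False)
                               for _ in range(n)])
            # each worker returns the sum of its stored partial gradients
            returns = partials[assign].sum(axis=1)
            # s randomly chosen stragglers never respond
            alive = rng.choice(n, size=n - s, replace=False)
            # naive decoder: rescale the mean of the received sums
            est = returns[alive].mean(axis=0) * (n / d)
            errors.append(np.linalg.norm(est - full) ** 2
                          / np.linalg.norm(full) ** 2)
        return np.mean(errors)

    for d in (1, 2, 4, 8):
        print(f"d = {d}: mean relative error ~ {simulate(d):.3f}")

Running the sketch shows the relative recovery error falling as d grows, qualitatively consistent with the $d \geq O(\log(1/\epsilon)/\log(n/s))$ trade-off established in the paper; the paper's coded schemes achieve this trade-off exactly, which this uncoded simulation does not claim to do.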

References

  1. Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010. Springer, 177--186.
  2. Zachary Charles, Dimitris Papailiopoulos, and Jordan Ellenberg. 2017. Approximate gradient coding via sparse random graphs. arXiv preprint arXiv:1711.06771 (2017).
  3. Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, et al. 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems. 1223--1231.
  4. Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM, Vol. 51, 1 (2008), 107--113.
  5. Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. 2016. Short-dot: Computing large linear transforms distributedly using coded short dot products. In Advances in Neural Information Processing Systems. 2100--2108.
  6. Can Karakus, Yifan Sun, Suhas Diggavi, and Wotao Yin. 2017. Straggler mitigation in distributed optimization through data encoding. In Advances in Neural Information Processing Systems. 5434--5442.
  7. Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, and Andrew Y. Ng. 2011. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning. Omnipress, 265--272.
  8. Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. 2017a. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory (2017).
  9. Kangwook Lee, Changho Suh, and Kannan Ramchandran. 2017b. High-dimensional coded matrix multiplication. In 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2418--2422.
  10. Nathan Linial and Noam Nisan. 1990. Approximate inclusion-exclusion. Combinatorica, Vol. 10, 4 (1990), 349--365.
  11. Michael G. Luby, Michael Mitzenmacher, Mohammad Amin Shokrollahi, and Daniel A. Spielman. 2001. Efficient erasure correcting codes. IEEE Transactions on Information Theory, Vol. 47, 2 (2001), 569--584.
  12. Raj Kumar Maity, Ankit Singh Rawat, and Arya Mazumdar. 2018. Robust gradient descent via moment encoding with LDPC codes. SysML (2018).
  13. Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. 2015. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807 (2015).
  14. Netanel Raviv, Itzhak Tamo, Rashish Tandon, and Alexandros G. Dimakis. 2018. Gradient coding from cyclic MDS codes and expander graphs. (2018).
  15. Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alexander J. Smola. 2015. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems. 2647--2655.
  16. Amin Shokrollahi. 2006. Raptor codes. IEEE/ACM Transactions on Networking, Vol. 14, SI (2006), 2551--2567.
  17. Rashish Tandon, Qi Lei, Alexandros G. Dimakis, and Nikos Karampatziakis. 2017. Gradient coding: Avoiding stragglers in distributed learning. In International Conference on Machine Learning. 3368--3376.
  18. Sinong Wang, Jiashang Liu, and Ness Shroff. 2018. Coded sparse matrix multiplication. In International Conference on Machine Learning.
  19. Sinong Wang, Jiashang Liu, Ness Shroff, and Pengyu Yang. 2019. Computation efficient coded linear transform. In International Conference on Artificial Intelligence and Statistics.
  20. Neeraja J. Yadwadkar, Bharath Hariharan, Joseph E. Gonzalez, and Randy Katz. 2016. Multi-task learning for straggler avoiding predictive job scheduling. The Journal of Machine Learning Research, Vol. 17, 1 (2016), 3692--3728.
  21. Min Ye and Emmanuel Abbe. 2018. Communication-computation efficient gradient coding. In International Conference on Machine Learning.
  22. Qian Yu, Mohammad Maddah-Ali, and Salman Avestimehr. 2017. Polynomial codes: an optimal design for high-dimensional coded matrix multiplication. In Advances in Neural Information Processing Systems. 4406--4416.
  23. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. HotCloud, Vol. 10, 10 (2010), 95.

