Abstract
In the distributed gradient coding problem, it has been established that, to exactly recover the gradient under $s$ slow machines, the minimum computation load (number of stored data partitions) of each worker must be at least $s+1$, i.e., it grows linearly in $s$, which incurs a large overhead when $s$ is large (Tandon et al., 2017). In this paper, we focus on approximate gradient coding, which aims to recover the gradient with bounded error $\epsilon$. Theoretically, our main contributions are three-fold: (i) we analyze the structure of optimal gradient codes and derive information-theoretic lower bounds on the minimum computation load $d$: $d \geq O(\log(n)/\log(n/s))$ for $\epsilon = 0$ and $d \geq O(\log(1/\epsilon)/\log(n/s))$ for $\epsilon > 0$, where $n$ is the number of workers, $d$ is the computation load, and $\epsilon$ is the error in the gradient computation; (ii) we design two approximate gradient coding schemes, based on a random edge removal process, that exactly match these lower bounds; (iii) we implement our schemes and demonstrate their advantage over the current fastest gradient coding strategies. The proposed schemes provide an order-wise improvement over the state of the art in terms of computation load, and are also optimal in terms of both computation load and latency.
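To make the setting concrete, the sketch below is a minimal Python/NumPy illustration of approximate gradient coding with a random data assignment; it is not the paper's random-edge-removal construction, and all function and variable names are hypothetical. Each of n workers stores d of the data partitions, non-straggling workers return the sum of their partial gradients, and the master rescales the aggregate so that it approximates the full gradient up to a small error.

import numpy as np

def assign_partitions(n_workers, n_parts, d, rng):
    # Each worker stores d distinct partitions chosen uniformly at random.
    return [rng.choice(n_parts, size=d, replace=False) for _ in range(n_workers)]

def worker_message(assignment, partial_gradients):
    # A worker returns the unweighted sum of the partial gradients it stores.
    return sum(partial_gradients[p] for p in assignment)

def approximate_decode(messages, survivors, d, n_parts, dim):
    # Sum the surviving workers' messages and rescale by the expected number
    # of surviving copies of each partition, so the estimate is unbiased
    # in expectation over the random assignment.
    g_hat = np.zeros(dim)
    for w in survivors:
        g_hat += messages[w]
    expected_copies = len(survivors) * d / n_parts
    return g_hat / expected_copies

# Toy run: n = 8 workers, 8 data partitions, computation load d = 2, s = 2 stragglers.
rng = np.random.default_rng(0)
dim, n_workers, n_parts, d, s = 4, 8, 8, 2, 2
partial_gradients = [rng.standard_normal(dim) for _ in range(n_parts)]
assignments = assign_partitions(n_workers, n_parts, d, rng)
messages = {w: worker_message(assignments[w], partial_gradients) for w in range(n_workers)}
survivors = range(n_workers - s)   # the last s workers straggle
g_hat = approximate_decode(messages, survivors, d, n_parts, dim)
g_true = sum(partial_gradients)
print(np.linalg.norm(g_hat - g_true) / np.linalg.norm(g_true))   # relative error epsilon

In this simple randomized scheme the error shrinks as the computation load d grows, which is the trade-off that the paper's lower bounds and optimal constructions characterize precisely.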
References
- Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010. Springer, 177--186.
- Zachary Charles, Dimitris Papailiopoulos, and Jordan Ellenberg. 2017. Approximate gradient coding via sparse random graphs. arXiv preprint arXiv:1711.06771 (2017).
- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems. 1223--1231.
- Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM, Vol. 51, 1 (2008), 107--113.
- Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. 2016. Short-dot: Computing large linear transforms distributedly using coded short dot products. In Advances in Neural Information Processing Systems. 2100--2108.
- Can Karakus, Yifan Sun, Suhas Diggavi, and Wotao Yin. 2017. Straggler mitigation in distributed optimization through data encoding. In Advances in Neural Information Processing Systems. 5434--5442.
- Quoc V Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, and Andrew Y Ng. 2011. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning. Omnipress, 265--272.
- Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. 2017a. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory (2017).
- Kangwook Lee, Changho Suh, and Kannan Ramchandran. 2017b. High-dimensional coded matrix multiplication. In Information Theory (ISIT), 2017 IEEE International Symposium on. IEEE, 2418--2422.
- Nathan Linial and Noam Nisan. 1990. Approximate inclusion-exclusion. Combinatorica, Vol. 10, 4 (1990), 349--365.
- Michael G Luby, Michael Mitzenmacher, Mohammad Amin Shokrollahi, and Daniel A Spielman. 2001. Efficient erasure correcting codes. IEEE Transactions on Information Theory, Vol. 47, 2 (2001), 569--584.
- Raj Kumar Maity, Ankit Singh Rawat, and Arya Mazumdar. 2018. Robust Gradient Descent via Moment Encoding with LDPC Codes. SysML (2018).
- Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. 2015. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807 (2015).
- Netanel Raviv, Itzhak Tamo, Rashish Tandon, and Alexandros G Dimakis. 2018. Gradient coding from cyclic MDS codes and expander graphs. (2018).
- Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. 2015. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems. 2647--2655.
- Amin Shokrollahi. 2006. Raptor codes. IEEE/ACM Transactions on Networking (TON), Vol. 14, SI (2006), 2551--2567.
- Rashish Tandon, Qi Lei, Alexandros G Dimakis, and Nikos Karampatziakis. 2017. Gradient coding: Avoiding stragglers in distributed learning. In International Conference on Machine Learning. 3368--3376.
- Sinong Wang, Jiashang Liu, and Ness Shroff. 2018. Coded Sparse Matrix Multiplication. In International Conference on Machine Learning.
- Sinong Wang, Jiashang Liu, Ness Shroff, and Pengyu Yang. 2019. Computation Efficient Coded Linear Transform. In International Conference on Artificial Intelligence and Statistics.
- Neeraja J Yadwadkar, Bharath Hariharan, Joseph E Gonzalez, and Randy Katz. 2016. Multi-task learning for straggler avoiding predictive job scheduling. The Journal of Machine Learning Research, Vol. 17, 1 (2016), 3692--3728.
- Min Ye and Emmanuel Abbe. 2018. Communication-computation efficient gradient coding. In International Conference on Machine Learning.
- Qian Yu, Mohammad Maddah-Ali, and Salman Avestimehr. 2017. Polynomial codes: an optimal design for high-dimensional coded matrix multiplication. In Advances in Neural Information Processing Systems. 4406--4416.
- Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. HotCloud, Vol. 10, 10--10 (2010), 95.