Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication

Abstract
Large-scale machine learning and data mining applications require computer systems to perform massive matrix-vector and matrix-matrix multiplications that must be parallelized across multiple nodes. The presence of straggling nodes -- computing nodes that unpredictably slow down or fail -- is a major bottleneck in such distributed computations. Ideal load-balancing strategies that dynamically allocate more tasks to faster nodes require knowledge or monitoring of node speeds, as well as the ability to move data quickly between nodes. Recently proposed fixed-rate erasure coding strategies can handle unpredictable node slowdown, but they discard the partial work done by straggling nodes and thus perform a large amount of redundant computation. We propose a rateless fountain coding strategy that achieves the best of both worlds -- we prove that its latency is asymptotically equal to that of ideal load balancing, and that it performs asymptotically zero redundant computation. Our idea is to create linear combinations of the m rows of the matrix and assign these encoded rows to different worker nodes. The original matrix-vector product can be decoded as soon as slightly more than m row-vector products are collectively completed by the nodes. We conduct experiments in three computing environments -- local parallel computing, Amazon EC2, and Amazon Lambda -- which show that rateless coding gives up to a 3x speed-up over uncoded schemes.
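The encode/decode idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's exact LT-code construction (which draws encoded-row degrees from a Soliton-style distribution and decodes by peeling); here each encoded row is a random degree-4 combination of source rows, workers finish in a random order, and the master decodes by least squares as soon as the collected encoded products span all m source rows. All names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

m, n = 8, 5
A = rng.standard_normal((m, n))   # the m x n matrix
x = rng.standard_normal(n)        # the vector to multiply

# Encoder: each encoded row is a random linear combination of a few source
# rows of A. (Illustrative fixed degree 4; the paper uses a rateless
# fountain-code degree distribution.)
num_encoded = 3 * m
G = np.zeros((num_encoded, m))    # generator matrix
for i in range(num_encoded):
    idx = rng.choice(m, size=4, replace=False)
    G[i, idx] = 1.0
A_enc = G @ A                     # encoded rows, distributed to workers

# Workers each compute one encoded row-vector product and finish in a
# random order (stragglers finish last).
order = rng.permutation(num_encoded)
done_rows, done_vals = [], []
for i in order:
    done_rows.append(G[i])
    done_vals.append(A_enc[i] @ x)
    Gd = np.array(done_rows)
    # Decodable as soon as slightly more than m products span all m rows.
    if len(done_rows) >= m and np.linalg.matrix_rank(Gd) == m:
        break

# Decoder: solve Gd @ y = done_vals for y = A @ x.
y, *_ = np.linalg.lstsq(Gd, np.array(done_vals), rcond=None)
print(np.allclose(y, A @ x), len(done_rows))
```

The key property this sketch demonstrates is that decoding starts as soon as roughly m encoded products are finished in total, regardless of which workers produced them, so partial work from slow workers is never wasted.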
Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication. In SIGMETRICS '20: Abstracts of the 2020 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems.