research-article
Open Access

Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication

Published: 17 December 2019

Abstract

Large-scale machine learning and data mining applications require computer systems to perform massive matrix-vector and matrix-matrix multiplications that must be parallelized across multiple nodes. The presence of straggling nodes -- computing nodes that unpredictably slow down or fail -- is a major bottleneck in such distributed computations. Ideal load balancing strategies that dynamically allocate more tasks to faster nodes require knowledge or monitoring of node speeds as well as the ability to quickly move data. Recently proposed fixed-rate erasure coding strategies can handle unpredictable node slowdown, but they discard the partial work done by straggling nodes and thus perform a large amount of redundant computation. We propose a rateless fountain coding strategy that achieves the best of both worlds: we prove that its latency is asymptotically equal to that of ideal load balancing, and that it performs asymptotically zero redundant computation. Our idea is to create linear combinations of the m rows of the matrix and assign these encoded rows to different worker nodes. The original matrix-vector product can be decoded as soon as slightly more than m row-vector products are collectively finished by the nodes. Experiments in three computing environments -- local parallel computing, Amazon EC2, and Amazon Lambda -- show that rateless coding gives as much as a 3x speed-up over uncoded schemes.
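The encoding/decoding idea in the abstract can be illustrated with a minimal NumPy sketch. All names and parameters below are illustrative assumptions, not taken from the paper: for simplicity it uses dense Gaussian random combinations and least-squares decoding, whereas an actual LT/fountain code uses sparse combinations drawn from a Soliton degree distribution so that cheap iterative "peeling" decoding is possible.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 100, 50                      # A is m x n; we want b = A @ x
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# Encode: each coded row is a random linear combination of the m rows of A.
# "Rateless" means coded rows can be generated indefinitely; here we generate
# ~10% more than m, matching the "slightly more than m" decoding threshold.
num_coded = int(1.10 * m)
G = rng.standard_normal((num_coded, m))   # encoding matrix
coded_rows = G @ A                        # rows handed out to worker nodes

# Each worker computes row-vector products on its assigned coded rows.
# Stragglers may finish fewer of them; decoding needs only m (plus a small
# overhead) results in total, regardless of which workers produced them.
coded_products = coded_rows @ x           # shape (num_coded,)

# Decode: recover A @ x by solving the overdetermined system G y = products.
decoded, *_ = np.linalg.lstsq(G, coded_products, rcond=None)

print(np.allclose(decoded, A @ x))
```

Because every coded product carries information about the full result, no worker's partial progress is wasted, which is the source of the near-zero redundant computation the abstract claims.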

