
Bridging the gap between HPC and big data frameworks

Published: 01 April 2017

Abstract

Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower than native implementations written with high-performance computing tools such as MPI. There is a need to bridge this performance gap while retaining the benefits of the Spark ecosystem such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation to an MPI environment from within Spark provides 3.1–17.7× speedups on the four sparse applications, including all overheads. This opens up an avenue to reuse existing MPI libraries in Spark with little effort.
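The offload pattern the abstract describes (handing a Spark partition to a native MPI process and collecting the results) can be sketched generically. The following is an illustrative sketch only, not the paper's actual system: it assumes the usual mechanics of such bridges (serialize the partition to storage the native process can read, launch the external executable, read results back). Here the Unix `sort -n` command is a hypothetical stand-in for an `mpirun` invocation so the sketch runs anywhere.

```python
import os
import subprocess
import tempfile


def offload_partition(records, command=("sort", "-n")):
    """Hand a partition of records to an external native process and
    return its output. In an MPI-offload setting, `command` would be an
    mpirun invocation of a native library; `sort -n` is a stand-in."""
    # (1) Serialize the partition to a file the external process can read.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(str(r) for r in records) + "\n")
        path = f.name
    try:
        # (2) Launch the external process on the serialized partition.
        result = subprocess.run(
            list(command) + [path],
            capture_output=True, text=True, check=True,
        )
        # (3) Deserialize the results back into the driver/executor.
        return [int(line) for line in result.stdout.split()]
    finally:
        os.remove(path)
```

In a real deployment the serialization step would target a shared medium (e.g. HDFS or shared memory) visible to all MPI ranks, and the per-record copy in and out is exactly the overhead the paper's measured speedups must absorb.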



Published in

Proceedings of the VLDB Endowment, Volume 10, Issue 8 (April 2017), 60 pages
ISSN: 2150-8097
Publisher: VLDB Endowment
Qualifiers: research-article