DOI: 10.1145/2882903.2903740
short-paper
Public Access

SparkR: Scaling R Programs with Spark

Published: 14 June 2016

ABSTRACT

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited because the R runtime is single-threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large-scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation, and present some of the key details of our implementation.

      • Published in

        SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
        June 2016
        2300 pages
        ISBN: 9781450335317
        DOI: 10.1145/2882903

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 785 of 4,003 submissions, 20%
