ABSTRACT
R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited because the R runtime is single-threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large-scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation, and present some of the key details of our implementation.
SparkR: Scaling R Programs with Spark