
FlashR: parallelize and scale R for machine learning using SSDs

Published: 10 February 2018

Abstract

R is one of the most popular programming languages for statistics and machine learning, but it is slow and unable to scale to large datasets. The conventional approach to making an algorithm efficient in R is to implement it in C or FORTRAN and provide an R wrapper. FlashR accelerates and scales existing R code by parallelizing a large number of matrix functions in the R base package and scaling them beyond memory capacity with solid-state drives (SSDs). FlashR performs memory-hierarchy-aware execution to speed up parallelized R code by (i) evaluating matrix operations lazily, (ii) performing all operations in a DAG in a single execution, with only one pass over the data, to increase the ratio of computation to I/O, and (iii) partitioning matrices at two levels and reordering computation on matrix partitions to reduce data movement in the memory hierarchy. We evaluate FlashR on various machine learning and statistics algorithms on inputs of up to four billion data points. Despite the huge performance gap between SSDs and RAM, FlashR on SSDs closely tracks the performance of FlashR in memory for many algorithms. The R implementations in FlashR outperform H2O and Spark MLlib by a factor of 3 -- 20.
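The lazy-evaluation-plus-fusion idea the abstract describes can be illustrated with a toy sketch. FlashR itself is an R package backed by native code; the Python below is only a conceptual analogue under assumed names (`LazyMat`, `materialize` are hypothetical, not FlashR's API): operators build a DAG instead of computing, and materialization evaluates the whole DAG on one row partition at a time, so each partition is read once and all fused operations run on it together.

```python
import numpy as np

class LazyMat:
    """Toy lazy matrix: arithmetic records nodes in a DAG instead of
    computing immediately (hypothetical names, not FlashR's API)."""
    def __init__(self, data=None, op=None, args=()):
        self.data, self.op, self.args = data, op, args

    def __add__(self, other):
        return LazyMat(op=np.add, args=(self, other))

    def __mul__(self, other):
        return LazyMat(op=np.multiply, args=(self, other))

    def _eval_block(self, rows):
        # Recursively evaluate the DAG on one row partition, so the
        # partition flows through every fused operation in one pass.
        if self.op is None:
            return self.data[rows]
        return self.op(*(a._eval_block(rows) for a in self.args))

    def _nrows(self):
        if self.data is not None:
            return self.data.shape[0]
        return self.args[0]._nrows()

    def materialize(self, block=2):
        # One pass over the data: each row partition is loaded once and
        # the entire DAG is evaluated on it before moving on.
        n = self._nrows()
        return np.vstack([self._eval_block(slice(i, i + block))
                          for i in range(0, n, block)])

a = LazyMat(np.ones((4, 3)))
b = LazyMat(np.full((4, 3), 2.0))
c = (a + b) * a          # builds a DAG; nothing is computed yet
result = c.materialize() # 4x3 matrix of 3.0, computed partition by partition
```

The payoff of this structure is the I/O ratio the abstract mentions: a naive eager evaluator would stream the matrices from SSD once per operation, while the fused DAG touches each partition a single time regardless of how many elementwise operations it contains.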



  • Published in

    ACM SIGPLAN Notices, Volume 53, Issue 1 (PPoPP '18), January 2018, 426 pages
    ISSN: 0362-1340, EISSN: 1558-1160, DOI: 10.1145/3200691

    Also in: PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2018, 442 pages
    ISBN: 9781450349826, DOI: 10.1145/3178487

    Copyright © 2018 ACM

  • Publisher

    Association for Computing Machinery, New York, NY, United States

  • Publication History

    Published: 10 February 2018

  • Qualifiers

    research-article
