DOI: 10.5555/2228298.2228301
Article

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Published: 25 April 2012

ABSTRACT

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
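The contrast between coarse-grained transformations and fine-grained updates can be illustrated with a small single-machine sketch. This is not Spark's API: `ToyDataset` and its methods `map`, `filter`, `persist`, `drop_cache`, and `collect` are hypothetical names chosen for illustration. The sketch assumes that each dataset records only the transformation that produced it from its parent, so a lost in-memory copy can be rebuilt by replaying those transformations instead of restoring replicated state.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class ToyDataset:
    """Single-machine stand-in for an RDD-like dataset (hypothetical, not Spark)."""

    def __init__(
        self,
        source: Optional[Callable[[], Iterable]] = None,
        parent: Optional["ToyDataset"] = None,
        transform: Optional[Callable[[Iterable], Iterable]] = None,
    ) -> None:
        self._source = source        # how to re-read the base data
        self._parent = parent        # the dataset this one was derived from
        self._transform = transform  # the coarse-grained step applied to the parent
        self._cache: Optional[List] = None  # optional in-memory copy

    def map(self, f: Callable) -> "ToyDataset":
        # Record the whole-dataset operation; nothing is computed yet.
        return ToyDataset(parent=self, transform=lambda rows: (f(r) for r in rows))

    def filter(self, pred: Callable) -> "ToyDataset":
        return ToyDataset(
            parent=self, transform=lambda rows: (r for r in rows if pred(r))
        )

    def persist(self) -> "ToyDataset":
        # Materialize the records in memory for reuse.
        self._cache = list(self._compute())
        return self

    def drop_cache(self) -> None:
        # Simulate losing the in-memory copy, e.g. after a node failure.
        self._cache = None

    def _compute(self) -> Iterator:
        if self._cache is not None:
            return iter(self._cache)
        if self._parent is None:
            return iter(self._source())
        # Rebuild from the parent by replaying the recorded transformation.
        return iter(self._transform(self._parent._compute()))

    def collect(self) -> List:
        return list(self._compute())


base = ToyDataset(source=lambda: range(10))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).persist()

before = evens_squared.collect()  # served from the in-memory copy
evens_squared.drop_cache()        # the copy is lost
after = evens_squared.collect()   # recomputed from the recorded steps
```

In this sketch `before` and `after` hold the same records, because the dataset can only be changed by deriving a new dataset from it, never by updating individual elements in place. That restriction is what makes recomputation a sufficient recovery strategy here.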


Published in

NSDI'12: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
April 2012
30 pages

Publisher

USENIX Association, United States