Abstract
MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.
- Cascading. http://www.cascading.org.Google Scholar
- Hadoop. http://hadoop.apache.org.Google Scholar
- Pig. http://hadoop.apache.org/pig.Google Scholar
- R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 1 (2), 2008. Google Scholar
Digital Library
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In OSDI, 2006. Google Scholar
Digital Library
- J. Dean. Experiences with MapReduce, an abstraction for large-scale computation. In PACT, 2006. Google Scholar
Digital Library
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51, no. 1, 2008. Google Scholar
Digital Library
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004. Google Scholar
Digital Library
- S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, 2003. Google Scholar
Digital Library
- R. H. Halstead Jr. New ideas in parallel Lisp: Language design, implementation, and programming tools. In Workshop on Parallel Lisp, 1989. Google Scholar
Digital Library
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, 2007. Google Scholar
Digital Library
- J. R. Larus. C**: A large-grain, object-oriented, data-parallel programming language. In LCPC, 1992. Google Scholar
Digital Library
- C. Lasser and S. M. Omohundro. The essential Star-lisp manual. Technical Report 86.15, Thinking Machines, Inc., 1986.Google Scholar
- E. Meijer, B. Beckman, and G. Bierman. LINQ: reconciling objects, relations and XML in the .NET framework. In SIGMOD Conference, 2006. Google Scholar
Digital Library
- R. S. Nikhil and Arvind. Implicit Parallel Programming in pH. Academic Press, 2001. Google Scholar
Digital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD Conference, 2008. Google Scholar
Digital Library
- R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming, 13 (4), 2005. Google Scholar
Digital Library
- J. R. Rose and G. L. Steele Jr. C*: An extended C language. In C Workshop, 1987.Google Scholar
- H.-c. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD Conference, 2007. Google Scholar
Digital Library
- Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, 2008. Google Scholar
Digital Library
Index Terms
FlumeJava: easy, efficient data-parallel pipelines
Recommendations
FlumeJava: easy, efficient data-parallel pipelines
PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and ImplementationMapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java ...
An object-oriented approach to nested data parallelism
FRONTIERS '95: Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation (Frontiers'95)This paper describes an implementation technique for integrating nested data parallelism into an object-oriented language. Data-parallel programming employs data aggregates called "collections" and expresses parallelism as operations performed over the ...
A flexible processor mapping technique toward data localization for block-cyclic data redistribution
Array redistribution is usually needed for more efficiently executing a data-parallel program on distributed memory multicomputers. To minimize the redistribution data transfer cost, processor mapping techniques were proposed to reduce the amount of ...







Comments