ABSTRACT
Map-Reduce is a programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing jobs for search engines and machine learning.
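To make the two-function interface concrete, here is a minimal sketch using word count as the canonical example; the function names and the sequential driver are illustrative only, not the API of any particular framework.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_key, document):
    # Emit (word, 1) for every word in the input document.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum the partial counts collected for one word.
    yield (word, sum(counts))

def run_map_reduce(inputs, map_fn, reduce_fn):
    # Sequential stand-in for the framework: map, shuffle by key, reduce.
    intermediate = [kv for key, value in inputs for kv in map_fn(key, value)]
    intermediate.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        results.extend(reduce_fn(key, (v for _, v in group)))
    return results

print(run_map_reduce([(1, "a rose is a rose")], map_fn, reduce_fn))
# [('a', 2), ('is', 1), ('rose', 2)]
```

The user supplies only map_fn and reduce_fn; a real deployment replaces the sequential driver with distributed execution, partitioning, and fault tolerance.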
However, this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and/or inefficiency when Map-Reduce is applied to relational operations like joins.
We improve Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules. We also demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.
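The following sketch illustrates the idea of the added Merge step, assuming the reduce outputs of two separate Map-Reduce passes arrive as key-sorted lists; the function name merge_fn and the employee/department data are hypothetical, and duplicate join keys are ignored for brevity.

```python
def merge_fn(left, right):
    # Sort-merge join of two key-sorted (key, value) lists,
    # assuming join keys are unique on each side.
    i = j = 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            yield (lk, lv, rv)
            i += 1
            j += 1

# Outputs of two prior Map-Reduce passes, already partitioned and sorted by key.
employees = [("d1", "alice"), ("d2", "bob")]
departments = [("d1", "engineering"), ("d2", "sales")]
print(list(merge_fn(employees, departments)))
# [('d1', 'alice', 'engineering'), ('d2', 'bob', 'sales')]
```

Because both inputs are already partitioned and sorted by the map and reduce phases, the merge step can perform the join with a single linear pass per partition instead of re-shuffling the data.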