Abstract
While single-machine MapReduce systems can squeeze maximum performance out of the available multi-cores, they are often limited by the size of main memory and can thus only process small datasets. Our experience shows that Metis, the state-of-the-art single-machine in-memory MapReduce system, frequently crashes with out-of-memory errors. Even though today's computers are equipped with efficient secondary storage devices, these frameworks do not utilize them, mainly because disk access latencies are much higher than those of main memory. As a result, the single-machine setup of the Hadoop system performs much more slowly on datasets that are larger than main memory. Moreover, such frameworks require tuning many parameters, which puts an added burden on the programmer. In this paper we present OMR, an Out-of-core MapReduce system that not only successfully handles datasets far larger than main memory but also guarantees linear scaling with growing data sizes. OMR actively minimizes the amount of data read from and written to disk via on-the-fly aggregation, and it uses block-sequential disk read/write operations whenever disk accesses become necessary to avoid running out of memory. We theoretically prove OMR's linear scalability and empirically demonstrate it by processing datasets that are up to 5x larger than main memory. Our experiments show that OMR delivers far higher performance than the standalone single-machine setup of the Hadoop system. In contrast to Metis, OMR avoids out-of-memory crashes for large datasets and also delivers higher performance when datasets are small enough to fit in main memory.
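The two ideas the abstract highlights, on-the-fly aggregation to shrink map output and block-sequential spills to disk when a memory budget is exhausted, can be illustrated with a minimal word-count sketch. This is not OMR's actual implementation; the `SpillingWordCount` class, the `max_keys` proxy for a memory budget, and the pickle-based spill format are all illustrative assumptions.

```python
import pickle
import tempfile
from collections import defaultdict
from heapq import merge

class SpillingWordCount:
    """Illustrative sketch (not OMR's code): combine map output on the
    fly and, when a memory budget is exceeded, write one sorted block
    sequentially to disk; sorted spills are then merged cheaply."""

    def __init__(self, max_keys=1000):
        self.max_keys = max_keys        # stand-in for a real memory budget
        self.table = defaultdict(int)   # on-the-fly aggregation table
        self.spills = []                # spill files holding sorted runs

    def emit(self, word):
        # Aggregate immediately instead of buffering raw key-value pairs,
        # so repeated keys never consume extra memory or disk bandwidth.
        self.table[word] += 1
        if len(self.table) >= self.max_keys:
            self._spill()

    def _spill(self):
        # One block-sequential write of the sorted partial results.
        f = tempfile.TemporaryFile()
        pickle.dump(sorted(self.table.items()), f)
        f.seek(0)
        self.spills.append(f)
        self.table.clear()

    def result(self):
        self._spill()  # flush whatever remains in memory
        runs = [pickle.load(f) for f in self.spills]
        out = {}
        # Because each spill is sorted, runs merge in a single pass.
        for word, count in merge(*runs):
            out[word] = out.get(word, 0) + count
        return out
```

For example, `SpillingWordCount(max_keys=4)` fed the words `a b a c b a d e` spills one sorted block after the fourth distinct key appears and still produces the exact counts `{'a': 3, 'b': 2, 'c': 1, 'd': 1, 'e': 1}` after the merge.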
OMR: out-of-core MapReduce for large data sets
ISMM 2018: Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management