
OMR: out-of-core MapReduce for large data sets

Published: 18 June 2018

Abstract

While single-machine MapReduce systems can squeeze maximum performance out of the available multi-cores, they are often limited by the size of main memory and can therefore only process small datasets. Our experience shows that Metis, the state-of-the-art single-machine in-memory MapReduce system, frequently crashes with out-of-memory errors. Even though today's computers are equipped with efficient secondary storage devices, existing frameworks do not utilize these devices, mainly because disk access latencies are much higher than those of main memory. Consequently, the single-machine setup of the Hadoop system performs much slower when presented with datasets larger than main memory. Moreover, such frameworks require tuning many parameters, which puts an added burden on the programmer. In this paper we present OMR, an out-of-core MapReduce system that not only successfully handles datasets far larger than main memory but also guarantees linear scaling with growing data sizes. OMR actively minimizes the amount of data read from and written to disk via on-the-fly aggregation, and it uses block-sequential disk read/write operations whenever disk accesses become necessary to avoid running out of memory. We theoretically prove OMR's linear scalability and empirically demonstrate it by processing datasets up to 5x larger than main memory. Our experiments show that OMR delivers far higher performance than the standalone single-machine setup of the Hadoop system. In contrast to Metis, OMR avoids out-of-memory crashes on large datasets, and it delivers higher performance even when datasets are small enough to fit in main memory.
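The two ideas the abstract names can be illustrated with a minimal word-count sketch: on-the-fly aggregation combines values for a key as soon as they are emitted rather than buffering every intermediate pair, and when the in-memory table exceeds a budget it is written out as one sorted run with a single sequential write, then merged back in the reduce phase. This is an illustrative sketch only, not OMR's actual implementation; all names (`MEMORY_BUDGET`, `spill`, `out_of_core_wordcount`) are assumptions made for the example.

```python
import os
import pickle
from collections import defaultdict

MEMORY_BUDGET = 4  # max distinct keys held in memory (tiny, for demonstration)

def out_of_core_wordcount(lines, workdir):
    table = defaultdict(int)   # on-the-fly aggregation: one running count per key
    spills = []

    def spill():
        # Write the table as one sorted run using a single block-sequential
        # write, then free the memory it occupied.
        path = os.path.join(workdir, f"spill-{len(spills)}.bin")
        with open(path, "wb") as f:
            pickle.dump(sorted(table.items()), f)
        spills.append(path)
        table.clear()

    # Map phase: aggregate immediately (acts like a running combiner),
    # spilling to disk only when the memory budget is exceeded.
    for line in lines:
        for word in line.split():
            table[word] += 1
            if len(table) > MEMORY_BUDGET:
                spill()
    if table:
        spill()

    # Reduce phase: each run is read back with a single sequential read
    # and merged into the final result.
    result = defaultdict(int)
    for path in spills:
        with open(path, "rb") as f:
            for key, value in pickle.load(f):
                result[key] += value
    return dict(result)
```

With `MEMORY_BUDGET = 4`, calling `out_of_core_wordcount(["a b a c", "b a d e f g a"], tmpdir)` spills twice and still returns the correct total `{"a": 4, "b": 2, "c": 1, "d": 1, "e": 1, "f": 1, "g": 1}`: the disk only ever sees pre-aggregated, sorted blocks, never the raw intermediate pairs.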

