Abstract

MapReduce has been widely accepted as a simple programming pattern that can form the basis for efficient, large-scale, distributed data processing. The success of the MapReduce pattern has led to a variety of implementations for different computational scenarios. In this paper we present MRJ, a MapReduce Java framework for multi-core architectures. We evaluate its scalability on a four-core, hyperthreaded Intel Core i7 processor, using a set of standard MapReduce benchmarks. We investigate the significant impact that Java runtime garbage collection has on the performance and scalability of MRJ. We propose the use of memory management auto-tuning techniques based on machine learning. With our auto-tuning approach, we are able to achieve MRJ performance within 10% of optimal on 75% of our benchmark tests.
- Hadoop: Open source implementation of mapreduce. http://lucene.apache.org/hadoop/.Google Scholar
- Jsr-166, prelease of java.util.concurrent package. http://gee.cs.oswego.edu/dl/concurrency-interest/index.html.Google Scholar
- Organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy.Google Scholar
- E. Allen, D. Chase, C. Flood, V. Luchangco, J.-W. Maessen, S. Ryu, and G. L. Steele Jr. Project Fortress: A multicore language for multicore processors. Linux Magazine, Sep 2007.Google Scholar
- M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. A view of cloud computing. Communications of the ACM, 53:50--58, April 2010. Google Scholar
Digital Library
- M. Bhadauria, V. M. Weaver, and S. A. McKee. Understanding PARSEC performance on contemporary CMPs. In Proceedings of the 2009 International Symposium on Workload Characterization, October 2009. Google Scholar
Digital Library
- R. Blumofe, C. Joerg, B. Kuszmaul, C. Leiserson, K. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207--216, 1995. Google Scholar
Digital Library
- L. Breiman. Random forests. Machine Learning, 45(1):5--32, 2001. Google Scholar
Digital Library
- P. Charles, C. Donawa, K. Ebcioglu, C. Grothoff, A. Kielstra, C. von Praun, V. Saraswat, and V. Sarkar. X10: An object-oriented approach to non-uniform clustered computing. In Proceedings of OOPSLA 2005, 2005. Google Scholar
Digital Library
- C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems, page 281, 2007.Google Scholar
- S. Daruru, N. M. Marin, M. Walker, and J. Ghosh. Pervasive parallelism in data mining: dataflow solution to co-clustering large and sparse netflix data. In KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1115--1124, 2009. Google Scholar
Digital Library
- M. de Kruijf and K. Sankaralingam. MapReduce for the CELL B.E. architecture. IBM Journal of Research and Development, 53(5), 2009. Google Scholar
Digital Library
- J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th symposium on operating systems design and implementation, pages 137--150, 2004. Google Scholar
Digital Library
- J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107--113, 2008. Google Scholar
Digital Library
- D. DeWitt and M. Stonebraker. MapReduce: A major step backwards. The Database Column, 2008.Google Scholar
- J. Ekanayake, S. Pallickara, and G. Fox. MapReduce for data intensive scientific analyses. In Fourth IEEE International Conference on eScience, pages 277--284, 2008. Google Scholar
Digital Library
- R. Fitzgerald and D. Tarditi. The case for profile-directed selection of garbage collectors. In Proceedings of the 2nd International Symposium on Memory Management, pages 111--120, 2000. Google Scholar
Digital Library
- Google. Appengine. http://code.google.com/appengine/.Google Scholar
- M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009. Google Scholar
Digital Library
- M. Hauswirth, P. Sweeney, A. Diwan, and M. Hind. Vertical profiling: understanding the behavior of object-oriented applications. ACM SIGPLAN Notices, 39(10):251--269, 2004. Google Scholar
Digital Library
- B. He, W. Fang, Q. Luo, N. Govindaraju, and T. Wang. Mars: a MapReduce framework on graphics processors. In Proceedings of the 17th international conference on parallel architectures and compilation techniques, 2008. Google Scholar
Digital Library
- Intel. Single-chip cloud computer, 2009. http://techresearch.intel.com/UserFiles/en-us/File/terascale/SCC-Overview.pdf.Google Scholar
- G. Kovoor, J. Singer, and M. Lujan. Building a Java mapreduce framework for multi-core architectures. In Proceedings of the Third Workshop on Programmability Issues for Multi-Core Computers, pages 87--98, 2010.Google Scholar
- D. Lea. A java fork/join framework. In Proceedings of the ACM conference on Java Grande, 2000. Google Scholar
Digital Library
- D. Leijen, W. Schulte, and S. Burckhardt. The design of a task parallel library. In Proceeding of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications, pages 227--242, 2009. Google Scholar
Digital Library
- F. Mao and X. Shen. Cross-input learning and discriminative prediction in evolvable virtual machines. In Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 92--101, 2009. Google Scholar
Digital Library
- F. Mao, E. Z. Zhang, and X. Shen. Influence of program inputs on the selection of garbage collectors. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS international conference on Virtual execution environments, pages 91--100, 2009. Google Scholar
Digital Library
- T. Printezis. Hot-swapping between a mark&sweep and a mark&compact garbage collector in a generational environment. In Proceedings of the 1st Java Virtual Machine Research and Technology Symposium, pages 171--184, 2001. Google Scholar
Digital Library
- T. Printezis and D. Detlefs. A generational mostly-concurrent garbage collector. In Proceedings of the 2nd international symposium on Memory management, pages 143--154, 2000. Google Scholar
Digital Library
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81--106, 1986. Google Scholar
Digital Library
- C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating mapreduce for multi-core and multiprocessor systems. In Proceedings of the 13th International Symposium on High Performance Computer Architecture, pages 13--24, 2007. Google Scholar
Digital Library
- R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101--142, 2004. Google Scholar
Digital Library
- X. Shen, F. Mao, K. Tian, and E. Z. Zhang. The study and handling of program inputs in the selection of garbage collectors. ACM SIGOPS Operating Systems Review, 43:48--61, July 2009. Google Scholar
Digital Library
- J. Singer, G. Brown, I. Watson, and J. Cavazos. Intelligent selection of application-specific garbage collectors. In Proceedings of the 6th International Symposium on Memory Management, pages 91--102, Oct 2007. Google Scholar
Digital Library
- S. Soman, C. Krintz, and D. F. Bacon. Dynamic selection of application-specific garbage collectors. In Proceedings of the 4th International Symposium on Memory Management, pages 49--60, 2004. Google Scholar
Digital Library
- P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 511--518, 2001.Google Scholar
Cross Ref
- H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 1029--1040, 2007. Google Scholar
Digital Library
- R. Yoo, A. Romano, and C. Kozyrakis. Phoenix rebirth: Scalable MapReduce on a NUMA system. In Proceedings of the International Symposium on Workload Characterization, 2009. Google Scholar
Digital Library
- C. Zhang and M. Hirzel. Online phase-adaptive data layout selection. In ECOOP 2008 Object-Oriented Programming, pages 309--334, 2008. Google Scholar
Digital Library
- Y. Zhao, J. Shi, K. Zheng, H. Wang, H. Lin, and L. Shao. Allocation wall: a limiting factor of Java applications on emerging multi-core platforms. ACM SIGPLAN Notices, 44(10):361--376, 2009. Google Scholar
Digital Library
Index Terms
Garbage collection auto-tuning for Java mapreduce on multi-cores
Recommendations
Garbage collection auto-tuning for Java mapreduce on multi-cores
ISMM '11: Proceedings of the international symposium on Memory managementMapReduce has been widely accepted as a simple programming pattern that can form the basis for efficient, large-scale, distributed data processing. The success of the MapReduce pattern has led to a variety of implementations for different computational ...
Controlling garbage collection and heap growth to reduce the execution time of Java applications
In systems that support garbage collection, a tension exists between collecting garbage too frequently and not collecting it frequently enough. Garbage collection that occurs too frequently may introduce unnecessary overheads at the risk of not ...
Concurrent, parallel, real-time garbage-collection
ISMM '10: Proceedings of the 2010 international symposium on Memory managementWith the current developments in CPU implementations, it becomes obvious that ever more parallel multicore systems will be used even in embedded controllers that require real-time guarantees. When garbage collection is used in these systems, parallel ...







Comments