Abstract
Over the past decade, the increasing demand for data-driven business intelligence has led to the proliferation of large-scale, data-intensive applications that must process huge amounts of data, often at terabyte or petabyte scale. An object-oriented programming language such as Java is often the developer's choice for implementing such applications, primarily because of its quick development cycle and rich community resources. While such languages make programming easier, they often cause significant performance problems: the inefficiencies inherent in a managed run-time system, combined with the huge amount of data to be processed in limited memory, often lead to memory bloat and performance degradation at a surprisingly early stage.
This paper proposes a bloat-aware design paradigm for developing efficient and scalable Big Data applications in object-oriented, GC-enabled languages. To motivate this work, we first study the impact of several typical memory bloat patterns, which we summarize from user complaints on the mailing lists of two widely used open-source Big Data applications. Next, we discuss our design paradigm to eliminate bloat. Using examples and real-world experience, we demonstrate that programming under this paradigm does not incur a significant programming burden. We have implemented a few common data processing tasks using both this design and the conventional object-oriented design. Our experimental results show that the new design paradigm is extremely effective in improving performance: even for the moderate-size data sets processed, we observed performance gains of 2.5x or more, and the improvement grows substantially with the size of the data set.
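The abstract does not spell out how the paradigm works, so the following is only an illustrative sketch of the kind of bloat at issue, not the paper's method. It contrasts a conventional one-object-per-record layout with a layout that packs the same records into a single buffer, so the per-object headers and references disappear and the collector sees one object instead of many. All names here (`BloatSketch`, `Point`, `PackedPoints`) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class BloatSketch {
    // Conventional design: every record is its own heap object, so each one
    // carries an object header and is reached through a reference in the list.
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Packed design (an assumption for illustration): all records live in one
    // buffer, 8 bytes per record, with accessor methods instead of fields.
    static final class PackedPoints {
        private static final int RECORD_BYTES = 2 * Integer.BYTES;
        private final ByteBuffer buf;
        private int count;

        PackedPoints(int capacity) {
            buf = ByteBuffer.allocate(capacity * RECORD_BYTES);
        }

        // Appends a record; throws IndexOutOfBoundsException past capacity.
        void add(int x, int y) {
            int base = count * RECORD_BYTES;
            buf.putInt(base, x);
            buf.putInt(base + Integer.BYTES, y);
            count++;
        }

        int x(int i) {
            return buf.getInt(i * RECORD_BYTES);
        }

        int y(int i) {
            return buf.getInt(i * RECORD_BYTES + Integer.BYTES);
        }

        int size() {
            return count;
        }
    }

    // Sums x + y over all records in the object-per-record layout.
    static long sumObjects(List<Point> pts) {
        long total = 0;
        for (Point p : pts) {
            total += p.x + p.y;
        }
        return total;
    }

    // Same computation over the packed layout; no per-record objects exist.
    static long sumPacked(PackedPoints pts) {
        long total = 0;
        for (int i = 0; i < pts.size(); i++) {
            total += pts.x(i) + pts.y(i);
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1_000;
        List<Point> objects = new ArrayList<>(n);
        PackedPoints packed = new PackedPoints(n);
        for (int i = 0; i < n; i++) {
            objects.add(new Point(i, 2 * i));
            packed.add(i, 2 * i);
        }
        // Both layouts hold the same data and produce the same result.
        System.out.println(sumObjects(objects) == sumPacked(packed));
    }
}
```

Both layouts compute the same answer; the difference is in how many heap objects the run-time system has to allocate, track, and collect as the number of records grows.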
A bloat-aware design for big data applications
ISMM '13: Proceedings of the 2013 International Symposium on Memory Management