Abstract
The past decade has witnessed increasing demand for data-driven business intelligence, which has led to a proliferation of data-intensive applications. A managed object-oriented programming language such as Java is often the developer's choice for implementing such applications, due to its quick development cycle and rich community resources. While the use of such languages makes programming easier, their automated memory management comes at a cost. When the managed runtime meets Big Data, this cost is significantly magnified and becomes a scalability-prohibiting bottleneck. This paper presents a novel compiler framework, called Facade, that can generate highly efficient data manipulation code by automatically transforming the data path of an existing Big Data application. The key treatment is that in the generated code, the number of runtime heap objects created for data types in each thread is (almost) statically bounded, leading to significantly reduced memory management cost and improved scalability. We have implemented Facade and used it to transform 7 common applications on 3 real-world, already well-optimized Big Data frameworks: GraphChi, Hyracks, and GPS. Our experimental results are very positive: the generated programs have (1) achieved a 3% to 48% execution time reduction and up to an 88× GC reduction; (2) consumed up to 50% less memory; and (3) scaled to much larger datasets.
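To make the idea of a bounded number of heap objects concrete, the following is a minimal sketch and not the code that Facade generates. It assumes that records are laid out in a flat byte buffer and that a single reusable accessor object is rebound to each record in turn. The names `BoundedObjectsSketch`, `VertexView`, `bind`, and `sumValues`, as well as the record layout, are hypothetical and chosen only for illustration.

```java
import java.nio.ByteBuffer;

/** Illustrative only: records live in one flat buffer, and one view object reads them all. */
public class BoundedObjectsSketch {
    // Assumed record layout: a 4-byte int id followed by an 8-byte double value.
    static final int RECORD_BYTES = Integer.BYTES + Double.BYTES;

    /** Reusable accessor; a single instance serves every record in the buffer. */
    static final class VertexView {
        private final ByteBuffer page;
        private int offset;

        VertexView(ByteBuffer page) {
            this.page = page;
        }

        /** Points this view at the record with the given index; allocates nothing. */
        VertexView bind(int index) {
            this.offset = index * RECORD_BYTES;
            return this;
        }

        double value() {
            return page.getDouble(offset + Integer.BYTES);
        }

        void set(int id, double value) {
            page.putInt(offset, id);
            page.putDouble(offset + Integer.BYTES, value);
        }
    }

    /** Writes n records and sums their values using one buffer and one view object. */
    public static double sumValues(int n) {
        ByteBuffer page = ByteBuffer.allocate(n * RECORD_BYTES);
        VertexView view = new VertexView(page);
        for (int i = 0; i < n; i++) {
            view.bind(i).set(i, i * 0.5);
        }
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            total += view.bind(i).value();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumValues(1_000));
    }
}
```

In this sketch the number of objects allocated for the data type stays at one regardless of `n`, whereas a conventional design would create one object per record and leave all of them for the garbage collector to trace.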
FACADE: A Compiler and Runtime for (Almost) Object-Bounded Big Data Applications
ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems