research-article

FACADE: A Compiler and Runtime for (Almost) Object-Bounded Big Data Applications

Published: 14 March 2015

Abstract

Over the past decade, growing demand for data-driven business intelligence has led to a proliferation of data-intensive applications. Developers often implement such applications in a managed object-oriented language such as Java because of its quick development cycle and rich community resources. Although such languages make programming easier, their automated memory management comes at a cost. When a managed runtime meets Big Data, this cost is significantly magnified and becomes a bottleneck that prohibits scaling. This paper presents Facade, a novel compiler framework that generates highly efficient data manipulation code by automatically transforming the data path of an existing Big Data application. The key property of the generated code is that the number of runtime heap objects created for data types in each thread is (almost) statically bounded, which significantly reduces memory management cost and improves scalability. We have implemented Facade and used it to transform 7 common applications on 3 real-world, already well-optimized Big Data frameworks: GraphChi, Hyracks, and GPS. Our experimental results are very positive: the generated programs (1) achieved a 3%–48% reduction in execution time and up to an 88X GC reduction, (2) consumed up to 50% less memory, and (3) scaled to much larger datasets.
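To make the idea of a statically bounded number of heap objects concrete, here is a minimal hand-written sketch in Java. It is not the code Facade generates, and the abstract does not describe the mechanism at this level of detail. The sketch assumes that record data is kept in a single buffer rather than as one heap object per record, and that one reusable view object is rebound to each record in turn. All names (`BoundedFacadeSketch`, `VertexStore`, `VertexFacade`, `sumRanks`) and the record layout are hypothetical.

```java
import java.nio.ByteBuffer;

// Illustrative only: a hand-written approximation of "bounded heap objects".
// Every name and the record layout below are assumptions for this sketch.
final class BoundedFacadeSketch {
    // Hypothetical record layout: [int id][double rank] = 12 bytes per record.
    static final int RECORD_BYTES = Integer.BYTES + Double.BYTES;

    /** Holds all records in one buffer instead of one heap object per record. */
    static final class VertexStore {
        private final ByteBuffer page;
        private int count = 0;

        VertexStore(int capacity) {
            page = ByteBuffer.allocateDirect(capacity * RECORD_BYTES);
        }

        /** Appends a record and returns its index, used here as a reference. */
        int add(int id, double rank) {
            int ref = count++;
            page.putInt(ref * RECORD_BYTES, id);
            page.putDouble(ref * RECORD_BYTES + Integer.BYTES, rank);
            return ref;
        }

        int size() {
            return count;
        }
    }

    /** A reusable view: a single instance can stand in for any record. */
    static final class VertexFacade {
        private final VertexStore store;
        private int ref;

        VertexFacade(VertexStore store) {
            this.store = store;
        }

        /** Points this view at another record without allocating anything. */
        VertexFacade bind(int ref) {
            this.ref = ref;
            return this;
        }

        int id() {
            return store.page.getInt(ref * RECORD_BYTES);
        }

        double rank() {
            return store.page.getDouble(ref * RECORD_BYTES + Integer.BYTES);
        }

        void setRank(double rank) {
            store.page.putDouble(ref * RECORD_BYTES + Integer.BYTES, rank);
        }
    }

    /** Sums all ranks using one view object, however many records exist. */
    static double sumRanks(VertexStore store) {
        VertexFacade view = new VertexFacade(store); // the only allocation here
        double total = 0.0;
        for (int i = 0; i < store.size(); i++) {
            total += view.bind(i).rank();
        }
        return total;
    }

    public static void main(String[] args) {
        VertexStore store = new VertexStore(1_000);
        for (int i = 0; i < 1_000; i++) {
            store.add(i, 0.5);
        }
        System.out.println(sumRanks(store));
    }
}
```

In this sketch the loop in `sumRanks` allocates one `VertexFacade` no matter how many records the store holds, so the garbage collector has far fewer objects to trace than it would with one object per record. How Facade itself bounds object counts across types and threads is described in the paper, not here.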



Published in

ACM SIGPLAN Notices, Volume 50, Issue 4 (ASPLOS '15), April 2015, 676 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2775054
Editor: Andy Gill

ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, March 2015, 720 pages
ISBN: 9781450328357
DOI: 10.1145/2694344

Copyright © 2015 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
