research-article

A bloat-aware design for big data applications

Published: 20 June 2013

Abstract

Over the past decade, the increasing demand for data-driven business intelligence has led to the proliferation of large-scale, data-intensive applications that must process huge amounts of data, often at terabyte or petabyte scale. An object-oriented programming language such as Java is often the developer's choice for implementing such applications, primarily because of its quick development cycle and rich community resources. While such languages make programming easier, significant performance problems are common: the inefficiencies inherent in a managed run-time system, combined with the huge amount of data to be processed in limited memory, often lead to memory bloat and performance degradation at a surprisingly early stage.

This paper proposes a bloat-aware design paradigm for developing efficient and scalable Big Data applications in object-oriented, GC-enabled languages. To motivate this work, we first study the impact of several typical memory bloat patterns, which are summarized from user complaints on the mailing lists of two widely used open-source Big Data applications. Next, we discuss our design paradigm for eliminating bloat. Using examples and real-world experience, we demonstrate that programming under this paradigm does not incur a significant programming burden. We have implemented a few common data processing tasks using both this design and the conventional object-oriented design. Our experimental results show that the new design paradigm is extremely effective in improving performance: even for the moderate-size data sets processed, we observed performance gains of 2.5x or more, and the improvement grows substantially with the size of the data set.


    • Published in

      ACM SIGPLAN Notices, Volume 48, Issue 11
      ISMM '13
      November 2013
      128 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/2555670
      • ISMM '13: Proceedings of the 2013 International Symposium on Memory Management
        June 2013
        140 pages
        ISBN: 9781450321006
        DOI: 10.1145/2491894

      Copyright © 2013 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States
