Research Article | Free Access

Analysis and Optimization of Task Granularity on the Java Virtual Machine

Published: 16 July 2019

Abstract

Task granularity, i.e., the amount of work performed by parallel tasks, is a key performance attribute of parallel applications. On the one hand, fine-grained tasks (i.e., small tasks carrying out few computations) may introduce considerable parallelization overheads. On the other hand, coarse-grained tasks (i.e., large tasks performing substantial computations) may not fully utilize the available CPU cores, leading to missed parallelization opportunities. In this article, we provide a better understanding of task granularity for task-parallel applications running on a single Java Virtual Machine on a shared-memory multicore. We present a new methodology to accurately and efficiently collect the granularity of each executed task, implemented in a novel profiler (available open-source) that collects carefully selected metrics from the whole system stack with low overhead and helps developers locate performance and scalability problems. We analyze task granularity in the DaCapo, ScalaBench, and Spark Perf benchmark suites, revealing inefficiencies related to fine-grained and coarse-grained tasks in several applications. We demonstrate that the collected task-granularity profiles are actionable by optimizing task granularity in several applications, achieving speedups of up to 5.90×. Our results highlight the importance of analyzing and optimizing task granularity on the Java Virtual Machine.
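The fine-grained/coarse-grained trade-off described above can be illustrated with a small sketch (not taken from the article's profiler or benchmarks). In the fork/join style common on the JVM, a cutoff threshold decides whether a task computes sequentially or splits into subtasks: a small `THRESHOLD` yields many fine-grained tasks with higher forking overhead, while a large one yields few coarse-grained tasks that may leave cores idle. The class name and the threshold value here are hypothetical, chosen only for the example.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative only: parallel array sum whose task granularity is
// controlled by THRESHOLD. Smaller THRESHOLD => finer-grained tasks
// (more fork/join overhead); larger THRESHOLD => coarser-grained tasks
// (potentially idle cores).
public class GranularitySum extends RecursiveTask<Long> {
    static final int THRESHOLD = 10_000; // granularity knob, arbitrary here

    final long[] data;
    final int lo, hi; // half-open range [lo, hi)

    GranularitySum(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            // Range is small enough: compute sequentially.
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        // Otherwise split into two subtasks of half the size.
        int mid = (lo + hi) >>> 1;
        GranularitySum left = new GranularitySum(data, lo, mid);
        GranularitySum right = new GranularitySum(data, mid, hi);
        left.fork();                          // schedule left half asynchronously
        return right.compute() + left.join(); // compute right half here, then join
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        long sum = ForkJoinPool.commonPool()
                               .invoke(new GranularitySum(data, 0, data.length));
        System.out.println(sum); // 499999500000
    }
}
```

Tuning `THRESHOLD` on a given machine is exactly the kind of decision that task-granularity profiles are meant to inform.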

