Abstract
Task granularity, i.e., the amount of work performed by parallel tasks, is a key performance attribute of parallel applications. On the one hand, fine-grained tasks (i.e., small tasks carrying out few computations) may introduce considerable parallelization overheads. On the other hand, coarse-grained tasks (i.e., large tasks performing substantial computations) may not fully utilize the available CPU cores, leading to missed parallelization opportunities. In this article, we provide a better understanding of task granularity for task-parallel applications running on a single Java Virtual Machine in a shared-memory multicore. We present a new methodology to accurately and efficiently collect the granularity of each executed task, implemented in a novel profiler (available open-source) that collects carefully selected metrics from the whole system stack with low overhead, and helps developers locate performance and scalability problems. We analyze task granularity in the DaCapo, ScalaBench, and Spark Perf benchmark suites, revealing inefficiencies related to fine-grained and coarse-grained tasks in several applications. We demonstrate that the collected task-granularity profiles are actionable by optimizing task granularity in several applications, achieving speedups up to a factor of 5.90×. Our results highlight the importance of analyzing and optimizing task granularity on the Java Virtual Machine.
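The trade-off described above — fine-grained tasks paying parallelization overhead, coarse-grained tasks underutilizing cores — can be sketched with a standard fork/join computation whose cutoff threshold controls task size. The class and the `THRESHOLD` constant below are illustrative only; they are not part of the article's profiler.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch: divide-and-conquer array sum on the JVM's fork/join framework.
// THRESHOLD is a hypothetical cutoff: raising it coarsens tasks (fewer, larger
// leaves, less scheduling overhead); lowering it creates many fine-grained
// tasks (more parallelism, more overhead).
public class GranularitySketch {
    static final int THRESHOLD = 1_000;

    static class SumTask extends RecursiveTask<Long> {
        final long[] a;
        final int lo, hi;

        SumTask(long[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

        @Override
        protected Long compute() {
            if (hi - lo <= THRESHOLD) {
                // Below the cutoff: compute sequentially (a coarse-grained leaf).
                long sum = 0;
                for (int i = lo; i < hi; i++) sum += a[i];
                return sum;
            }
            // Above the cutoff: split into two subtasks.
            int mid = (lo + hi) >>> 1;
            SumTask left = new SumTask(a, lo, mid);
            left.fork();                                  // schedule left half
            long right = new SumTask(a, mid, hi).compute(); // compute right half
            return right + left.join();
        }
    }

    public static long parallelSum(long[] a) {
        return ForkJoinPool.commonPool().invoke(new SumTask(a, 0, a.length));
    }

    public static void main(String[] args) {
        long[] data = new long[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println(parallelSum(data)); // prints 4999950000
    }
}
```

Several of the cited works (e.g., Duran et al. 2008, Iwasaki and Taura 2016) study how to choose such a cutoff adaptively rather than fixing it by hand.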
- Umut A. Acar, Arthur Charguéraud, and Mike Rainey. 2011. Oracle scheduling: Controlling granularity in implicitly parallel languages. In OOPSLA. 499--518.
- Gul Agha. 1986. Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press.
- Glenn Ammons, Thomas Ball, and James R. Larus. 1997. Exploiting hardware performance counters with flow and context sensitive profiling. In PLDI. 85--96.
- Jianmin Bi, Xiaofei Liao, Yu Zhang, Chencheng Ye, Hai Jin, and Laurence T. Yang. 2014. An adaptive task granularity based scheduling for task-centric parallelism. In HPCC. 165--172.
- Walter Binder, Jarle Hulaas, and Philippe Moret. 2007. Advanced Java bytecode instrumentation. In PPPJ. 135--144.
- Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khang, Kathryn S. McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. 2006. The DaCapo benchmarks: Java benchmarking development and analysis. In OOPSLA. 169--190.
- Feng Chen, Traian Florin Serbanuta, and Grigore Rosu. 2008. jPredictor: A predictive runtime analysis tool for Java. In ICSE. 221--230.
- Kuo-Yi Chen, J. Morris Chang, and Ting-Wei Hou. 2011. Multithreading in Java: Performance and scalability on multicore systems. IEEE Trans. Comput. 60, 11 (2011), 1521--1534.
- Guojing Cong, Sreedhar Kodali, Sriram Krishnamoorthy, Doug Lea, Vijay Saraswat, and Tong Wen. 2008. Solving large, irregular graph problems using adaptive work-stealing. In ICPP. 536--545.
- Databricks. 2015. Spark Performance Tests. Retrieved from https://github.com/databricks/spark-perf.
- Florian David, Gael Thomas, Julia Lawall, and Gilles Muller. 2014. Continuously measuring critical section pressure with the free-lunch profiler. In OOPSLA. 291--307.
- Bruno Dufour, Karel Driesen, Laurie Hendren, and Clark Verbrugge. 2003. Dynamic metrics for Java. In OOPSLA. 149--168.
- Alejandro Duran, Julita Corbalán, and Eduard Ayguadé. 2008. An adaptive cut-off for task parallelism. In SC. 1--11.
- H2. 2018. H2 Database Engine. Retrieved from http://www.h2database.com.
- Kevin Hammond, Hans-Wolfgang Loidl, and Andrew S. Partridge. 1995. Visualising granularity in parallel programs: A graphical winnowing system for Haskell. In HPFC. 208--221.
- Matthias Hauswirth, Peter F. Sweeney, Amer Diwan, and Michael Hind. 2004. Vertical profiling: Understanding the behavior of object-oriented applications. In OOPSLA. 251--269.
- Yuxiong He, Charles E. Leiserson, and William M. Leiserson. 2010. The Cilkview scalability analyzer. In SPAA. 145--156.
- Carl Hewitt, Peter Bishop, and Richard Steiger. 1973. A universal modular ACTOR formalism for artificial intelligence. In IJCAI. 235--245.
- Lorenz Huelsbergen, James R. Larus, and Alexander Aiken. 1994. Using the run-time sizes of data structures to guide parallel-thread creation. In LFP. 79--90.
- IBM. 2007. DayTrader. Retrieved from https://www.ibm.com/support/knowledgecenter/en/linuxonibm/liaag/wascrypt/l0wscry00_daytrader.htm.
- ICL. 2017. PAPI. Retrieved from http://icl.utk.edu/papi/.
- Hiroshi Inoue and Toshio Nakatani. 2009. How a Java VM can get more from a hardware performance monitor. In OOPSLA. 137--154.
- Shintaro Iwasaki and Kenjiro Taura. 2016. Autotuning of a cut-off for task parallel programs. In MCSoC. 353--360.
- Joseph JaJa. 1992. Introduction to Parallel Algorithms. Addison-Wesley Professional.
- Stephen Kell, Danilo Ansaloni, Walter Binder, and Lukáš Marek. 2012. The JVM is not observable enough (and what to do about it). In VMIL. 33--38.
- Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Lopes, Jean-Marc Loingtier, and John Irwin. 1997. Aspect-oriented programming. In ECOOP. 220--242.
- Clyde P. Kruskal and Carl H. Smith. 1988. On the notion of granularity. J. Supercomput. 1, 4 (1988), 395--408.
- Vivek Kumar, Daniel Frampton, Stephen M. Blackburn, David Grove, and Olivier Tardieu. 2012. Work-stealing without the baggage. In OOPSLA. 297--314.
- Philipp Lengauer, Verena Bitto, Hanspeter Mössenböck, and Markus Weninger. 2017. A comprehensive Java benchmark study on memory and garbage collection behavior of DaCapo, DaCapo Scala, and SPECjvm2008. In ICPE. 3--14.
- Jonathan Lifflander, Sriram Krishnamoorthy, and Laxmikant V. Kale. 2014. Optimizing data locality for fork/join programs using constrained work stealing. In SC. 857--868.
- Linux man. 2013. top(1). Retrieved from https://linux.die.net/man/1/top.
- Linux man. 2018. Documentation of CLOCK_MONOTONIC in clock_gettime(). Retrieved from https://linux.die.net/man/3/clock_gettime.
- Pedro Lopez, Manuel Hermenegildo, and Saumya K. Debray. 1996. A methodology for granularity-based control of parallelism in logic programs. J. Symbolic Comput. 21, 4 (1996), 715--734.
- Lukáš Marek, Stephen Kell, Yudi Zheng, Lubomír Bulej, Walter Binder, Petr Tůma, Danilo Ansaloni, Aibek Sarimbekov, and Andreas Sewe. 2013. ShadowVM: Robust and comprehensive dynamic program analysis for the Java platform. In GPCE. 105--114.
- Lukáš Marek, Alex Villazón, Yudi Zheng, Danilo Ansaloni, Walter Binder, and Zhengwei Qi. 2012. DiSL: A domain-specific language for bytecode instrumentation. In AOSD. 239--250.
- Eric Mohr, David A. Kranz, and Robert H. Halstead Jr. 1991. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Trans. Parallel Distrib. Syst. 2, 3 (1991), 264--280.
- Philippe Moret, Walter Binder, and Alex Villazón. 2009. CCCP: Complete calling context profiling in virtual execution environments. In PEPM. 151--160.
- Ananya Muddukrishna, Peter A. Jonsson, Artur Podobas, and Mats Brorsson. 2016. Grain graphs: OpenMP performance analysis made easy. In PPoPP. 28:1--28:13.
- Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. 2010. Evaluating the accuracy of Java profilers. In PLDI. 187--197.
- Albert Noll and Thomas Gross. 2013. Online feedback-directed optimizations for parallel Java code. In OOPSLA. 713--728.
- Oracle. 2017. Documentation of System.nanoTime(). Retrieved from https://docs.oracle.com/javase/9/docs/api/java/lang/System.html.
- Oracle. 2017. Java Native Interface. Retrieved from https://docs.oracle.com/javase/9/docs/specs/jni/index.html.
- Oracle. 2017. Java Platform, Standard Edition & Java Development Kit Version 9 API Specification. Retrieved from https://docs.oracle.com/javase/9/docs/api/.
- Oracle. 2017. Java Virtual Machine Tool Interface (JVM TI). Retrieved from https://docs.oracle.com/javase/9/docs/specs/jvmti.html.
- Oracle. 2017. The Parallel Collector. Retrieved from https://docs.oracle.com/javase/9/gctuning/parallel-collector1.htm.
- Oracle. 2017. ExecutorService. Retrieved from https://docs.oracle.com/javase/9/docs/api/java/util/concurrent/ExecutorService.html.
- Oracle. 2017. ForkJoinPool. Retrieved from https://docs.oracle.com/javase/9/docs/api/java/util/concurrent/ForkJoinPool.html.
- Oracle. 2017. ThreadPoolExecutor. Retrieved from https://docs.oracle.com/javase/9/docs/api/java/util/concurrent/ThreadPoolExecutor.html.
- perf. 2015. Linux profiling with performance counters. Retrieved from https://perf.wiki.kernel.org.
- Andrea Rosà and Walter Binder. 2018. Optimizing type-specific instrumentation on the JVM with reflective supertype information. J. Visual Lang. Comput. 49 (2018), 29--45.
- Andrea Rosà, Lydia Y. Chen, and Walter Binder. 2016. Actor profiling in virtual execution environments. SIGPLAN Not. 52, 3 (Oct. 2016), 36--46.
- Andrea Rosà, Eduardo Rosales, and Walter Binder. 2017. Accurate reification of complete supertype information for dynamic analysis on the JVM. SIGPLAN Not. 52, 12 (Oct. 2017), 104--116.
- Andrea Rosà, Eduardo Rosales, and Walter Binder. 2018. Analyzing and optimizing task granularity on the JVM. In CGO. 27--37.
- Eduardo Rosales, Andrea Rosà, and Walter Binder. 2017. tgp: A task-granularity profiler for the Java virtual machine. In APSEC. 570--575.
- Sandy Ryza. 2015. How-to: Tune Your Apache Spark Jobs (Part 1). Retrieved from http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/.
- Sandy Ryza. 2015. How-to: Tune Your Apache Spark Jobs (Part 2). Retrieved from http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/.
- Aibek Sarimbekov, Andreas Sewe, Walter Binder, Philippe Moret, and Mira Mezini. 2014. JP2: Call-site aware calling context profiling for the Java virtual machine. Sci. Comput. Program. 79 (Jan. 2014), 146--157.
- Tao B. Schardl, Bradley C. Kuszmaul, I-Ting Angelina Lee, William M. Leiserson, and Charles E. Leiserson. 2015. The Cilkprof scalability profiler. In SPAA. 89--100.
- Andreas Sewe, Mira Mezini, Aibek Sarimbekov, and Walter Binder. 2011. Da Capo con Scala: Design and analysis of a Scala benchmark suite for the Java virtual machine. In OOPSLA. 657--676.
- The Apache Software Foundation. 2018. Apache Spark—RDD Programming Guide. Retrieved from https://spark.apache.org/docs/latest/rdd-programming-guide.html.
- The Apache Software Foundation. 2018. Apache Spark MLlib. Retrieved from https://spark.apache.org/mllib/.
- The Apache Software Foundation. 2018. Apache Tomcat. Retrieved from http://tomcat.apache.org.
- The Apache Software Foundation. 2018. Lucene. Retrieved from https://lucene.apache.org.
- The Apache Software Foundation. 2018. Spark Configuration. Retrieved from https://spark.apache.org/docs/latest/configuration.html.
- The Apache Software Foundation. 2018. Spark Streaming. Retrieved from https://spark.apache.org/streaming/.
- The Apache Software Foundation. 2018. SparkContext API. Retrieved from https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/SparkContext.html.
- The Eclipse Foundation. 2016. Jetty. Retrieved from http://www.eclipse.org/jetty/.
- The Eclipse Foundation. 2018. Eclipse. Retrieved from https://www.eclipse.org.
- The Stanford Natural Language Processing Group. 2010. Stanford Topic Modeling Toolbox. Retrieved from https://nlp.stanford.edu/software/tmt/tmt-0.4/.
- Peter Thoman, Herbert Jordan, and Thomas Fahringer. 2013. Adaptive granularity control in task parallel programs using multiversioning. In Euro-Par. 164--177.
- TPC. 2010. TPC-C. Retrieved from http://www.tpc.org/tpcc/.
- Alex Villazón, Haiyang Sun, Andrea Rosà, Eduardo Rosales, Daniele Bonetta, Isabella Defilippis, Sergio Oporto, and Walter Binder. 2019. Automated large-scale multi-language dynamic program analysis in the wild. In ECOOP. 1--26.
- Adarsh Yoga and Santosh Nagarakatte. 2017. A fast causal profiler for task parallel programs. In ESEC/FSE. 15--26.
- Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI. 1--14.
- Jisheng Zhao, Jun Shirako, V. Krishna Nandivada, and Vivek Sarkar. 2010. Reducing task creation and termination overhead in explicitly parallel programs. In PACT. 169--180.
- Yudi Zheng, Lubomír Bulej, and Walter Binder. 2015. Accurate profiling in the presence of dynamic compilation. In OOPSLA. 433--450.
- Yudi Zheng, Andrea Rosà, Luca Salucci, Yao Li, Haiyang Sun, Omar Javed, Lubomír Bulej, Lydia Y. Chen, Zhengwei Qi, and Walter Binder. 2016. AutoBench: Finding workloads that you need using pluggable hybrid analyses. In SANER. 639--643.
- Gary M. Zoppetti, Gagan Agrawal, Lori Pollock, Jose Nelson Amaral, Xinan Tang, and Guang Gao. 2000. Automatic compiler techniques for thread coarsening for multithreaded architectures. In ICS. 306--315.
Analysis and Optimization of Task Granularity on the Java Virtual Machine