A Multifaceted Memory Analysis of Java Benchmarks

Java benchmarking suites like Dacapo and Renaissance are employed by the research community to evaluate the performance of novel features in managed runtime systems. These suites encompass various applications with diverse behaviors in order to stress-test different subsystems of a managed runtime. Therefore, understanding and characterizing the behavior of these benchmarks is important when interpreting experimental results. This paper presents an in-depth study of the memory behavior of 30 Dacapo and Renaissance applications. To realize the study, a characterization methodology based on a two-faceted profiling process of the Java applications is employed. The two-faceted profiling offers comprehensive insights into the memory behavior of Java applications, as it is composed of high-level and low-level metrics obtained through a Java object profiler (NUMAProfiler) and a microarchitectural event profiler (PerfUtil) of MaxineVM, respectively. By using this profiling methodology, we classify the Dacapo and Renaissance applications regarding their intensity in object allocations and object accesses, and the pressure they put on the LLC and main memory. In addition, several other aspects, such as the impact of the JVM on the memory behavior of the application, are discussed.


Introduction
The in-depth understanding of the memory behavior of standardized benchmarking suites, such as the traditional Dacapo [3] as well as the later Renaissance [23], is essential for the community of JVM researchers and practitioners. That said, memory profiling for Java applications is a challenging task due to the "noise" introduced by the JVM itself. The JVM interference lowers the accuracy of coarse-grain, black-box profiling, while, on the other hand, fine-grain wrapping of the application code imposes technical challenges as it is intrusive and requires source code recompilation.
Popular Java profilers offer a range of high-level metrics, including object allocations, threads, GC, and more. However, such tools typically provide only the high-level profile of a Java application, lacking correlation with the underlying hardware, and thus making the analysis susceptible to blind spots and inconsistent conclusions. For instance, the Dacapo sunflow is characterized by [15] as "memory-intensive" with respect to the total object allocations and the allocation rate. However, in this paper we discover that a close inspection of the last-level cache misses reveals that the application does not put high pressure on the main memory, due to its good data locality. Hence, although sunflow is indeed memory-intensive with respect to the number of total allocations, as we also observe, there is no negative impact on its performance. Such a case highlights the value of co-examining low-level, hardware-related metrics alongside typical high-level metrics to avoid misconceptions when profiling managed applications. Similarly, approaches that solely focus on low-level metrics lack the reverse correlation and consequently turn out to be insufficient for the same reasons. For the aforementioned reasons, a multifaceted characterization approach that combines metrics across different layers of the stack is needed.
This paper addresses this gap by proposing a methodology to characterize the memory behavior of a Java application by analyzing and correlating application and hardware profiling metrics. The proposed methodology is multifaceted as it employs two independent profilers of MaxineVM [13]: NUMAProfiler [21] and PerfUtil [21, 22]. The former profiler monitors the VM to collect high-level metrics, such as object allocations, accesses, etc. This information enables an initial characterization of the memory intensity of an application. The latter profiler leverages the Hardware Performance Counters of the system to collect low-level, hardware-related metrics. This information is helpful to confirm or reconsider the outcome of the initial characterization. The co-examination of high-level and low-level metrics reduces potential profiling blind spots and provides valuable insights for both the extensively studied Dacapo benchmarks and the newer Renaissance benchmark suite. Hence, this paper makes the following contributions:
1. It proposes a methodology to effectively characterize the memory behavior of a Java application based on a multifaceted profile that is composed of high-level and low-level metrics.
2. It presents a comprehensive study on the memory behavior of 30 Dacapo and Renaissance benchmarks. The study not only showcases the effectiveness of the proposed methodology through the discussion of selected examples but also results in the classification of the studied benchmarks into several categories, leveraging the multifaceted profile.
The rest of the paper is organized as follows. Section 2 presents the tools that are utilized to perform the multifaceted profiling. Section 3 describes the experimental methodology that is followed for profiling all benchmarks with the selected profilers, as well as the experimental testbed. Section 4 presents a rigorous study on the memory behavior of all benchmarks. In particular, Section 4.2 performs an initial characterization of the applications using the application-level metrics obtained by NUMAProfiler, whereas Section 4.3 expands the initial characterization while discussing the microarchitectural metrics obtained by PerfUtil. Finally, Section 5 presents the related work, and Section 6 conveys the conclusions.

Tooling support for multifaceted profiling
This section presents the two profiling tools that operate within MaxineVM, a metacircular research VM written in Java. NUMAProfiler is employed to collect the high-level metrics related to the application layer, while PerfUtil is used to obtain low-level microarchitectural metrics. Both profilers have been independently validated against other functionally equivalent ones [20]. Even though MaxineVM and its profiling tools do not constitute a production environment, this paper stands as an effective proof-of-concept and aims to point towards new profiling opportunities for managed runtimes. Figure 1 illustrates the software stack of MaxineVM along with an overview of the metrics that are collected by each profiler. More information about each profiler is given in the following sections.

NUMAProfiler
NUMAProfiler is an accurate Java object profiler for MaxineVM that is also enriched with NUMA awareness [21]. It probes the runtime layer of the VM in order to monitor object allocations, object accesses, survivor objects after garbage collection, and threads, as well as the NUMA placement of the virtual pages in the heap. NUMAProfiler exposes an API to the VM runtime. The API calls are injected into the proper components of MaxineVM. To avoid heap pollution as well as the interruption of the application's threads, NUMAProfiler maintains thread-local buffers off-heap to store the profiling data. Additionally, the API calls are used to lazily trigger the profiler mechanisms when necessary. Even though the profiling process is simplified due to this lazy approach, substantial overhead is introduced by the profiler (∼10×). However, that overhead can be tolerated since NUMAProfiler is intended for offline profiling purposes. While the NUMA-related features of the profiler pertain to NUMA hardware and its implications, this paper focuses solely on studying the memory behavior of Java applications within a traditional CPU architecture with uniform memory access. Hence, the NUMA architecture, which is orthogonal to the current study, is not within the scope of this work.
A feature of NUMAProfiler is the classification of object accesses by ownership. NUMAProfiler classifies an object access as shared or thread-local. Such a classification is an important application property [12] because it highlights the inter-thread dependencies; nevertheless, it is not a trivial task. An object access is considered shared if the thread that performed the access and the thread that acts as the object owner are different. MaxineVM is modified to store the owner of each object in the misc word of the object header. Hence, during the profiling of an object access, the owner thread is disclosed along with the thread that performs the access. For this work, NUMAProfiler is tuned to consider the owner of an object to be the thread that allocated it (the allocator thread).
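The classification just described can be sketched as follows. This is a minimal illustration, not NUMAProfiler's actual implementation: a plain map stands in for the owner bookkeeping that the real profiler performs in the misc word of the object header, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the shared vs. thread-local access classification.
// A map keyed by identity hash stands in for the header-based owner
// storage of the real profiler (a hypothetical simplification).
class AccessClassifier {
    // Maps an object's identity hash to the id of its allocator thread.
    private final Map<Integer, Long> ownerOf = new ConcurrentHashMap<>();

    // Called at allocation time: the allocating thread becomes the owner.
    void recordAllocation(Object obj, long allocatorThreadId) {
        ownerOf.put(System.identityHashCode(obj), allocatorThreadId);
    }

    // Called at access time: an access is "shared" when the accessing
    // thread differs from the allocator (owner) thread.
    boolean isShared(Object obj, long accessorThreadId) {
        Long owner = ownerOf.get(System.identityHashCode(obj));
        return owner != null && owner != accessorThreadId;
    }
}
```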

PerfUtil
PerfUtil is an accurate and flexible profiler which equips the VM itself with fine-grain utilization of the Hardware Performance Counters [22]. It interfaces with the perf [9] functionality of the Linux kernel and passes the control over to the Java code of the VM. PerfUtil offers a flexible and customizable way of monitoring microarchitectural metrics per thread, per core, or both. Similarly to NUMAProfiler, the functionality of PerfUtil is exposed to the VM runtime by an API. In addition, PerfUtil supports time-multiplexing, which enables a large set of events to be counted simultaneously, while it operates with low overhead [21].
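Under time-multiplexing, each counter runs for only a fraction of the measurement interval, so its raw value must be extrapolated by the ratio of the time the event was enabled to the time it was actually running; this is the standard perf scaling rule. A minimal sketch (the class and method names are illustrative, not PerfUtil's API):

```java
// Standard scaling rule for a time-multiplexed perf counter:
// the estimated count is the raw count scaled by the ratio of the
// time the event was enabled to the time it was actually scheduled.
class PerfScaling {
    static long scale(long rawCount, long timeEnabledNs, long timeRunningNs) {
        if (timeRunningNs == 0) {
            return 0; // the counter never ran; nothing to extrapolate from
        }
        return Math.round(rawCount * ((double) timeEnabledNs / timeRunningNs));
    }
}
```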
Experimental Setup

Testbed Characteristics
Table 1 shows the hardware and software characteristics of the testbed that we used. The testbed is a Dell PowerEdge R620 server that contains a dual-socket Intel Xeon processor with two NUMA nodes, resulting in 32 cores in total.
To ensure that any NUMA-related effect that might influence performance is excluded when studying the memory behavior of the selected benchmarks, we employ the Single Node configuration (see Table 1). That configuration establishes a Uniform Memory Access (UMA) environment for performing our experiments by utilizing only one NUMA node. Moreover, the Intel hyper-threading technology is disabled to prevent any additional variations with regard to the performance or the memory behavior of the benchmarks.
To avoid dynamic voltage and frequency scaling (DVFS), the CPU frequency is fixed at 2.9 GHz via the ACPI CPU frequency driver.

Benchmark Suites
The latest pre-built maintenance release of the Dacapo benchmarks (dacapo 9.12 MR1) [4] was used, while the pre-built 0.11.0 release was used for Renaissance. The number of iterations of the benchmarks was selected based on well-known good practices [15, 23] to reach a warmed-up state, and it was augmented by ten additional runs to include enough run-steady iterations in our measurements. Moreover, Dacapo allows the user to configure the input size and the deployed threads, with some exceptions (i.e., avrora) where the number of threads is determined by the input size. The benchmarks of Renaissance have a "test" (small) and a "jmh" (default/large) input size, and most of the benchmarks aim to automatically deploy worker threads equal to the number of available cores. In our experiments, we used the largest input size and deployed eight threads (wherever possible), which corresponds to the number of cores in a single node. Table 2 lists the studied Dacapo and Renaissance applications, along with their run configurations. Note that some applications (batik, eclipse, tomcat, tradebeans, tradesoap, dec-tree, finagle-chirper, finagle-http, page-rank) are omitted from the performance evaluation due to various failures of MaxineVM, including memory corruption (segfaults) or concurrency bugs that lead to livelocks.

Experimental Methodology
The experiments were conducted in a two-step process. Each step corresponds to an individual build of MaxineVM equipped exclusively with one of the two profilers, to prevent interference between the profilers that may skew the results. The first step deploys MaxineVM with NUMAProfiler to collect various object-related metrics, while the second step runs MaxineVM with PerfUtil to collect numerous microarchitectural metrics. Note that the non-determinism of Dacapo and Renaissance was experimentally observed to have minimal impact on our results. We verified this by comparing and contrasting the two runs as follows. First, we compared the instruction count between runs and the total cycles required to complete the runs. Then, we compared the cache behavior of the benchmarks across the two runs, ensuring that the miss ratios are similar. Naturally, the two runs have slightly different absolute numbers; however, the behavior of the benchmarks was almost identical. To reach parity between the two runs, we ensured that all timed executions were in a hot state (i.e., almost no recompilation was taking place). In addition, we ran all experiments with the same configurations and with large heap sizes to ensure minimal interference from the GC. Furthermore, when comparing against OpenJDK runs (Section 4.6), we configured OpenJDK to behave as similarly as possible to MaxineVM by deactivating optimizations that are not present in MaxineVM (e.g., escape analysis, compressed pointers, etc.). Finally, the multiplexing feature of PerfUtil enabled concurrent monitoring of thirty-two perf events in a single run.

Methodology
Performing such a multifaceted performance analysis, involving two different profilers, produces a large number of data points, which may be difficult to navigate and from which to draw conclusions. Below we provide a methodology, based on our experience, on how to interpret those numbers in order to characterize various benchmarks.
In general, we can follow two approaches: bottom-up or top-down. In the bottom-up approach, we start by looking into the micro-architectural characterization of a benchmark, trying to understand which factors affect its performance. Then, we move to a higher level by comparing and contrasting the numbers achieved by the high-level profiler in order to validate or complement our assumptions and understanding based on the low-level metrics. By examining the results from the low-level profiler in Table 4, the first metric we focus on is the CPI (Cycles Per Instruction). In general, the larger the CPI, the slower the benchmark is. If the CPI is high, we typically check the three main factors that affect performance on modern processors: branch misprediction ratio, cache miss ratio, and TLB miss ratio.
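This first bottom-up step can be expressed directly over raw counter values; the following is a minimal sketch with illustrative names, not PerfUtil's API:

```java
// Sketch of the bottom-up first step: derive CPI, and compute the
// ratios that are checked when CPI is high (branch misprediction,
// cache miss, and TLB miss ratios all share the same form).
class BottomUpMetrics {
    static double cpi(long cycles, long retiredInstructions) {
        return (double) cycles / retiredInstructions;
    }

    static double missRatio(long misses, long accesses) {
        return (double) misses / accesses;
    }
}
```

For example, a CPI well above the suite's average flags a benchmark for the three ratio checks described above.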
Based on the observed numbers, we hypothesize about the behavior of the benchmark, and then we try to validate those hypotheses by comparing and contrasting the low-level results with those from the high-level profiler. For example, if we notice that a benchmark has high cache miss ratios, which correlate with high CPI, we examine its allocation rate and size to determine whether this is the root cause of the problem or whether the benchmark just has irregular memory access patterns. The same logic can be applied to other metrics.
In the top-down approach, we follow the reverse methodology. We first look into the high-level performance metrics of NUMAProfiler to get a high-level understanding of the benchmark, and then we start delving deeper into its performance characteristics. By examining the high-level performance metrics first, we can identify potential performance bottlenecks of a benchmark and then focus on the low-level microarchitectural profiling results that regard these specific potential bottlenecks (e.g., high allocation rates may result in high cache or TLB miss ratios).
In the following subsections, after presenting the collective results from both profilers for all benchmarks, we apply the methodology across two particular benchmarks as a guideline (Section 4.5).

Characterization With High-Level Metrics
The object allocations, accesses, and their rates over time are quite indicative regarding the memory intensity of an application. However, they do not always lead to well-rounded conclusions, as highlighted in this section. Table 3 (inspired by Lengauer et al. [15]) outlines several object-related metrics (as obtained via NUMAProfiler) for each application. The notable observations and findings regarding those metrics are discussed in the following subsections. The reported numbers derive from the average of ten run-steady iterations (after warm-up), and the maximum value of each metric is highlighted in bold. Moreover, note that the NUMAProfiler numbers inevitably incorporate MaxineVM-internal objects due to metacircularity. Thus, any observed difference against a HotSpot-based profiler (i.e., AntTracks [14, 15]) is expected and attributed to the effect of metacircularity [20].

Object Allocations
Count & Rate. The total count of object allocations and the memory footprint indicate how much memory is allocated per application. However, they do not highlight the intensity of the memory allocation. The object count and object size per second metrics should be taken into consideration towards characterizing the memory intensity of a managed application. An application that allocates new objects at a high rate is very likely to put excessive pressure on the memory system.
Dacapo: H2 allocates the most objects and memory overall; however, it has a low allocation rate. This is due to the large number of instructions (and consequently execution time) that h2 has (see Table 2). Sunflow allocates fewer and smaller objects, but it is the most allocation-intensive application in terms of both objects and memory size. Jython is the most intensive single-threaded benchmark, both in terms of objects and memory size allocation. Lusearch-fix has been introduced as an update to lusearch, bearing a fix in the Lucene platform that reduces object allocations; however, no such difference is observed.
Renaissance: Akka-uct, naive-bayes, neo4j-analytics, h2, and gauss-mix allocate the most objects in total (per iteration). Akka-uct allocates almost double the objects of naive-bayes, which is the second-highest-allocating application. Mnemonics and scala-doku are the single-threaded applications with the most total object allocations. Naive-bayes, akka-uct, gauss-mix, neo4j-analytics, db-shootout, and scrabble are the most intensive in terms of both object and size allocation rate.

Memory Footprint & Object Layout. The overall object allocations do not necessarily reflect the memory allocation footprint size (per iteration, in MB).
Dacapo: Luindex allocates the largest objects on average, and it contains the most and longest arrays. However, it is a single-threaded application with the smallest memory footprint among the Dacapo applications. Lusearch and xalan follow in terms of average object size, also showing a higher array rate and average array length than the geometric mean of the Dacapo applications. Xalan is an application that has large objects on average. Even though it performs fewer allocations than sunflow, it ends up with a higher memory footprint.
Renaissance: Fj-kmeans has by far the largest average object size (1.13 kB) among the Renaissance applications, while log-regression, db-shootout, and movie-lens follow. It is notable that although fj-kmeans and db-shootout allocate fewer objects than naive-bayes or neo4j-analytics, they end up with a higher memory footprint, which is apparently related to object size. In addition, fj-kmeans is an array-dominated application, with 63.9% of its allocations being arrays, while db-shootout and movie-lens follow.
The discussion above highlights that such metrics are crucial, especially when the origin of the observed memory footprint matters (i.e., GC optimizations, an optimization targeting large objects, heap size tuning, and more). Moreover, the Object Layout metrics reveal additional properties of an application which are very likely to affect its memory and/or overall behavior. For example, large objects (as in lusearch and xalan) are likely to span across two memory pages. Such applications can stress the TLBs and page tables more than others. This type of memory pressure can be a source of inefficiency in the context of NUMA [1, 10]. For example, lusearch and xalan have been proven unfriendly to the

Object Accesses.
The object accesses highlight the application-memory relation degree as observed from the application layer. The columns "Object Accesses" and "Sh. Accesses" of Table 3 present a collection of object access metrics, as well as the percentage of shared accesses. The latter refers to the number of accesses performed by a thread other than the "owner" of the object (recall Section 2.1), as a percentage of total object accesses.
Dacapo: As can be observed in Table 3, h2 performs the most object accesses in total, while sunflow, xalan, and avrora have more than 2x more accesses than the geomean.
Sunflow performs the most object accesses per second, followed by lusearch and xalan. All applications are read-dominated, with sunflow having the most reads per write (30). Avrora shows the highest shared object R/W access rate. Sunflow follows, but it has only shared read accesses. A high percentage of shared read accesses is an indication of the existence of the producer-consumer memory access pattern. Finally, avrora, h2, and xalan show the highest percentage of shared writes.
Renaissance: Akka-uct, fj-kmeans, reactors, and neo4j-analytics perform by far the most object accesses. Akka-uct and fj-kmeans show the highest object access rate, with philosophers, neo4j-analytics, scrabble, and naive-bayes following. All applications are dominated by read accesses, with fj-kmeans having 120 reads per write. On the other hand, db-shootout and neo4j-analytics have the most balanced R/W ratio. Many Renaissance applications show a considerable degree of shared accesses, with reactors having the most shared reads and writes. Even though actor frameworks aim to guarantee workload concurrency, their asynchronous, non-blocking message-passing infrastructure inevitably leads to accessing objects "owned" (allocated) by other threads. Therefore, the degree of shared accesses for reactors and akka-uct is justified. Nevertheless, it should be noted that both reactors and akka-uct are artificial stress-test benchmarks. As such, their observed behavior might not fully represent such frameworks in general. On the contrary, als, log-regression, and naive-bayes show negligible shared object accesses, thus potentially denoting data parallelism.
The above analysis makes clear that profiling shared accesses can provide insights regarding the internal data dependencies of an application. Such a property, especially with respect to writes, is of high importance since it is tightly related to the scalability of an application (i.e., on a NUMA system).

Object Metrics Summary. So far, we have surveyed all managed applications with respect to several high-level memory metrics related to object allocations and object accesses. Figure 2 groups those applications by the allocation and access rates, and filters out those below the geometric mean using the heuristic of Equation (1): the left and right circles contain applications that exceed the geometric mean of the object allocation rate and the object access rate, respectively. The intersection of the two circles highlights the applications that are intensive both in terms of object allocations and object accesses. The emerging classification confirms already known trends for Dacapo [12, 15]. However, a small differentiation is observed in terms of absolute numbers, as a side effect of the metacircular runtime, the slightly different run configurations, and/or the different (updated) version of the benchmark suite. Unlike previous studies [12, 15], we observe that although the above metrics are necessary, they are not sufficient to properly characterize the memory behavior of a managed application. Thus, the next section enhances our study with a microarchitectural analysis of the metrics provided by PerfUtil.
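As a minimal sketch of this grouping heuristic, the following keeps only the applications whose rate (allocation or access) exceeds the geometric mean of that rate across all applications; the class and method names are illustrative, not the paper's actual implementation:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the grouping heuristic behind Figure 2: an application is
// kept in a circle only if its rate exceeds the geometric mean of that
// rate over all applications.
class GeomeanFilter {
    static double geomean(List<Double> values) {
        double logSum = values.stream().mapToDouble(Math::log).sum();
        return Math.exp(logSum / values.size());
    }

    static List<String> aboveGeomean(Map<String, Double> ratePerApp) {
        double gm = geomean(List.copyOf(ratePerApp.values()));
        return ratePerApp.entrySet().stream()
                .filter(e -> e.getValue() > gm)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

Applying the filter once per rate yields the two circles; the intersection of the resulting sets is the "intensive in both" group.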

Characterization With Low-Level Metrics
This section analyzes and discusses the findings derived from PerfUtil. The insights provided by such a low-level profile complement the findings of Section 4.2 while deepening the understanding of the Dacapo and Renaissance applications. To perform the profiling with PerfUtil, we followed the same experimental procedure as with NUMAProfiler, described in Section 4.2.

Overview of Hardware Instructions.
Table 2 shows the distribution of Arithmetic (integer and floating point), Branch, and Memory Instructions per benchmark. The collected metrics are presented as a percentage over the total number of retired instructions. Table 2 also reveals the ratio between Memory, Arithmetic, and Branch Instructions, as well as whether an application is dominated by read or write accesses. PerfUtil counts the Total Retired Instructions, L1D Reads, L1D Writes, and Branch Instructions; thus, the number of Arithmetic Instructions is calculated by subtracting the L1D Reads, L1D Writes, and Branch Instructions from the Total Retired Instructions. Read and write ratios settle towards read operations, since the L1D read instruction percentage is higher than that of L1D writes for all applications. However, Renaissance applications show a greater diversity than Dacapo. For instance, als, chi-square, movie-lens, and naive-bayes (which all belong to the Apache Spark family) are below the minimum percentage of memory instructions observed in Dacapo, while future-genetic is beyond the maximum one. Nevertheless, the geometric mean of the percentage of memory instructions (Table 2, Total Mem.) is, as expected, ∼45%, and all applications are dominated by read accesses. The L1D R/W ratio tends to be aligned with the R/W ratio of Table 3, even though minor misalignments are visible, probably due to the "noise" introduced by the VM infrastructure. Note that the observed total instructions of a managed application are essentially a mix of instructions from the application and the VM itself. Since there is no obvious way to safely estimate and exclude the latter, a co-interpretation of the results derived from NUMAProfiler and PerfUtil is necessary.
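The derivation of the arithmetic-instruction count described above amounts to a subtraction over the counted categories; a minimal sketch with illustrative names:

```java
// Arithmetic instructions are derived by subtracting the directly
// counted categories (L1D reads, L1D writes, branches) from the total
// retired instructions; percentages are reported over the total.
class InstructionMix {
    static long arithmetic(long totalRetired, long l1dReads, long l1dWrites, long branches) {
        return totalRetired - l1dReads - l1dWrites - branches;
    }

    static double percentOfTotal(long category, long totalRetired) {
        return 100.0 * category / totalRetired;
    }
}
```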

Data Locality & Cache/Memory Pressure. Although the cache hierarchy aims to fill the latency gap between the CPU and main memory, the latter often remains a source of delays in the execution of a program. Due to complex features of modern hardware (e.g., out-of-order execution, multiple cache levels, shared memory, etc.), such a characterization lacks a strict definition (or concrete methodology) and can only rely on a multifaceted profile that comprises numerous metrics. However, the co-examination of the CPI along with memory hierarchy and Branch Prediction Unit (BPU) pressure and locality metrics can reveal useful insights. The larger the MPKI (misses per kilo instructions) value is, the "heavier" the load for the corresponding memory hierarchy level is. Consequently, this set of "pressure" metrics can be used to assess the memory-bound degree in correlation to other applications. LLC MPKI reveals that an application's object allocation and access intensiveness are not necessarily reflected in main memory pressure, which is counter-intuitive. For instance, in Dacapo, sunflow is the most intensive application in terms of object allocations and object accesses (recall Section 4.2.1). However, Table 4 shows that the largest pressure on main memory among the Dacapo benchmarks is caused by h2. On the contrary, sunflow seems to put the least pressure, among the multithreaded applications, on the memory hierarchy, as the LLC/memory pressure and CPI metrics indicate. This is justified by the good spatial and/or temporal locality of sunflow's working data. Sunflow's behavior can also be observed in the LLC and memory accesses per kilo object operation ratios, which are below the geomean and among the lowest. Consequently, the accesses per kilo object operation metrics are quite indicative regarding the locality of working data, as they compare object operations against actual cache/memory pressure. Fop is the most LLC- and main-memory-intensive among the single-threaded applications (fop, jython, luindex), which is also confirmed by its CPI. It is notable that although avrora and pmd are the most LLC-intensive applications, they finally put low pressure on memory, denoting that their data set successfully fits into the larger LLC (compared to L2). After examining the LLC and memory pressure metrics, avrora's CPI seems to be affected more by the BPU, DTLB, and cache rather than main memory (see Table 4). The high DTLB pressure of lusearch and xalan is probably related to their large objects.
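The pressure metrics used throughout this section normalize miss or access counts either by retired instructions (MPKI) or by the application-level object operations reported by NUMAProfiler; a minimal sketch of both, with illustrative names:

```java
// MPKI: misses per kilo (thousand) retired instructions, the "pressure"
// metric used for each memory hierarchy level and for the BPU.
class Pressure {
    static double mpki(long misses, long retiredInstructions) {
        return 1000.0 * misses / retiredInstructions;
    }

    // Accesses per kilo object operations: cache/memory traffic
    // normalized by application-level object operations, indicative
    // of working-data locality.
    static double accessesPerKiloObjectOps(long accesses, long objectOperations) {
        return 1000.0 * accesses / objectOperations;
    }
}
```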
Renaissance: As can be observed in Table 4, reactors, scrabble, dotty, and scala-stm-bench7 have high CPI values. Such a fact could imply stalls due to memory and, consequently, memory-boundness. However, this observation contradicts Figure 2, where only scrabble and scala-stm-bench7 seem to be "object allocation and access intensive". Therefore, the assessment of memory intensity cannot rely only on object-level metrics. An application might be memory-bound due to other reasons (i.e., lack of memory locality), even though it does not significantly allocate or access objects. In particular, reactors shows 2.5x more LLC accesses per kilo object operations than the geomean, which implies a lack of data locality, while dotty shows a very high BPU MPKI, which indicates irregularity in memory access patterns (recall the first paragraph of Section 4.3.2, which explains how the BPU MPKI is related to irregular memory access patterns). Similarly, par-mnemonics has been classified as memory-bound [23]; however, its CPI is well below 1, hence memory is not the most decisive factor for the performance of this application. In the case of als, the inspection of the CPI value confirms that it is compute-bound. Movie-lens, which has also been classified as compute-bound [23], lacks locality, as it shows 2x more LLC accesses per kilo object operations than the geomean. Hence, the performance of this application is also influenced by memory.
Moreover, akka-uct, gauss-mix, naive-bayes, db-shootout, scrabble, neo4j-analytics, and philosophers are in the intersection of Figure 2; hence, they are expected to be memory-intensive applications. Nevertheless, they diverge, as not all of them put pressure on both the LLC and memory. Akka-uct, gauss-mix, naive-bayes, db-shootout, and neo4j-analytics put significant pressure on memory, as expected. On the contrary, scrabble and philosophers put significant pressure only up to the LLC. It is very likely that scrabble and philosophers either benefit from locality in the LLC and/or have a smaller working data set that fits into the LLC. Although we cannot safely estimate the exact reason for each, note that the low LLC and memory accesses per kilo object operation of philosophers indicate good data locality. On the other hand, scrabble shows a high BPU MPKI, a symptom of irregular memory access patterns. Irregular memory access patterns in scrabble are also reported by [21], based on the fact that this application deploys a centralized HashSet data structure for its working data. In addition, rx-scrabble (which implements the same algorithm as scrabble but uses an alternative framework), dotty, mnemonics, par-mnemonics, scala-kmeans, philosophers, and log-regression are candidates for irregular memory patterns due to their high BPU MPKI.
A similar diversity is observed for the applications in the right circle of Figure 2. Fj-kmeans and scala-stm-bench7 put significant pressure on both the LLC and memory, while future-genetic stresses only the LLC. Fj-kmeans is the second most intensive application in terms of object accesses, it is a read-dominated application, and, according to those data, it seems to benefit from data locality since it has the lowest LLC accesses per kilo object operations. Nevertheless, it is notable that [21] characterizes fj-kmeans as data-locality bound in the context of a NUMA system. Considering this fact, we observe that this application indeed benefits from data locality in a unified LLC, potentially due to its limited cached data size. However, this is not true in a distributed LLC environment (like a NUMA system), where cache coherency protocols significantly impact the locality of data in the LLC. The "object allocation intensive" chi-square and scala-doku do not put significant pressure on either the LLC or memory, since object allocation operations are 3-5 orders of magnitude fewer than object accesses. Log-regression, naive-bayes, gauss-mix, and chi-square, which are Spark applications, show an LLC miss rate greater than 50%. Such poor data locality inevitably brings to the spotlight the effect of the Spark engine when co-located with worker threads over the same limited CPU and cache resources. However, only gauss-mix and naive-bayes end up with high memory pressure among the aforementioned applications. Akka-uct has a lower CPI than reactors, although the former has almost three times more accesses to main memory. This counter-intuitive observation is due to the domination of memory instructions in reactors (see Table 2), and because reactors has 2x more LLC accesses per kilo object operations than akka-uct. This comparison highlights the high complexity of memory behavior analysis, denoting that memory overhead can derive from any component of the stack.

The Benefits of Multifaceted Profiling
The above characterization of each application according to its memory behavior is illustratively summarized in Figure 3. The left part of this figure depicts the "view" obtained by each profiling tool individually, while the right part illustrates the "view" achieved by co-utilizing those tools. This figure demonstrates the benefits of multifaceted profiling through a side-by-side comparison of the left and right parts. For example, the "view" of NUMAProfiler indicates that sunflow and philosophers are memory intensive applications; however, as the multifaceted profiling reveals, they are neither LLC nor main memory intensive. On the other hand, dotty, fj-kmeans, and scala-stm-bench7 put considerable pressure on the LLC and the main memory, even though they do not perform many object allocations. Therefore, it is clear that the proposed methodology broadens the profiling view and avoids misconceptions as well as blind spots; hence, it provides new opportunities for more effective profiling of managed applications.

How to Navigate through the Numbers
To exemplify the methodology explained in Section 4.1, we use the scrabble benchmark from the Renaissance suite as an example of a bottom-up analysis. Examining the results from the low-level profiler in Table 4, the first metric we focus on is the CPI. In general, the larger the CPI, the slower the benchmark. In the case of scrabble, the CPI is high compared to the other benchmarks (1.46), which means that this benchmark, for some reason(s), does not execute fast. The next step is to discover why scrabble behaves this way. For this, we typically check the three main factors that affect performance on modern processors: branch misprediction ratio, cache miss ratio, and TLB miss ratio. As shown in Table 4, scrabble has high BPU (4.11), DTLB (1.05), L1 (25.29), and L2 (4.79) MPKI, which justify the high CPI. At this stage, we conclude that scrabble puts pressure both on the CPU's front-end (branch predictor) and on its back-end (memory subsystem), which means that the benchmark has significant and unpredictable control-flow divergence that may also influence its memory access behavior. Although the MPKI is high for L1 and L2, we observe that the LLC, despite having a high miss rate, is not amongst the worst performers, which means one of the following: 1) either scrabble is not too memory intensive and its dataset can fit into the caches, or 2) it is intensive but the hardware prefetcher does a good job in fetching the correct data upon LLC misses. To understand which category scrabble belongs to, we next investigate the numbers obtained from the high-level profiler, shown in Table 3. As shown in Table 3, scrabble has a fairly average-sized dataset (2.9 GB) compared to the rest of the benchmarks and does not have any write-shared accesses. Combining all the findings produced by the two profilers, we understand that scrabble: 1) has irregular branch behavior, 2) has irregular memory behavior through the cache hierarchy, and 3) although it is memory intensive, it does not put significant pressure beyond the LLC. Therefore, by combining the two profilers we derive that scrabble exhibits an irregular memory pattern that does not extend beyond the LLC but still negatively affects its performance. The irregular memory access behavior is probably triggered by unpredictable code paths that access different parts of the caches.
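The bottom-up triage above can be sketched as a small decision procedure: start from the CPI, then attribute the slowdown to front-end and back-end factors. The thresholds below are illustrative choices made for this sketch, not values prescribed by the paper.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the bottom-up triage: from CPI down to the
// front-end (BPU) and back-end (TLB, cache hierarchy) contributors.
public class BottomUpTriage {

    public static List<String> triage(double cpi, double bpuMpki, double dtlbMpki,
                                      double l1Mpki, double l2Mpki, double llcMissRate) {
        List<String> findings = new ArrayList<>();
        if (cpi > 1.0) findings.add("slow: high CPI");
        if (bpuMpki > 2.0) findings.add("front-end: irregular control flow");
        if (dtlbMpki > 1.0) findings.add("back-end: TLB pressure");
        if (l1Mpki > 20.0 || l2Mpki > 4.0) findings.add("back-end: poor L1/L2 locality");
        if (llcMissRate < 0.5) findings.add("pressure largely contained within the LLC");
        return findings;
    }
}
```

Feeding in scrabble's Table 4 values (CPI 1.46, BPU 4.11, DTLB 1.05, L1 25.29, L2 4.79) triggers all of the first four findings, matching the narrative above.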
As an example of applying the top-down approach, we use the sunflow benchmark from the Dacapo suite, for which we have prior knowledge from existing works. The findings in Table 3 verify prior work. Indeed, sunflow has a high object and data allocation rate: 43K objects/sec and 1.8 GB/sec, respectively. Looking at these numbers, we may assume that sunflow puts significant pressure on memory, which may result in high cache miss ratios and hence low performance. To validate this hypothesis, we contrast the high-level numbers with the low-level ones from Table 4. As we see, although sunflow has large object and data allocation rates, its performance is amongst the best in the Dacapo suite, since its CPI is very low (0.59). This means that although it is regarded as memory intensive, this does not reflect negatively on its performance, since both its L1 and L2 miss ratios are very low. However, we observe that its LLC miss ratio is higher than that of the rest of the benchmarks, which is natural since it has a large dataset. Hence, the CPU will fetch the data from memory into the LLC upon request. However, because sunflow exhibits good memory locality, as soon as the data enter the cache subsystem (via a miss request or via the hardware prefetcher), it makes good (re-)use of them. Therefore, from the micro-architectural point of view, although sunflow fetches data from memory due to its large dataset, it does so in a way that does not negatively affect performance thanks to extremely good data locality.
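The top-down validation step can likewise be sketched as a single predicate: the high-level "memory intensive" hypothesis is confirmed only if the low-level counters also show a performance penalty. The thresholds and parameter names are illustrative assumptions for this sketch.

```java
// Illustrative sketch of the top-down check: a high allocation rate alone does
// not imply poor performance; the hypothesis is rejected when the low-level
// metrics show good locality (low CPI and low L1/L2 miss ratios).
public class TopDownCheck {

    public static boolean memoryIntensityHurtsPerformance(
            double allocBytesPerSec, double cpi,
            double l1MissRatio, double l2MissRatio) {
        boolean allocIntensive = allocBytesPerSec > 1e9;          // > 1 GB/s allocated
        boolean slow = cpi > 1.0 && (l1MissRatio > 0.1 || l2MissRatio > 0.1);
        return allocIntensive && slow;
    }
}
```

For a sunflow-like profile (1.8 GB/sec allocation rate, CPI 0.59, very low L1/L2 miss ratios), the predicate is false: allocation intensive, yet not performance bound, which is exactly the conclusion drawn above.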

The Impact of MaxineVM on Characterization
This study aims to characterize the memory behavior of the Dacapo and Renaissance benchmark suites using MaxineVM and its profiling tools. Naturally, as Blackburn et al. pointed out in their 2006 study [3], "we can draw dramatically divergent conclusions by simply selecting a particular iteration, virtual machine, heap size, architecture, or benchmark" [3]. To quantify the effect of MaxineVM on our study, we profile all applications with "perf-stat" on OpenJDK 8 and OpenJDK 11 (Table 5 shows the VM configurations in detail for machine 1). Then, we apply the same characterization heuristics on the obtained results and finally compare and discuss the outcome. The objective of this analysis is to understand whether the benchmarks exhibit the same behavior across different MREs, rather than to compare the exact metrics.
For this comparison, we exercised two OpenJDK configurations: 1) OpenJDK 8 with the Serial collector, as the closest configuration to MaxineVM, and 2) OpenJDK 11 with the G1 GC, as the latest version compatible with Renaissance v0.11. To bring the OpenJDK configurations closer to the MaxineVM one, we disabled both escape analysis and compressed pointers. In addition, we exercised large heaps to minimize GC interference. Lacking an equivalent to the PerfUtil profiling tool in OpenJDK, we perform all profiling measurements for all VMs (including MaxineVM) with perf-stat. However, perf-stat does not allow fine-grain profiling in order to exclude warmup iterations. To minimize this effect, we run each application with an increased number of iterations (as in Table 2). Table 6 presents the comparison of the memory behavior characterization between MaxineVM (baseline), OpenJDK 8, and OpenJDK 11. In addition, Table 7 in Appendix A provides an extended collection of metrics for OpenJDK 11. The memory behavior characterization is conducted by examining the L2 MPKI and LLC MPKI as measured with perf-stat, in order to classify each application as LLC intensive and/or memory intensive, accordingly. Each application with a value greater than the geomean of its run configuration is considered LLC/memory intensive. In essence, we classify each benchmark by comparing it against the geomean of all benchmarks in its run. Then, we examine whether the result of this comparison is consistent across all configuration runs. For example, if we assess the L2 MPKI metric for avrora, we see that in MaxineVM the value is 12.09 (with a 4.07 geomean), in OpenJDK 8 the value is 12.03 (with an 8.15 geomean), and in OpenJDK 11 the value is 10.52 (with a 7.37 geomean). For all configurations, the observed values are significantly larger than the geomean values of their configurations; hence avrora exhibits the same behavior across configuration runs. The "Match" column highlights whether the characterization of a benchmark with OpenJDK 8 and/or 11 matches the characterization of the same benchmark with MaxineVM.
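The geomean-based heuristic described above can be sketched in a few lines; the class is our own illustration of the classification rule, not code from the paper's tooling.

```java
// Illustrative sketch of the classification heuristic: a benchmark is flagged
// LLC/memory intensive when its MPKI exceeds the geometric mean of all
// benchmarks measured in the same run configuration.
public class GeomeanClassifier {

    public static double geomean(double[] values) {
        double logSum = 0.0;
        for (double v : values) logSum += Math.log(v);
        return Math.exp(logSum / values.length);
    }

    public static boolean isIntensive(double value, double[] runValues) {
        return value > geomean(runValues);
    }
}
```

Applied per configuration, the rule reproduces the avrora example: 12.09 against a 4.07 geomean in MaxineVM, 12.03 against 8.15 in OpenJDK 8, and 10.52 against 7.37 in OpenJDK 11 all yield "intensive", hence a match across runs.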
The characterization regarding LLC intensiveness matches in 71% of all benchmarks between MaxineVM and both OpenJDK 8 and OpenJDK 11. Moreover, the characterization regarding memory intensiveness matches in 81% of all benchmarks between MaxineVM and both OpenJDK 8 and OpenJDK 11. Are the results obtained from MaxineVM transferable to other VMs? Based on our experiments, the majority of the benchmarks (71%-81%) exhibit the same behavior across different VMs and configurations, albeit with different absolute numbers. For the remaining benchmarks that do not demonstrate the same trends across VMs, we observe that they behave differently even between runs within the same VM. For example, the results of scala-doku and philosophers would lead to different characterizations even between OpenJDK 8 and OpenJDK 11. In fact, the results in MaxineVM fall in between those of OpenJDK 8 and OpenJDK 11. Therefore, for these specific benchmarks that exhibit great sensitivity across different VMs and configurations, ad-hoc characterization is required in order to draw safe conclusions for a specific study.

Related Work
Many research efforts [5,6,8,12,14-16,24-26] have aimed to analyze the performance-critical properties of managed applications. Some of them have proposed new profiling tools for the JVM, such as AntTracks [14], AkkaProf [25], FJProf [26], and OJXPerf [16], as well as novel profiling techniques, such as bytecode instrumentation [12], runtime-driven JVM instrumentation [12], application code-wrapping [6], and BottleGraphs [8]. More specifically, Kalibera et al. [12] exploited bytecode instrumentation and runtime-driven JVM instrumentation to study a wide set of concurrency metrics for the Dacapo benchmark suite. DuBois et al. [8] leveraged BottleGraphs and studied the exhibited parallelism of the Dacapo benchmarks. Lengauer et al. [15] utilized AntTracks to study the memory behavior of the Dacapo, Dacapo Scala, and SPECjvm2008 benchmark suites. AkkaProf and FJProf are two special-purpose profilers [25,26] for effectively profiling Akka and Fork/Join-based Java applications. Rossa et al. [24] presented P3, a tool for the JVM that exploits bytecode instrumentation and offers a high-level profile related to concurrency, synchronization, etc. Those studies mainly focus on high-level application profiling and metrics, and as a result, they lack correlation with the low-level hardware metrics proposed in our study.
In addition, the choices in the available tooling infrastructure regarding the utilization of Hardware Performance Counters for Java applications are notably limited. The few available implementations either lack the ability to perform fine-grain profiling, such as Oracle Solaris Studio [19], Intel VTune [7,11], and JMH [27], or are not actively maintained, such as JRockit [18] and JikesRVM [2], which both lack support beyond Java 6. One exception is OJXPerf by Li et al. [16], a low-overhead profiler based on perf that binds microarchitectural events to Java objects and targets memory bloat. Nevertheless, the correlation of low- with high-level metrics that OJXPerf offers is limited to object and method scopes; hence it lacks visibility into the overall memory behavior of the application.
Unlike the aforementioned studies and tools, which have not presented a multifaceted approach, Deshmukh et al. [5] deployed perf and LTTng [6] to obtain a collection of metrics from the microarchitectural as well as the runtime layer. However, they focused on the Common Language Runtime (CLR) and on .NET applications.

Conclusion
This paper studied the memory behavior of 30 Dacapo and Renaissance applications. For this purpose, a characterization methodology based on a multifaceted profile of a Java application was proposed. The profile is composed of high- and low-level metrics collected by two profilers of MaxineVM.
The findings of this work were leveraged to classify the memory behavior of the studied applications into several categories. The analysis complements other related studies by revealing additional insights for the already extensively studied Dacapo applications. Moreover, the study contributes to the understanding of the memory behavior of the recently introduced Renaissance benchmarks.
This work demonstrates how a characterization methodology that moves away from a single-faceted profiling approach can effectively broaden the analysis perspective by avoiding some misconceptions and blind spots. Both the proposed characterization methodology and the tooling support for multifaceted profiling can be considered transferable items that can be applied to other MREs besides MaxineVM.
Finally, the work presented in this paper aims to initiate new research opportunities, such as profiling studies and optimization approaches, for MREs and to provide the research community with useful insights regarding the memory behavior and characteristics of the most common Java benchmarks.

Figure 2. Memory Intensive Applications in terms of Object Allocations and Accesses.

Figure 3. Left: Profiling view from each tool individually. Right: Profiling view by co-utilizing the tools.

Table 1. Hardware and Software Configurations.

Table 3. Object Allocations, Layout and Accesses in Dacapo & Renaissance.
The Page Migration mechanism of Linux [22] is tightly related to the TLB and page table. Lusearch slows down by ∼13%, while xalan's remote-node accesses increase by ∼300% when Page Migration is enabled.

Table 4. Cache/Memory Locality and Pressure in Dacapo & Renaissance.

Table 6. Comparison of the memory behavior characterization with OpenJDK 8 and 11 against MaxineVM.

Table 7. Extended collection of profiling metrics for Dacapo & Renaissance benchmarks with OpenJDK 11.