Heap Size Adjustment with CPU Control

This paper explores automatic heap sizing where developers let the frequency of GC, expressed as a target overhead of the application's CPU utilisation, control the size of the heap, as opposed to the other way around. Given enough headroom and spare CPU, a concurrent garbage collector should be able to keep up with the application's allocation rate, and neither the frequency nor duration of GC should impact throughput and latency. Because of the inverse relationship between the time spent performing garbage collection and the minimal size of the heap, this enables trading memory for computation (and conversely) in a way that is neutral to an application's performance. We describe our proposal for automatically adjusting the size of a program's heap based on the CPU overhead of GC. We show how our idea can be integrated with relative ease into ZGC, a concurrent collector in OpenJDK, and study the impact of our approach on memory requirements, throughput, latency, and energy.


Introduction
Garbage collection (GC) offers significant benefits to applications and developers. By abstracting away memory management, code is not tied to a specific strategy for managing heap memory, allowing programs to switch easily between different GC implementations with different properties, e.g., by a command-line argument.
Managed programming languages use various approaches for controlling an application's footprint. Some languages include strategies that automatically reduce the heap size based on memory usage or other metrics. Although the programmer can influence this behavior to some extent by ensuring that objects become garbage, the system may not detect it immediately. Other languages allow the programmer to set an upper bound for the heap size and then manage it relative to that limit. Regardless of how language runtimes manage memory, collecting memory inherently impacts performance in an indirect and hard-to-predict manner.
In the OpenJDK HotSpot JVM (OpenJDK for short), a maximum heap size is set on startup, either to a user-defined value using the -Xmx command-line flag or, in its absence, by picking a default value based on the available memory of the machine (at the time of writing, OpenJDK sets Xmx to 25% of the machine's RAM). This decision is made before the program is started, so unless care is taken to explicitly control the maximum heap size, simple programs and complex enterprise applications will share the same memory constraints.
The size of the heap affects performance differently depending on the GC algorithm. Stop-the-world GCs typically optimize for high throughput and are only able to deliver low latency if the working sets are small enough. In contrast, concurrent collectors are typically not able to achieve as high throughput, but by allowing program activities to continue while GC is running, the frequency or duration of GC has very little impact on a program's performance, as long as the GC is able to collect memory at the same rate as the application is allocating. When memory is abundant or the allocation rate is low, infrequent GC can materialize as worse spatial locality in both types of collectors, which may negatively affect performance and/or latency [35].
Table 1. Smallest heap sizes in MB without any allocation stalls for multiple benchmarks across multiple machines. Machines are listed in ascending order based on the number of cores and memory capacity. The lower half of the table displays architectural details for each machine. Machine number 3 is listed three times, indicating configurations with 8, 16, and 24 cores, where the core count was controlled using the taskset command. Note that machine #1 could not run Batik without experiencing stalls due to a combination of insufficient memory and inadequate hardware resources for running GC effectively.

A common approach to picking a heap size for an application is trial-and-error: run the program multiple times with representative load across different JVM instances with varying maximum heap limits and measure its performance until a suitable heap size is identified. A heap size may not be portable across machines and may have to be reevaluated after changes are made to the software or after a switch to a new JVM. This approach may be time-consuming and may not account for variations in the program's memory use during execution. If the maximum heap size is invariant throughout the entire program duration (as in OpenJDK), the entire heap may be used for allocating objects even when memory pressure is low(er), which defers GC and may not be optimal for a program's performance. In conclusion, determining an appropriate heap size for a given application is a complex task that necessitates consideration of various factors. These factors include the hardware configuration of the machine running the program and software-related details such as memory usage patterns.

Automatic heap size adjustment aims to free developers from the need to manually set a heap size, which has proven to be complicated. Instead, developers will be given a sensible default parameter for effective resource management, which should also be intuitive to change. In this work, we explore automatic heap size adjustment in the context of concurrent collectors, where the heap size is controlled by how often we trigger GC, instead of the other way around. As a result, developers can launch Java applications (servers, GUI programs, command line tools, etc.) without having to worry about estimating their memory requirements or worrying that Java's default values might result in these processes ballooning to impractical proportions, thereby disrupting other programs or affecting the application's performance negatively. Our proposal distinguishes itself from previous proposals for automatically adjusting the heap size (e.g., Bruno et al. [6], Grzegorczyk et al. [15], Yang et al. [37], and White et al. [34], cf. §7) by utilising a different "tuning knob" for concurrent collectors. Instead of letting developers control performance through an upper bound on the heap size, we let developers control how much CPU they are willing to spend on GC, expressed as a proportion of the CPU usage of the application. Our strategy is thus more directly tied to performance than heap size, and the heap size becomes a consequence of the GC CPU overhead budget (we call it GC target henceforth). As a result of our choice of tuning knob, the job of picking a reasonable default is easier (or dare we say possible!) than picking a default maximum heap size.
Our contributions can be summarized as:

• Highlighting heap size variability: We reveal significant variability of heap sizes across different benchmarks and hardware configurations, emphasizing the impracticality of a one-size-fits-all default heap size. This finding underscores the need for more adaptive approaches. (§2)
• Exploring automatic heap sizing in a concurrent collector: We target concurrent collectors whose CPU-intensive activities are not on the critical path of the program's performance. This allows us to dynamically change the heap size to match the program's current behavior and allows developers to trade CPU for memory (and conversely) with minimal impact on performance. (§3)
• Application in ZGC: We specifically showcase the implementation and application of our proposed technique on ZGC, a fully concurrent garbage collector. By doing so, we demonstrate its feasibility in a real-world example. (§4)
• Performance evaluation: We conduct a comprehensive evaluation demonstrating that adopting our heap size adjustment does not compromise performance or introduce latency issues. Furthermore, we establish that it is possible to determine a sensible default value for CPU overhead that can effectively cater to a variety of applications. (§6)
• Energy efficiency considerations: In addition to performance optimization, we illustrate how the concept of CPU overhead can be harnessed as a powerful tool for adjusting the energy spent by an application. (§6)

The Perils of Manual Heap Size Picking
When it comes to finding heap sizes, people use multiple rules of thumb, for example, setting heap limits to some multiple of the live set [17]. The challenge becomes even more intricate when considering multiple applications running simultaneously. Kirisame et al. [24] introduced a framework to compare different practices people use for setting up a heap size and derived an optimal "square-root" heap limit rule, which minimizes total memory usage for all applications running together. However, it is still a static heap limit, which might not be optimal on a machine with another architecture, as we demonstrate below.
To study the challenges of manually picking a heap size, we conducted an experiment across multiple machines to find heap sizes for a number of benchmarks. Table 1 shows heap sizes for 4 benchmarks from the DaCapo suite running with the ZGC collector across a range of different machines. Following best practices for tuning heap sizes for concurrent collectors, we tried to find the smallest heap size, expressed as a power of two, that does not produce an allocation stall, a relocation stall, or an OOM (Out of Memory) error for each benchmark on each machine. Ensuring the absence of stalls is of paramount importance when utilizing fully concurrent collectors, given their low-latency nature. Stalls not only lead to performance degradation, as GC becomes critical, but they also undermine the predictability of GC, thereby posing a risk to meeting service-level agreements (SLAs) and latency requirements. Maintaining a consistent and predictable latency profile is essential to uphold performance standards and guarantee uninterrupted service delivery. The reason why we limit ourselves to powers of two is twofold: first, developers have a preference for selecting heap sizes in powers of two [11], and second, finding a stall-free heap size in a reasonable time requires increasing the heap in some increments. In our case, we started at 16MB, doubled the heap size on a stall, and continued our search at the higher heap size. Nevertheless, due to Java's inherent variance, we adopt a stability-oriented approach in which we consider a heap size to be a successful candidate only if three consecutive runs with the same heap size yield no stalls or OOM errors. If stalls or OOM errors do occur, we increment the heap size and repeat the evaluation process.
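The search procedure above can be sketched as follows. This is our own illustration, not the harness we actually used; `StallProbe` is a hypothetical stand-in for launching a JVM with a given `-Xmx` and checking its GC log for stalls or OOM errors.

```cpp
#include <cstddef>

// Hypothetical probe: run the benchmark with the given heap size (in MB)
// and report whether any allocation/relocation stall or OOM occurred.
using StallProbe = bool (*)(std::size_t heapMB);

// Search for the smallest power-of-two heap size (starting at 16 MB) for
// which three consecutive runs are stall-free, doubling on any failure.
std::size_t findStallFreeHeap(StallProbe stalled, std::size_t maxMB) {
    for (std::size_t heapMB = 16; heapMB <= maxMB; heapMB *= 2) {
        bool ok = true;
        for (int run = 0; run < 3 && ok; ++run)
            ok = !stalled(heapMB);
        if (ok) return heapMB;   // three consecutive clean runs
    }
    return 0;                    // no stall-free size found up to maxMB
}
```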
As is clear from this experiment, heap sizes vary between the machines without a discernible pattern, such as being a function of the number of cores. In addition, we experimented with the same machine, tagged as #3 in the table, with different numbers of cores controlled by taskset: 8, 16, and 24. For Tomcat and Spring, the heap size changes by 4×. So, even within the same machine, modifying core configurations can often require substantial changes in heap sizes. Thus, application deployment across different configurations requires heap sizing to be repeated for each configuration.

Heap Size Adjustment with CPU Control
Similar to [34], but in a concurrent setting, we explore an approach where developers directly control memory by setting a GC target dictating how much CPU should be spent on GC, expressed as a percentage of the total CPU utilization of the program.Our insight is that memory and GC CPU utilization are inversely correlated.Let's consider a program with a constant allocation rate.When the heap is large, GC occurs infrequently, resulting in low CPU time spent doing GC; conversely, when the heap is small, GC occurs more often, causing a corresponding increase in the CPU time spent doing GC.In a concurrent garbage collector, this kind of trading memory for CPU, or the other way around, should be largely (at least ideally) orthogonal to the program's performance since the program will not block on GC.Furthermore, the program's CPU usage can be considered a proxy for its allocation rate and, by extension, its need for GC.By expressing the GC target in terms of the program's CPU usage, increased program activity immediately translates to increased CPU headroom for GC in absolute numbers.Understanding and controlling the scalability and CPU utilization of a program is a more direct task compared to comprehending its live set, which encompasses all objects contributing to memory pressure.
We define the GC overhead (henceforth denoted o) as the ratio of time spent doing GC (henceforth t_GC) to time spent in the entire application (henceforth t_APP):

    o = t_GC / t_APP

These time measurements are the main inputs to our algorithm for determining the new heap size. To mitigate fluctuations, t_GC should be calculated using average times for the last n collections (in our implementation, we pick n = 3). For instance, if one GC cycle has high CPU activity when the previous cycles did not, it might be too hasty to change the heap size. Thus, the heap size varies "slowly", preventing committing memory that is not needed in the long run.

Figure 1. Memory usage of vanilla (unmodified, by default uses 25% of the available RAM) ZGC (22 cycles) and ZGC with 1% (856 cycles), 2% (1506 cycles), and 5% (3182 cycles) GC CPU overhead limits. For this run, we used 12 application threads on a 16-core machine, leaving a 4-core headroom. For each, we measure the following: maximum heap size, memory usage before GC, and memory usage after GC. Note that the y-axis for vanilla ZGC is two orders of magnitude higher. The differences in the x-axes demonstrate the impact of GC on throughput: an artifact of the current ZGC design is that each GC cycle forces mutators to take a slow path in the load barrier the first time each reference is loaded. Thus, very frequent GC (i.e., 5%) can materialize as a throughput regression.
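The overhead computation with the n = 3 smoothing window can be sketched as follows (the struct and its names are ours, not ZGC's):

```cpp
#include <cstddef>
#include <deque>
#include <numeric>

// Sliding window over the GC times of the last n collections (n = 3 in the
// prototype). The overhead o is the smoothed GC time divided by the
// application time measured between collections.
struct OverheadTracker {
    std::deque<double> gcTimes;   // seconds spent in recent GC cycles
    std::size_t window = 3;

    void recordGcTime(double t) {
        gcTimes.push_back(t);
        if (gcTimes.size() > window) gcTimes.pop_front();
    }

    double smoothedGcTime() const {
        return std::accumulate(gcTimes.begin(), gcTimes.end(), 0.0) /
               gcTimes.size();
    }

    // o = t_GC / t_APP
    double overhead(double appTime) const {
        return smoothedGcTime() / appTime;
    }
};
```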
The core idea of our proposal is to iteratively adjust the heap size until the GC overhead, i.e., o, meets the target set by the developer, Target_o. Note that the value is a target, not an upper bound. Thus, if o > Target_o, we increase the heap size to lower the GC frequency and thereby lower the GC CPU overhead. Conversely, when o < Target_o, we decrease the heap size to trigger more collections, to increase the GC CPU overhead.
To showcase the impact of the target GC CPU overhead, we run the Xalan benchmark from the DaCapo suite. It was run four times on machine #3a from Table 1: with vanilla ZGC, and with GC targets of 1%, 2%, and 5%. By default, vanilla ZGC uses a high maximum heap size (25% of RAM) that is significantly reduced when higher GC target values are used. Figure 1 depicts the results of these runs.
In our approach, during periods of lower CPU activity in the application, the collector will work less, as its budget is proportional to the application's CPU usage. This results in fewer allocations and overall less pressure on both the allocator and memory manager. Conversely, spikes in the application's activity translate into a higher CPU budget for the GC threads. While it may seem logical to run GC during low CPU activity to utilize available CPU resources, the effectiveness may be limited if there is less memory to free.
At the end of each GC cycle, we compare the GC CPU overhead to the user-defined GC target to calculate overhead_error, which we use to adjust the heap size:

    overhead_error = o − Target_o

We aim to prevent sudden and sharp heap size changes. Therefore, in addition to smoothing out fluctuations in t_GC by considering the average over the last three collections, we avoid using overhead_error directly to modify the heap size, as large error numbers can cause fluctuations in the heap sizes. To mitigate this, we pass the overhead_error through the Sigmoid function [16] to smoothen changes in heap sizes. The Sigmoid function is a mathematical function that is commonly used to model non-linear relationships between variables in statistical models. It maps input values to a range between 0 and 1. Thus, using the Sigmoid function prevents aggressive changes in the heap size. We pass the overhead_error to the Sigmoid function to calculate the "Sigmoid overhead error":

    S(overhead_error) = 1 / (1 + e^(−overhead_error))

We use this result to calculate an adjustment factor that limits the changes to the heap size to within a range of 0.5 to 1.5:

    adjustment_factor = S(overhead_error) + 0.5

When overhead_error is zero, i.e., the actual GC CPU overhead equals the GC target, the Sigmoid function returns 0.5. Therefore, the adjustment_factor becomes 1 and the heap size remains unchanged.
S(overhead_error) > 0.5 means that the actual o has exceeded Target_o, so the adjustment_factor is greater than 1 and will increase the heap size, leading to fewer GC cycles. When the actual o is below Target_o, S(overhead_error) < 0.5, i.e., adjustment_factor < 1, which will decrease the heap size. The heap size will never change by more than 50% of the current size (in any direction). Finally, we compute the new heap size as follows:

    new_heap_size = current_heap_size × adjustment_factor

Our approach can be used in combination with an upper bound on the heap size, e.g., Xmx, to trigger an OOM error. However, setting this upper limit may prevent the application from reaching the target GC CPU utilization rate (Target_o). If an upper limit is not specified, the system sets it to a default value, which should be close to the maximum memory available on the machine, but not 100%, to prevent system instability and swapping. (Recall that the previous default was 25% of the machine's RAM.)

Prototype Implementation in ZGC
Adjusting the heap size based on GC CPU overhead is suitable for concurrent GCs that do not interfere with the application's critical path. In this section, we implement a prototype on ZGC, a concurrent collector in OpenJDK, to demonstrate the effectiveness of this approach. The prototype follows the ideas presented in the previous section.

Background on ZGC
The Z Garbage Collector (ZGC) [22] is designed for low latency, offering sub-millisecond pause times invariant of the heap size. GC activity in ZGC occurs concurrently with "mutators" (application threads) by relying on barriers that trap object accesses and coordinate accesses to objects from mutators and GC worker threads. A barrier is essentially some additional logic triggered when a reference is written to a field (in the case of generational ZGC) or when a reference is loaded from a field and placed on the stack. The barrier logic branches on metadata bits embedded in pointers [21]. For example, in the case of a load barrier, if the metadata shows that the pointer is valid, we enter the fast path, in which the overhead of the barrier is simply shifting off the metadata bits from the address. Otherwise, we enter the slow path, where we ensure that the pointer is valid by looking up the new canonical address of the object in a forwarding table. This last step may involve copying the object elsewhere and writing to the forwarding table ourselves. ZGC is a multiphase collector with separate mark and evacuation phases. Its overall design was described by Yang and Wrigstad [36].
Single-generation ZGC uses load barriers to synchronize GC activities with mutators. Generational ZGC [23] uses write barriers in addition to load barriers. It maintains a remembered set of references from the old generation to young objects, serving as additional roots during GC in the young generation only. Such a design favors generational workloads where objects are more likely to die young (following the weak generational hypothesis [25]) by supporting a more aggressive collection of the young generation without having to do repeated work on long-living objects. From a resource perspective, generational ZGC requires less CPU and memory than single-generation ZGC. In this paper, when we refer to ZGC, we are specifically discussing the generational version of ZGC.
Memory in OpenJDK and ZGC. In addition to Xmx, ZGC introduced a new JVM option in OpenJDK 13 called "soft max heap size", subsequently adopted by G1. The soft max heap size is a limit on the size of the heap beyond which ZGC strives not to grow. Unlike Xmx, exceeding the soft max heap size will not result in an OOM error (unless the limit is equal to Xmx). When approaching the soft max heap size, ZGC triggers GC to bring the heap size below it. If it fails to do so, it will grow the heap instead of going into an allocation stall. The soft max heap size thus serves as a guiding parameter for GC to balance heap size and allocation rate and has a direct impact on GC activity and frequency. If the value is too small, ZGC might end up doing back-to-back collections. If the value is too large, it can lead to inflated memory costs, floating garbage, heap fragmentation, and poor spatial locality, especially under a low allocation rate.
The relations between different memory parameters in ZGC are shown in Fig. 2. Used memory refers to the memory occupied by both live and dead objects (that have not yet been collected). Maximum capacity or committed memory represents the amount of memory requested by OpenJDK from the operating system, which is always higher than used memory. In practice, committed memory is often significantly higher: bursts of allocation immediately drive the committed memory up, and to avoid requesting memory from the OS, which may cause delay or, worse, fail, OpenJDK will not return committed memory unless several minutes have elapsed since it was needed (lower bounded by Xms, the flag used to set the minimum and initial heap sizes).

ZGC Heuristics. Heuristics control when to start a GC cycle to avoid running OOM and also how many threads to use for each cycle. In addition, a GC may also be triggered for other reasons such as a high allocation rate, high heap usage, or if no collection has been triggered for 5 minutes. Collecting the old generation can also be performed occasionally if not triggered by other reasons. These heuristics consider the available free memory and the time remaining before an OOM error occurs based on the average allocation rate and unforeseen circumstances. To determine the number of GC workers required to prevent OOM, ZGC analyzes the duration of previous GC cycles and adjusts the worker count according to hardware limitations. Finally, ZGC predicts the duration of the next GC cycle based on the number of GC workers and calculates the start time for the next cycle.
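The timing side of these heuristics can be approximated by a simple model: estimate when free memory runs out at the current allocation rate, and start GC early enough that the predicted cycle finishes before that point. This is our own simplification with hypothetical names, not ZGC's actual heuristics code:

```cpp
#include <algorithm>

// Seconds until free memory is exhausted at the current allocation rate.
double secondsUntilOom(double freeBytes, double allocRatePerSec) {
    return freeBytes / allocRatePerSec;
}

// Seconds until GC should start: leave room for the predicted cycle
// duration plus a safety margin for unforeseen allocation bursts.
double secondsUntilGcStart(double freeBytes, double allocRatePerSec,
                           double predictedGcSeconds, double safetyMargin) {
    double deadline = secondsUntilOom(freeBytes, allocRatePerSec);
    return std::max(0.0, deadline - predictedGcSeconds - safetyMargin);
}
```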

Heap Size Adjustment with CPU Control in ZGC
We take advantage of the aforementioned soft max heap size limit as it has the characteristics we require: it triggers GC but does not stall. To prevent unintentionally exceeding the machine's memory capacity (e.g., due to a too low target), we set Xmx to 80% of the available RAM [3] (unless the user has explicitly set Xmx). This ensures that the adaptive heap size remains within an upper limit.
Thus, the heap size in our prototype implementation is ZGC's soft max heap size, and our technique ultimately results in adjusting the soft max heap up and down at the end of each GC cycle to meet the GC CPU overhead target set by the programmer. The amount of memory committed from the OS by OpenJDK is limited (as usual) by Xmx and will only grow in tandem with the soft max heap.

[3] This number reflects a pragmatic choice motivated by wanting to keep some spare memory for remaining programs running on the machine and also to leave space for ZGC's forwarding tables, which are allocated off-heap and may grow very large under certain circumstances [26].

Figure 3 (caption excerpt): ...and the time spent in mutators (green). Gray lines denote time measured at the end of a GC cycle. We only include the time when mutators were scheduled, meaning t_APP = 2.5 + 2 + 2.5 = 7. In the case of t_GC, we measure from the start to the finish of the GC cycle. Thus, t_GC = 3 × 1.5 = 4.5, even though the 2nd GC thread was not scheduled after t = 7. Thus, o = 4.5/7 ≈ 64%. (This example omits barriers; read more about them in §4.3.)

Obtaining t_GC
We calculate t_GC as the sum of the time spent on young collections (t_young) and old collections (t_old) plus an estimate of the time mutators spent in the slow path of barriers (t_barrier). For simplicity and to avoid adding logic contributing to GC overhead, we use existing telemetry in ZGC. Thus, t_young and t_old are wall-clock time measurements. For traceability, we prefix wall-clock time measurements with "wall" and CPU time measurements with "cpu" below. Thus, we will henceforth write wall_t_GC instead of t_GC to highlight that the time measurement is a wall-clock time. To address potential inaccuracies in individual measurements, we calculate wall_t_young and wall_t_old using the average times for the last 3 collections (as described in §3). For uniformity, we use a single formula to describe wall_t_GC and, in a minor collection, set wall_t_old to 0:

    wall_t_GC = wall_t_young + wall_t_old + wall_t_barrier
As already mentioned, wall_t_barrier is the mutator time spent in the slow paths of barriers. When mutators hit slow paths in barriers, they do GC work, either remapping an old address to a forwarding address or performing relocation. We measure the wall-clock time of barriers using sampling: we record the time once for every 1024 slow paths taken, calculate the average time spent in slow paths, and multiply that by the number of slow paths taken. ZGC calculates GC time separately for each generation by adding the times for the serial and parallel work. The serial time is the wall-clock time spent on non-parallel tasks, like relocation set selection after marking, while the parallel time is the sum of the wall-clock time spent by worker threads on parallelizable tasks. For the young generation:

    wall_t_young = wall_t_serial_young + wall_t_parallel_young
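The sampled slow-path accounting can be sketched as follows (a minimal model with our own names; the real barrier code is inlined and far more subtle):

```cpp
#include <cstdint>

// Sampled accounting of slow-path barrier time: time only every 1024th
// slow path, then extrapolate total = average sampled time x total count.
struct BarrierSampler {
    std::uint64_t slowPaths = 0;   // all slow paths taken
    std::uint64_t samples   = 0;   // slow paths we actually timed
    double sampledSeconds   = 0.0; // sum of timed slow-path durations

    // Called on every slow path; true once per 1024 slow paths.
    bool shouldSample() { return (slowPaths++ % 1024) == 0; }

    void recordSample(double seconds) {
        ++samples;
        sampledSeconds += seconds;
    }

    // Estimated total wall-clock time spent in slow paths (t_barrier).
    double estimatedSeconds() const {
        if (samples == 0) return 0.0;
        double avg = sampledSeconds / samples;
        return avg * static_cast<double>(slowPaths);
    }
};
```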
Similarly, for activity in the old generation:

    wall_t_old = wall_t_serial_old + wall_t_parallel_old

Obtaining t_APP

The application's time is the sum of the scheduled time of all threads spawned by the process (i.e., a CPU time measurement) between two collections in the same generation. Thus, we write cpu_t_APP henceforth to clarify the nature of t_APP in our implementation. Similarly to GC time, we reuse existing GC telemetry to capture application time to avoid additional measurement overheads. Application time is obtained by measuring CPU time (see Figure 3 for an overview). Listing 1 shows the code for measuring the CPU time of the process.
Listing 1. Code that calculates the process CPU time at the moment of the call. The function clock_gettime measures the CPU time consumed by a process, meaning that it includes the CPU time consumed by all threads in the process, including application threads, GC threads, compiler threads, etc. To measure the CPU time between two moments in time, we cache the last result and subtract it from the result of the subsequent call.
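A minimal sketch of the measurement Listing 1 describes, assuming POSIX clock_gettime with the per-process CPU clock on Linux (the function names here are ours, and the actual listing's details may differ):

```cpp
#include <ctime>

// Process CPU time in seconds, covering all threads in the process
// (application, GC, JIT compiler, ...), via the POSIX per-process clock.
static double processCpuSeconds() {
    timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

// CPU time consumed since the previous call, by caching the last reading
// and subtracting it from the current one.
static double processCpuDeltaSeconds() {
    static double last = 0.0;
    double now = processCpuSeconds();
    double delta = now - last;
    last = now;
    return delta;
}
```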

Calculating a Suggested Heap Size
Most of our modifications to ZGC are located in its heap sizing mechanism: class ZAdaptiveHeap. The main logic is captured in the method ZAdaptiveHeap::adapt (see Listing 2), which performs the calculations outlined in Sections 3 to 4. For clarity, we add comments with the labels from the equations to aid in mapping the C++ code to the descriptions above. The method is called at the end of each GC activity in both major (young + old) and minor (only young) collections.
Listing 2. The modified adapt method that recalculates heap limits in ZGC. (For simplicity, we only show the logic for major GC and remove one lock to reduce clutter.)

We establish a lower bound for the suggested heap size by using the amount of used memory. This is because we aim to avoid triggering GC more often than necessary. If we set the suggested heap size below the used value, we risk triggering GC when there are no objects to clean. Although concurrent GC does not interfere with the application's critical path (as it runs concurrently and does not force the application to stop), and therefore might have a negligible impact on performance or latency, the additional GC work can have a negative impact on energy consumption.
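The logic of the adapt step, combining the equations from Section 3 with the used-memory lower bound and the Xmx upper bound, can be sketched as follows. This is our own simplified model with hypothetical names, not the actual ZAdaptiveHeap code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// One adapt() step: recompute the soft max heap from the measured GC
// overhead (o), the target (Target_o), the currently used memory (lower
// bound), and Xmx (upper bound).
std::uint64_t adaptSoftMaxHeap(std::uint64_t currentSoftMax,
                               std::uint64_t usedBytes,
                               std::uint64_t xmxBytes,
                               double gcOverhead,        // o
                               double targetOverhead) {  // Target_o
    double error  = gcOverhead - targetOverhead;          // overhead_error
    double s      = 1.0 / (1.0 + std::exp(-error));       // Sigmoid
    double factor = s + 0.5;                              // in (0.5, 1.5)
    auto suggested =
        static_cast<std::uint64_t>(currentSoftMax * factor);
    suggested = std::max(suggested, usedBytes); // never below used memory
    return std::min(suggested, xmxBytes);       // never above Xmx
}
```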

Initial and Adapted Heap Sizes
We set the initial heap size, in terms of the soft limit, to 16 MB. This is an unlikely heap size for most programs and will trigger GC as the limit is approached or exceeded, which will cause GC to adapt the heap size and (most likely) increase the soft limit (by at most 50% each time). Fig. 1 shows the frequent increases of the soft limit in green. (The top-left sub-figure shows vanilla ZGC, where the soft limit is equal to Xmx and never exceeded.) If the soft limit is not exceeded and the GC overhead is below the target, we will decrease the soft limit to trigger GC more often. This is clearly visible in the two bottom sub-figures of Fig. 1.

Evaluation
In this section, we answer the following question: how effective is our automated heap sizing strategy, based on CPU usage as a tuning knob, compared to vanilla ZGC, which relies on setting a maximum heap size? We first explain our experimental setup and benchmarking methodology.

Hardware and Software
We evaluate our work by comparing our modified ZGC with its unmodified base, also referred to as vanilla (generational ZGC in OpenJDK version 21). We used an Intel Xeon Sandy Bridge EN/EP server machine (machine #5 in Table 1) running Oracle Linux Server 8.4. The machine has 32 identical CPUs, which we configured as a single NUMA node to avoid NUMA effects. The CPU model is Intel® Xeon® CPU E5-2680 with 64KB L1 cache, 256KB L2 cache, a shared 20MB L3 cache, and 30GB RAM. This configuration allows us to obtain energy consumption statistics.

Benchmarks
We use the DaCapo benchmark suite (Chopin branch), which includes a variety of microbenchmarks and real-world applications that stress the JVM and the garbage collector. The suite includes several latency-sensitive applications that require low-latency response times. These benchmarks measure metered latency, including request serving time, queuing delays, and interruptions like GC. By using these benchmarks, GC performance can be evaluated in terms of both throughput and responsiveness. We excluded the benchmarks Kafka and JME due to a low CPU utilization issue, as well as Lusearch due to high CPU utilization variability, making it hard to draw any meaningful conclusions (noted by the benchmark maintainers). We also excluded H2 due to a reproducible memory leak across multiple machines and garbage collectors. When referring to DaCapo in this paper, we specifically mean the DaCapo Chopin benchmark suite. We included all throughput-oriented benchmarks except Cassandra, which is incompatible since OpenJDK 16.
In order to obtain a more comprehensive understanding of our prototype, we also include the Hazelcast benchmark [13]. Hazelcast was chosen since most of the latency-sensitive workloads in DaCapo were excluded for the aforementioned reasons. As low latency is the main goal of a concurrent collector, we wanted to study more such workloads. Hazelcast is designed to provide distributed and scalable in-memory data storage and processing, which can help reduce data access and processing latency.
DaCapo. We use a commit (number 300acaa7) that includes latency-oriented benchmarks, as evaluating latency is crucial for fully concurrent collectors. We conducted the benchmarks using the large size for all applicable tests. For the remaining benchmarks (Fop, Zxing, Xalan), we used the default size.
Hazelcast. Hazelcast performs real-time stream processing. We used all the suggested configuration parameters [31]. It has a fixed workload, set by its key-set size. We experiment with multiple key-set sizes: 400 000, 250 000, and 100 000. Thus, we report the results of these 3 different configurations.

Benchmarking Methodology
We run each benchmark using 5 JVM instances, which lets us identify performance anomalies and outliers that might not be discernible using a single JVM instance. Note that the variation between JVM instances is within the variation between the last 5 stable iterations of a single JVM. Inside each JVM instance, each benchmark repeats multiple iterations (the count varies across benchmarks, chosen so that the coefficient of variation (CV) of the last 5 iterations is below 5% with respect to execution time), which is necessary to avoid impact from warmup and JIT compilation. Notably, our approach takes time to adjust from the initial heap size before stabilizing around a GC target.
Once we reached a steady state, we calculated the arithmetic mean of the last 5 iterations to remove noise from the environment. However, in cases where a steady state could not be reached, we used all recorded values for the last 5 iterations per JVM instead of computing the arithmetic mean. This is because taking a mean could hide outliers, and we do not know the shape of the data distribution.
In summary, we compute either one arithmetic mean per JVM instance (resulting in 5 data points in the final set) or all values from each JVM (resulting in 25 data points in the final set). We use the same approach for both adaptive ZGC and vanilla ZGC, and to compare them, we perform statistical analysis on the final data sets. The final results reported in Table 2 were calculated using an arithmetic mean of the final set.
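The aggregation described above can be sketched as follows. The input `per_jvm_last5` is a list of the last five iteration times from each of the five JVM instances; the function and variable names are ours, not from the paper's harness:

```python
import statistics

def cv(xs):
    """Coefficient of variation: standard deviation relative to the mean."""
    return statistics.stdev(xs) / statistics.mean(xs)

def aggregate(per_jvm_last5, cv_threshold=0.05):
    """One mean per JVM when every instance reached a steady state
    (CV of its last 5 iterations below the threshold); otherwise keep
    every raw value so outliers remain visible in the final data set."""
    if all(cv(run) < cv_threshold for run in per_jvm_last5):
        return [statistics.mean(run) for run in per_jvm_last5]
    return [x for run in per_jvm_last5 for x in run]
```

With 5 JVM instances this yields either 5 means or 25 raw values, matching the two shapes of the final data sets described above.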

Statistical Analysis
We used different tests to verify the validity and reliability of the results, performing statistical analysis on the final data sets to draw our conclusions. We employed Welch's t-test [33], Grubbs' outlier test [14], and Yuen's t-test [38] to determine whether the differences between the means of the compared results from vanilla and adaptive ZGC are statistically significant. Welch's t-test and Yuen's t-test are particularly useful in cases where we cannot make assumptions about the shape of the data distribution and the variances of the compared groups are not equal. We believe these tests are safer to use than relying on a non-verifiable assumption about the normality of our data distribution.
We used Grubbs' outlier test to check whether a data set has statistical outliers. If so, we use Yuen's t-test instead of Welch's t-test. Yuen's test involves trimming a fixed proportion of the extreme values from each data set (we used 10 %) to reduce the influence of outliers. To determine whether the results exhibit significant differences, we used the p-value obtained from the applicable t-test (with a significance level of 0.05). If the resulting p-value is greater than 0.05, we conclude that the data sets do not exhibit significant differences.
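A minimal sketch of the statistics involved, using only the standard library. Note this shows only Welch's t statistic and the trimming step of Yuen's procedure; a full Yuen test also uses winsorized variances, and converting t to a p-value requires the t-distribution's CDF (e.g. from scipy), both omitted here:

```python
import statistics

def welch_t(a, b):
    """Welch's t statistic: compares two means without assuming
    equal variances between the groups."""
    va, vb = statistics.variance(a), statistics.variance(b)
    num = statistics.mean(a) - statistics.mean(b)
    return num / (va / len(a) + vb / len(b)) ** 0.5

def trim(xs, proportion=0.10):
    """Drop the most extreme `proportion` of values from each tail,
    as in the trimming step of Yuen's test (the paper uses 10 %)."""
    k = int(len(xs) * proportion)
    s = sorted(xs)
    return s[k:len(s) - k] if k else s
```

The decision rule above is then: run the outlier check first, and feed either the raw or the trimmed samples into the t statistic.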
To help provide an overview, we color-code the results where statistical significance was found. Red means the adaptive approach is worse than vanilla ZGC; green means the opposite. White indicates the results are statistically the same. We also highlight a bigger than 5 % negative impact of our approach with a darker shade of red (Table 2, Table 3).

Energy Measurements
Energy consumption was measured using the Running Average Power Limit (RAPL) [19] interface available on recent Intel architectures. This interface allows machine-specific registers (MSRs) to be read, which contain energy scores. To calculate the final energy score, we report the sum of the package and DRAM domains, following the method used by Shimchenko et al. [30]. Our approach for measuring energy consumption is similar to that used for measuring throughput and latency. For DaCapo benchmarks, warmup iterations were excluded, and statistics were aggregated across 5 JVM instances for the last 5 iterations in each run. For the Hazelcast benchmark, we report energy consumption for the entire run, as it is a longer-running benchmark where the warmup period is a small fraction of the total run time.
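On Linux, RAPL counters are also exposed through the powercap sysfs interface; a sketch of reading a counter and handling its wraparound follows. The sysfs path shown is a typical example, not taken from the paper's setup:

```python
def read_energy_uj(path):
    """Read a RAPL energy counter in microjoules from powercap sysfs,
    e.g. /sys/class/powercap/intel-rapl:0/energy_uj for the package
    domain. Reading typically requires elevated permissions."""
    with open(path) as f:
        return int(f.read())

def energy_delta_uj(start, end, max_range_uj):
    """RAPL counters are modular and wrap at max_energy_range_uj;
    this accounts for at most one wrap between the two samples."""
    if end >= start:
        return end - start
    return max_range_uj - start + end
```

Per the methodology above, the final score would be the sum of such deltas for the package and DRAM domains.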

Baseline Heap Sizes
If the Xmx option is not specified by the user when starting the JVM, the JVM will default the maximum heap size to 25 % of the physical memory available on the system. Research papers that involve measurements across multiple garbage collectors use other collectors like G1 [7] or Serial GC to pick the minimum heap size [29] and then employ a scaling factor to provide additional headroom (additional memory space) for other collectors. However, it is not at all clear whether such an approach reflects the actual heap sizes chosen by developers for production systems. For example, developers often tend to choose heap sizes that are powers of two [11].
Proper configuration of a concurrent collector should avoid allocation stalls, as these introduce jitter and hurt latency. Thus, we adopted a manual heap size adjustment strategy for our baseline (vanilla ZGC), where we pick the smallest power-of-two heap size with which the application runs reliably without stalling. We use this value for each benchmark as Xmx for the baseline configuration in vanilla ZGC; we also explicitly set Xms to 16 MB, which is the same as its default value according to the ZGC codebase. Finally, we had to pick heap sizes manually as there is no "best option": we pick baseline values not to "beat" an alternative but to explain the behaviour of our system.
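The baseline selection reduces to rounding the minimum stall-free size up to a power of two; a trivial sketch, where the 16 MB floor mirrors the Xms default mentioned above and the function name is ours:

```python
def baseline_xmx_mb(min_stall_free_mb):
    """Smallest power-of-two heap size (in MB) that is at least the
    minimum stall-free size observed for a benchmark."""
    size = 16  # Xms default in the ZGC codebase
    while size < min_stall_free_mb:
        size *= 2
    return size
```

For example, a benchmark that needs at least 300 MB to run without stalling gets a 512 MB baseline Xmx.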

GC Targets
To investigate the implications of our proposed design, we studied the impact of GC targets on latency and throughput using varying percentages of GC CPU target overhead. Specifically, we examined the following GC targets: 5 % (to a limited extent, using 3 JVM instances), 10 %, 15 %, and 20 %. To assess whether the picked list of GC targets is representative, we measured the actual GC CPU overheads without heap size adjustment for the memory sizes picked according to Table 2. Looking at Table 2's GC CPU overheads, the actual overheads vary widely, from less than 1 % (Sunflow) to 23 % (Fop). Given that a GC target closer to the actual GC overhead might better reveal the effect of our adaptive solution, we only evaluated our strategy with a 5 % GC target for the benchmarks with actual GC overheads below 5 %. This required running additional iterations to reach a steady state, adding time to benchmarking.
Methodologically, testing very small GC targets on short-running benchmarks is challenging, since it takes time to grow the heap from the initial 16 MB to a size that sustains the required GC target. If this time exceeds the benchmark's run-time, the benchmark never reaches a steady state, rendering the results inconclusive. Very high GC targets do not represent a real deployment. That said, we believe that the picked range of GC targets is sufficient to demonstrate how our system behaves and to showcase the main trends.

Results
We now compare the performance of running vanilla ZGC with manually selected heap sizes against our adaptive technique, which leverages different values of GC CPU overhead. Throughout the experiments, we closely examined various metrics, including memory usage, execution time, latency, and energy consumption, and sought to identify the advantages and drawbacks of each approach. Additionally, we propose an optimal default value for the GC CPU overhead that strikes a balance between efficient resource utilization and overall system performance.
Prior to presenting our results, we would like to address the absence of 3 benchmarks, Batik, Jython, and Pmd, from our study. These benchmarks have actual GC overheads of 80 %, 76 %, and 170 %, respectively, when using the maximum memory available on the SandyBridge machine. Therefore, we were unable to allocate additional memory to lower the GC targets for these benchmarks. Nevertheless, our methodology remains valid, and we were able to obtain results for these benchmarks by running them with the maximum available memory on the machine; as a result, their GC CPU overhead remained similar to the actual values. Note that failing to attain a CPU target does not cause the benchmark execution to fail; the observed outcome is merely a disparity between the real CPU overhead and the requested target. If the target is set lower than the actual value and insufficient memory is available to reach it, the application will keep running without attaining the target. This situation remains unchanged unless the entire machine's memory suffices to prevent OOM issues, a scenario shared by vanilla ZGC. Conversely, when the target surpasses the real CPU overhead and reducing memory fails to raise it, this signifies an absence of substantial GC work. In either scenario, the application continues to function without interruption.
Memory Usage. Memory usage for different GC targets is presented in Table 2, normalized to vanilla ZGC with the heap size chosen as described in §5.6. Memory represents the average used memory before a GC for the last 5 stable iterations; normalizing against maximum memory usage instead yielded similar results. The results show that overall memory usage decreases if the tested GC target is higher than the default GC CPU overhead. For instance, Hazelcast_100 has, by default, a GC CPU overhead of 21 %. Therefore, the memory used grows with 5 %, 10 %, and 15 % GC targets but is on par for a 20 % GC target. The biggest observed reduction is 96 % for Sunflow with 15 % and 20 % GC targets.
Moreover, the reduction in memory usage correlates with a higher number of minor and major collections, which simply means that GC works more to keep a tighter heap. As expected, in terms of reducing memory footprint, the 20 % target leads to the smallest heap size across all the benchmarks.

Execution Time. The results show that adjusting the heap size dynamically with 15 % and 10 % GC targets had a minimal negative impact on execution time, except for Xalan and Sunflow. Moreover, Avrora, Hazelcast_400, and Fop show a reduction in execution time. For instance, Avrora showed a 3 % and 5 % improvement in execution time for 15 % and 20 % GC targets, respectively. This improvement can be attributed to the collector compacting live objects close together, improving cache locality [35] and making memory accesses easier to prefetch. However, Sunflow experienced a significant 15 % degradation in execution time. Additional profiling revealed more stalling in the instruction pipeline backend, which is often an indication of memory stalls [20]. It is possible that the 96 % memory reduction resulted in too many GC cycles, which interfered with the mutator's memory accesses. To improve our technique in the future, we will consider the cache effects of overly frequent GC cycles. From prior work, we also know that Sunflow is very sensitive to keeping allocation order during relocation, and it is possible that this order is preserved less well with such frequent GC cycles.
Energy. As per our initial hypothesis (Table 2), we expected energy changes to exhibit an opposite trend to memory: if a benchmark consumed more CPU during GC than the baseline, we anticipated a decrease in memory usage and an increase in energy consumption, because CPU usage incurs higher energy costs than DRAM [18]. As expected, the 20 % GC target yields on average worse energy results compared to the 10 % and 15 % GC targets. However, the relationship between reduced memory and increased energy is not always linear. For instance, the Graphchi benchmark with a 20 % GC target has an 82 % reduction in memory usage but only a 1 % increase in energy consumption. At a single-program granularity, opting for high GC targets has an increased energy cost. However, in a cloud setting, where CPU is typically highly overcommitted [27] and memory is the limiting factor for consolidating virtual machines and containers, significant memory reductions lead to fewer physical nodes and ultimately lower energy consumption [2].
Table 2. Execution time, memory, and energy (all three normalized), the number of minor and major collections, as well as GC CPU overheads in vanilla ZGC and adaptive ZGC for various benchmarks (BMs). Heap size (MB) (Z) is the minimum stall-free heap size for each benchmark. CPU Utilised shows the number of CPU cores used by the application (out of 32 cores available on SandyBridge). White cells show no statistical significance according to the methodology explained in §5.4. Different shades of red highlight where the adaptive approach is worse than the default; darker red indicates a CV above 5 %. We write (Z) for vanilla ZGC and (A) for our adaptive approach.

Latency. In our evaluation, latency results were available only for a subset of benchmarks, which we report in Table 3. Our adaptive approach has no negative impact on 99th-percentile latency and can even reduce it. For instance, in CPU-intensive workloads such as Tomcat and Hazelcast_400, where there is high competition for CPU resources between the collector and mutator threads, lower GC targets (i.e., 10 %) lead to using more memory and positively affect latency by allowing GC to run less frequently, thereby reducing the impact on mutator performance. While increasing Xmx could achieve a similar effect, our approach reduces latency while also decreasing memory usage. This is because, with a fixed memory limit, GC CPU overhead can vary drastically throughout execution, leading to high latency numbers. Our technique keeps the GC CPU overhead more stable, fluctuating around the chosen GC target; it prevents GC from claiming too many resources at once, allowing mutators to deliver stable low latency without frequent drops. With a 20 % GC target and the smallest heap size, Spring showed a notable increase in latency. However, it is important to note that this benchmark's average CPU utilization is less than half of the machine's capacity, but at times it spikes quite high, becoming CPU intensive. Higher GC targets in CPU-intensive
workloads can reduce latency by mitigating contention between GC and mutator threads, as explained above. However, due to the limited number of latency-oriented workloads tested and the high variance in DaCapo benchmarks, we cannot draw definitive conclusions about the positive impact of our technique on latency. Nonetheless, our findings suggest that it does not have a statistically significant negative effect.

Picking the Default GC Target. Different GC targets can yield opposite trends for different optimization goals. While the highest GC target of 20 % provided the best results for memory, energy optimization requires the lowest GC target. Meanwhile, too many or too few GC cycles can harm performance. Thus, choosing the best GC target for each program may require manual selection. However, upon examining the benchmarks as a whole, we found that a 15 % GC target achieved a 51 % memory reduction, with only a 3 % execution time degradation and a 3 % increase in energy (calculated as the geometric mean across all benchmarks, following [12]). Therefore, a 15 % GC target may be a good default choice for optimizing the trade-off between memory usage, execution time, and energy consumption.
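The cross-benchmark summaries above are geometric means of per-benchmark ratios, following [12]; as a sketch:

```python
import math

def geomean(ratios):
    """Geometric mean of normalized ratios (e.g. adaptive / vanilla),
    computed in log space for numerical stability. All ratios must be
    positive."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

For example, a 51 % memory reduction corresponds to a geometric-mean memory ratio of 0.49 across the benchmarks.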

Related Work
Language runtimes that host managed languages, such as Java, Python, and JavaScript, maintain a garbage-collected heap to manage live application objects (unreachable objects are collected by the garbage collector). Determining the heap size is challenging, as it involves a trade-off between application pause time, GC CPU usage, and memory utilization. Various heuristics have been proposed to achieve this goal by minimizing pause time, GC utilization, and memory usage.

Heap Size Adjustment Algorithms
A number of studies have been conducted for STW collectors, aiming to improve execution time [5], avoid paging [15], or both [37]. Brecht et al. [5] propose an adaptive technique to increase the heap size, aiming to reduce execution time in the STW Boehm-Demers-Weiser GC [4]. The authors suggest increasing the heap size aggressively without collecting garbage if sufficient memory is available; only when memory is scarce does GC become more frequent and the heap size stabilize. This approach prioritizes reducing the GC overhead to improve application throughput. Yang et al. [2004] introduced an analytical model to adjust the heap size in a multi-program environment. In their approach, an operating system's virtual memory manager monitors an application's memory allocation and footprint, then periodically changes the heap size to closely match the real amount of memory used by the application. The model is used to minimize GC overhead by giving the collector a sufficient heap size, but also to minimize paging by avoiding large heaps; it is offered for Appel and semi-space collectors [1]. Zhang et al. [39] propose a novel approach to memory management called Program-level Adaptive Memory Management (PAMM). PAMM uses information about the program's repetitive patterns (phases) to manage memory adaptively. The authors observe that the behavior of phase instances is quite similar and repetitive, so phases can represent the memory usage cycle of the application. PAMM monitors the program's current heap usage and the number of page faults to adjust a soft bound used as a GC threshold. When the threshold is reached, PAMM triggers GC to collect and free unused memory. They evaluate PAMM with three STW and generational collectors (Mark-Sweep, CopyMS, and GenCopy). PAMM relies on a specific phase detection algorithm, which may not be applicable to all types of programs.
Grzegorczyk et al. [15] propose the Isla Vista heap size adjustment strategy to avoid GC-induced paging. Their strategy is to grow the heap when more physical memory is available and shrink it by triggering GC when there is not enough physical memory. Thus, it trades more GC for less paging by communicating between the OS and the VM and triggering the heap size adjustment logic on relocation stalls.
Controlling the ratio of GC time to overall execution time has also been addressed within HotSpot's collectors, using -XX:GCTimeLimit. However, this strategy may not be suitable for concurrent collectors, as they are designed to operate concurrently with the mutator, outside of the program's critical path.
In a closely aligned study, White et al. [34] propose a PID (Proportional-Integral-Derivative) controller that monitors GC overhead (the percentage of total execution time spent on GC) and adjusts the heap resize ratio to maintain a target GC overhead level set by the user. They use the Jikes Research Virtual Machine (RVM) with the Memory Management Toolkit (MMTk) as the experimental platform, and the FastAdaptiveMarkSweep collector.
More recently, Bruno et al. [6] propose a vertical memory scalability approach to scale JVM heap sizes dynamically. To do this, the authors introduce a new parameter: CurrentMaxMemory. Contrary to the static memory limit defined at launch time, CurrentMaxMemory can be re-defined at run-time, similar to our soft max capacity. In addition to the new dynamic limit, this work also proposes an automatic trigger to start heap compaction whenever the amount of unused memory is large. This technique allows returning memory to the operating system as soon as possible.
Heap Size Adjustment in State-of-the-Art GC
Immix [3] is a collector suitable for high-performance computing. Immix does not require the maximum heap size to be known in advance. It continuously monitors the amount of free memory available in the heap and adjusts memory allocation accordingly. When the amount of free memory falls below a certain threshold (which may vary between implementations), Immix triggers a GC cycle to reclaim unused memory. If the free space is still insufficient after collection, Immix may allocate additional memory blocks to meet the application's memory needs. Immix also considers the rate of object allocation as a metric: if the allocation rate exceeds a certain threshold, it indicates high memory consumption and a potential need for more memory to avoid out-of-memory errors. Immix also uses heuristics to estimate the size of the working set, i.e., the set of objects that are actively being used by the application. Since Immix is a STW collector, this dynamic heap resizing brings many disadvantages.
For example, the application may experience brief pauses or slowdowns during the resizing process, which in turn makes it more difficult to reason about the memory usage and performance characteristics of an application.

Cheng et al. [8] introduce a parallel, concurrent, real-time garbage collector for multi-processors. GC work is proportional to the allocation rate, so it indirectly scales up and down with program CPU utilisation. The collector aims to provide bounds on GC pause times while also scaling well across multiple processors. Using the concept of Minimum Mutator Utilisation (MMU), they capture the percentage of time in a given time window during which the mutators have access to the CPU. They showed that their proposed collector achieves higher MMU than non-incremental GC. However, they do not assess GC CPU usage or take MMU-based actions.
Degenbaev et al. [9] propose scheduling GC during detected idle periods in the application to reduce GC latency. Their system uses knowledge of idle times from Chrome's scheduler to opportunistically schedule different GC tasks, such as minor collections and incremental marking. This allows adapting GC based on real-time application behavior and available idle cycles. While not directly adjusting the heap size, scheduling GC during idle periods reduces memory usage and footprint when the application becomes inactive, based on the real-time needs of the application.
The G1 [10] (Garbage First) garbage collector requires knowledge of the maximum memory needed for an application in advance; if it is not explicitly provided, it uses a default value. G1 uses a dynamic heap size adjustment strategy to adjust memory usage during runtime based on the current usage pattern of the application [28]. G1 divides the heap into regions of equal size and groups them into two generations: young and old. When the young generation fills up, G1 performs a young collection, during which live objects are copied to a new region while unused regions are reclaimed. G1 also performs periodic concurrent marking of live objects in the old generation. When the old generation fills up, G1 performs a mixed collection, which collects both young and old regions that have been marked as garbage. During a mixed collection, G1 dynamically sizes the heap by using the occupancy of the old generation as a target and adjusts the heap size to meet that target.
Heap size adjustment in .NET is difficult because of the prevalence of object pinning, which can make it impossible to uncommit memory. .NET offers a ConserveMemory interface to the garbage collector that allows "conserving memory at the expense of more frequent garbage collections and possibly longer pause times" [32]. This setting works by controlling the fragmentation tolerance in old generations before triggering a full, compacting GC cycle.

Discussion
Previous works adjust the heap size by estimating the amount of memory necessary to keep the application running without incurring high latency and CPU overheads. Instead of estimating the amount of memory needed by the application, we adjust the heap size to meet a specific GC target. Our CPU-driven heap size adjustment is particularly important for concurrent collectors like ZGC, which compete with the mutator for CPU resources to collect memory, unlike the STW collectors used in prior studies. Unlike the fixed-size heap headroom used in STW collectors, a concurrent GC requires variable headroom depending on the CPU available for collection. If the mutator consumes most of the CPU, a large headroom is necessary for a concurrent collector, while a small headroom suffices when the mutator has minimal CPU usage. In sum, rather than directly controlling the heap headroom as in previous works for STW collectors, we specify the desired GC target and adjust the heap headroom accordingly.

Conclusion
This paper explores an adaptive approach that automatically adjusts the heap size using the CPU overhead of GC work as a tuning knob. Our evaluation demonstrates that this technique does not negatively impact latency, which is the main goal of fully concurrent collectors. In addition, we offer insights into optimizing energy and performance by tuning GC targets. Our ongoing work focuses on seamlessly integrating and refining this approach within the ZGC framework to unlock its full potential in real-world applications.

Figure 3. Concrete measurements of GC and application time in our implementation. At the end of GC cycle n+1 (t = 7.5), we consider the time spent in GC threads (blue) and the time spent in mutators (green). Gray lines denote time measured at the end of GC cycle n. We only include the time when mutators were scheduled, meaning APP = 2.5 + 2 + 2.5 = 7. In the case of GC, we measure from the start to the finish of the GC cycle; thus, GC = 3 × 1.5 = 4.5, even though the 2nd GC thread was not scheduled after t = 7. Thus, the overhead is 4.5/7 ≈ 64 %. (This example omits barriers; read more about them in §4.3.)
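The caption's arithmetic can be reproduced directly; a small sketch using the figure's numbers (function and variable names are ours, not from the implementation):

```python
def gc_cpu_overhead(gc_wall_time, n_gc_threads, app_scheduled_time):
    """GC time is measured wall-clock from cycle start to finish and
    charged to every GC thread, whether or not each thread was scheduled
    the whole time; APP counts only time when mutators were actually
    scheduled."""
    return (gc_wall_time * n_gc_threads) / app_scheduled_time

# Figure 3's example: GC = 3 threads x 1.5 = 4.5, APP = 2.5 + 2 + 2.5 = 7
overhead = gc_cpu_overhead(1.5, 3, 2.5 + 2 + 2.5)  # ~0.64
```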

Table 3. The 99th-percentile metered latency from the adaptive approach normalized to vanilla ZGC. The color coding is the same as in Table 2. H stands for Hazelcast.