Optimistic Stack Allocation and Dynamic Heapification for Managed Runtimes

The runtimes of managed object-oriented languages such as Java allocate objects on the heap, and rely on automatic garbage collection (GC) techniques for freeing up unused objects. Most such runtimes also consist of just-in-time (JIT) compilers that optimize memory access and GC times by employing escape analysis: an object that does not escape (outlive) its allocating method can be allocated on (and freed up with) the stack frame of the corresponding method. However, in order to minimize the time spent in JIT compilation, the scope of such useful analyses is quite limited, thereby restricting their precision significantly. On the contrary, even though it is feasible to perform precise program analyses statically, it is not possible to use their results in a managed runtime without a closed-world assumption. In this paper, we propose a static+dynamic scheme that allows one to harness the results of a precise static escape analysis for allocating objects on stack, while taking care of both soundness and efficiency concerns in the runtime. Our scheme comprises of three key ideas. First, using the results of a statically performed escape analysis, it performs optimistic stack allocation during JIT compilation. Second, it handles the challenges associated with features that may invalidate the optimism, using a novel idea of dynamic heapification. Third, it uses another novel notion of stack ordering, again supported by a static analysis, to reduce the overheads associated with the checks that determine the need for heapification. The static and the runtime components of our approach are implemented in the Soot optimization framework and in the tiered infrastructure of the Eclipse OpenJ9 VM, respectively. To evaluate the benefits, we compare our scheme with the existing escape analysis and find that it succeeds in allocating a much larger number of objects on the stack. Furthermore, the enhanced stack allocation leads to a significant reduction in the number of GC cycles and brings decent performance improvements, especially suited for constrained-memory environments.


INTRODUCTION
The managed runtimes of languages such as Java allocate objects on heap and deallocate them using automatic garbage collection.This unburdens the programmer from making intricate allocationdeallocation decisions, and reduces the possibility of harmful memory bugs.In contrast, memory allocated on stack gets freed up as soon as the activation record of the allocating method is popped out, thus eliminating the need for explicit garbage collection.The ability to allocate objects or their fields on the stack instead of on the heap has long intrigued the community, with researchers coming up with different applications of escape analysis to enable the same.For example, the C2 just-in-time (JIT) compiler of the HotSpot VM [18] uses escape analysis to decompose objects into scalar variables on the stack.Similarly, GraalVM [17] uses a partial-escape analysis [21] to enable scalar replacement in parts of a program when it cannot be performed throughout the program.Nonetheless, there are a large number of scenarios when Java objects cannot be scalarized and need to exist in entirety.The most common case is when an object is passed as an argument to a method that is too large to be inlined.To overcome these limitations, many prior works, instead of trying to decompose objects, choose to allocate them in their original shape on the stack instead of on the heap; this optimization is called stack allocation [5,7,25].However, stack allocation in a managed runtime relies on analyses performed during JIT compilation which, like any JIT analysis, affects the execution time of the program.Consequently, to target efficiency, prior works involving JIT compilers resort to performing imprecise escape analyses that usually enable very few objects to be allocated on the stack.As an example, we found that on average across several benchmark programs of interest, the JIT compiler of the Eclipse OpenJ9 VM [11] was able to stack allocate only 0.16% of the total number of objects.
A promising alternative to imprecise JIT analyses could be to perform an aggressive escape analysis statically and use its results to perform stack allocation during JIT compilation.This way, one could potentially identify a much larger number of objects to be stack allocated, without performing expensive analysis during JIT compilation.However, in presence of dynamic features supported by the language and the runtime, as well as differing libraries between the static analysis and the runtime, such a static+dynamic strategy may lead to erroneous stack allocation.As an instance, it is quite popular in the Java static-analysis community to resolve reflective calls using run-time logs [6]; if the call graph changes during execution from what was assumed statically, an object may get passed to a previously unknown method potentially making it escape.Similarly, many managed runtimes provide features like dynamic classloading and hot-code replacement that allow new code to show up during program execution, thereby again leading to a potentially wrong stack allocation if the object escapes therein.In this paper, we propose a sound and efficient approach of using static-analysis results to perform stack allocation in a Java VM, which is not marred by any dynamism that the language or the runtime may impose during program execution.
Our approach is based on three novel ideas: optimistic stack allocation, dynamic heapification, and stack ordering, supported by corresponding precise static analyses.To begin with, the first idea is for the VM to optimistically allocate a list of objects, generated by a static escape analysis, on the stack frames of the methods it JIT-compiles.This lets us harness the results of a precise escape analysis for performing enhanced stack allocation.The second idea is to timely detect, using a run-time check and a "stack walk", if any objects are allocated on the stack incorrectly by the optimistic scheme; and if so, then to move them onto the heap (along with correction of references).This Contributions: • A first-of-its-kind static+dynamic strategy to optimistically perform stack allocation in Java VMs, harnessing the results of context-, flow-and field-sensitive escape analysis.• A new run-time algorithm based on stack walks, implemented across the tiered infrastructure of a real-world JVM, to timely detect and move wrongly stack allocated objects to the heap, while correcting their references.• A novel idea of ordering non-escaping objects on the stack to reduce the time spent in stack walks and consequently the overhead of run-time heapification checks.• An evaluation that shows the efficacy of combining static analyses with run-time strategies in improving the amount of stack allocation that a JIT can perform and consequently the run-time performance, in allocation-intensive programs.
The rest of the paper is organized as follows.Section 2 illustrates the interesting challenges that our proposed approach handles, with a motivating example.Section 3 describes the key run-time components of our approach: optimistic stack allocation, run-time checks for heapification, and recursive dynamic heapification using stack walks.Later, Section 4 shows how we determine and use stack orders to reduce the number of stack walks towards reducing the overhead of run-time checks.Section 5 states the correctness criterion of our approach, and discusses few design choices that we made while implementing it for a full language in an industry VM. Section 6 evaluates the improvements imparted by our approach in terms of stack allocation, performance in constrained memory environments, as well as the contribution of the key ideas in imparting efficiency.Finally, Section 7 presents important related work and Section 8 concludes the paper.

MOTIVATING EXAMPLE
The primary goal of our paper is to propose a robust and efficient scheme that allows static escape analysis of Java Bytecode to be used "safely" for performing stack allocation in the JVM.In this scheme, we let the JVM optimistically allocate objects on the stack, using the information given by the static analysis, in such a way that the JVM is able to detect and handle incorrect stack allocations during run-time.In this section, we motivate the reader towards the subtle aspects of developing such a scheme, using a running example.
Consider the Java code snippet shown in Figure 2. The class A has a field g of type B and the class C has a field f of type A. Denoting the abstract object(s) allocated at line as , we can see that the reference variable x in the method doo points to the object 5 , object 5 's field f points to the object 7 , and object 7 's field g points to the object 9 .The reference variable p too points to the object 7 (due to the field load at line 8), which is later passed to the method zar of class C (assuming class D is not known during static analysis).As zar does not make the objects pointed-to by its parameters escape, we can allocate the objects 7 and 9 on doo's stack frame, as shown in the left-hand side of Figure 3. Prima facie, note that this information required an interprocedural analysis (to incorporate the effect of zar into doo).Now consider a scenario in which during program execution in the JVM, the method doo is called by passing an object of a dynamically loaded class D (defined at line 20) as the second argument.Here, as the variable r now points to an object of type D, the call site at line 10 would invoke the overridden method zar in D. However, this zar stores the object pointed-to by its parameter p into the field f of its parameter q (at line 22), which in turn points to an object passed to its caller (via the parameter q in doo).As a consequence of this store statement, the lifetime of 7 becomes longer than that of its allocating method doo, and thus 7 cannot be kept on the stack frame of doo.Furthermore, the object 9 is reachable from an escaping object 7 and hence that too cannot be kept on the stack frame of doo.Letting these objects be on the stack is unsound.In order to ensure correctness, we need to first of all detect the erroneous stack allocation before anybody tries to access the corresponding object, and then ensure that the run-time memory layout is modified to avoid future incorrect accesses during the program's remaining execution.Also, as this needs to be performed on-the-fly during program execution, its efficiency is crucial.We next describe how we perform these operations in context of our motivating example from Figure 2.

Detecting Incorrect Stack Allocation
Recall that owing to dynamic classloading in the above discussed example, the object 7 becomes non-stack-allocatable (i.e., escapes) at line 22 in D's zar.Thus, line 22 is the precise program point at which we could detect, just before storing 7 into the field of a longer lifetime object (the one pointed-to by the parameter q), that we are going to make 7 escape.In order to detect this occurrence, we insert a check (using a new opcode in the intermediate language of the JIT compiler) immediately before the write barrier associated with the store operation.This "heapification check" compares the lifetimes of the object being stored (say rhs) and the base object on the left-hand side of the assignment (say lhs), and thus decides whether we could have wrongly allocated rhs on the stack.Note that the lifetime-comparison check needs to be performed for any statement that may cause an object to escape, along with the more common store statement discussed above.

Moving Objects and Correcting References
Once we have detected that an object is wrongly allocated on the stack, we need to move it onto the heap (we call this operation heapification).Moving the object from the stack to the heap is as simple as creating a copy of the object (along with copying the values of its fields) and allocating the new object onto the heap by calling the VM's memory allocation routine.However, there are two more subtle issues to consider.First, the fields of the object being copied may be pointing to other non-primitive objects; for example, the object 9 pointed-to by the field g of the object 7 .This means the heapification operation needs to be repeatedly performed until it has copied all the reachable objects onto the heap.Hence, once we detect an incorrectly stack allocated object, we invoke a "recursive heapification" routine that allocates all the reachable objects that are currently on the stack onto the heap.Second, there may be stale references to the object being heapified (on stack or in registers); for example, after this heapification is done, there remains a stale reference from the variable p in the method doo to the 7 on the stack.Such references need to be corrected to point to the right object.We perform this "reference correction" by maintaining a map containing the old and the new addresses of the objects being recursively heapified, traversing the stack frames the VM requires a fixed stack-frame size and hence precludes the stack allocation of variable-sized arrays and objects that escape a loop iteration.The VM also requires stack-allocated objects to be pre-zeroed at method entry and makes the garbage collector aware of these objects by maintaining their offsets in a metadata field with each method.Similar to other Java runtimes, OpenJ9 is also conservative in performing stack allocation (as well as other analyses and optimizations) when dynamic features such as hot-code replacement are allowed by the VM.As we show in Section 6, the imprecision introduced by time constraints during JIT compilation (escape analysis is only partially interprocedural, depending on the peek-depth), as well as the possibility of dynamic features, results in the VM being able to allocate a very small percentage of objects on the stack during program execution.In this section, we propose an approach that uses the results of a statically performed escape analysis to aggressively stack allocate method-local objects in OpenJ9, while being robust to the dynamic features allowed by the runtime.
The next two subsections discuss two major aspects of our scheme.Section 3.1 discusses how the static-analysis results are used in the runtime to optimistically allocate non-escaping objects on the stack.Section 3.2 discusses the dynamic heapification routine, which essentially is the repair that is performed when a stack-allocated object escapes.We discuss stack ordering, which is an optimization to reduce the overhead of the heapification routine, in Section 4.

Stack Allocation using Static Analysis Results
Our approach starts with a statically performed context-, flow-and field-sensitive escape analysis of Java programs in the Soot framework [24].This analysis uses points-to graphs [25] to build method summaries in terms of the escape status of each abstract object allocated in the program.The summaries are built bottom-up over the call graph (which is generated using the context-insensitive SPARK [14] algorithm implemented in Soot).
The result of our static escape analysis is a per-method list of objects that can be allocated on the stack.As the OpenJ9 JIT converts Java Bytecode to a tree-based intermediate language (Tree IL), we represent objects in terms of bytecode indices (BCIs) of the corresponding new bytecodes, and provide the static-analysis results to OpenJ9 in a separate file (with ".res" extension).In OpenJ9, during the escape analysis phase, we match BCIs corresponding to object nodes in Tree IL with these BCIs to determine if the corresponding object can be allocated on the stack.That is, any BCI listed against a method m in the .resfile can be allocated on the stack frame of m.We next describe when and how we read and use the static analysis results present in the .resfiles in the VM.
In an execution enabled with optimistic stack allocation (determined using a command-line argument), we read the supplied .resfile when the runtime is being initialized and store the results (say in a map called staticAnalysisNonEscapingMap).This ensures that we incur file-reading overheads only once, in the beginning.
In the existing VM implementation, whenever the JIT compiler picks a hot method for compilation, it traverses the Tree IL corresponding to that method, performs a simple (imprecise) escape analysis over the object nodes therein, and generates a list of candidates deemed suitable for stack allocation.We top up this list with the objects whose BCIs are present in the list corresponding to the method being compiled in staticAnalysisNonEscapingMap so that those additional objects also get marked for stack allocation, without actually performing a complex analysis during JIT compilation.Later, the stack allocation routine allocates the objects present in this candidates list on the stack (instead of on the heap) when the stack frame of the corresponding method is created during execution.
Example.Consider the code snippet in Figure 2. The local objects 5 and 6 in the method doo are passed as arguments to the method bar.The existing escape analysis, unless it is able to peek inside bar, would not be able to stack allocate any of the objects 5 or 6 on the stack frame of doo.However, our static-analysis guided optimistic scheme would be able to determine that none of the objects passed to bar escape therein (the store statement inside bar involves objects allocated in the same method, i.e. doo -identifiable with one level of context sensitivity), and consequently mark both 5 and 6 in doo for stack allocation.Similarly, our scheme would be able to mark the objects 7 and 9 for stack allocation in doo.

Dynamic Heapification
As we shall see in Section 6, our static-analysis based scheme marks a significantly high number of objects for stack allocation during program execution.However, in a language runtime that allows interesting dynamic features such as dynamic classloading (DCL) and hot-code replacement (HCR), which essentially allow arbitrary code changes, an object that was stack allocated may escape during execution.We cannot simply leave such objects allocated on the stack as is, and need a mechanism to repair the memory such that the program behaviour is preserved.The most unique feature of our approach is this timely detection and fallback mechanism for repairing incorrect stack allocation; we describe this next.Also, as a consequence of the presence of a dynamic repair mechanism, we call our static-analysis guided stack allocation optimistic.
To handle the possibility of dynamically introduced escapes in the VM, we need a run-time check that timely (that is, before any incorrect access) determines whether we have allocated an escaping object on the stack.For such escaping objects, we also need a mechanism to move the object from the stack to the heap and update all previous references to the object to point to the new heap location.We refer to the introduced run-time checks as heapification checks and the procedure to move objects from stack to heap while correcting references as dynamic heapification.
We observe that there are five kinds of bytecodes that may cause an object to escape: (1) Returns of references (bytecode areturn).
(2) Reference stores to instance fields, static fields and object arrays (bytecodes putfield, putstatic and aastore, respectively).(3) Throwing of exceptions (bytecode athrow).(4) Calls to native methods that may lead to reference stores in sun.misc.Unsafe (putObject, putObjectOrdered, putObjectVolatile and compareAndSwapObject).(5) JNI APIs used to perform stores in called C/C++ code that may manipulate Java objects (setObjectField, setObjectArrayElement and setStaticObjectField).The last three cases typically require the associated object to be on the heap, and hence we trigger a heapification if the object is currently found on the stack.However, for the first two cases (that is, at reference returns and at reference stores), we insert more involved checks to determine the need for heapification, as discussed below.Also, in the remaining text, we describe our heapification checks in detail for the JIT compiler; the changes in the interpreter are similar except that they are introduced directly in the interpretation logic for these bytecodes instead of being emitted in the corresponding native code by the code-generation phase of the compiler.

Heapification Opcodes.
In order to insert heapification checks, we have introduced two opcodes in the JIT compiler: possibleHeapificationAtReturn and possibleHeapificationAtStore.In the IL generation phase of the JIT compiler, when relevant bytecodes are processed, we insert these opcodes in the generated code.The evaluation of these opcodes performs the heapification check.
When an object is being returned, we add the possibleHeapificationAtReturn opcode before the corresponding instruction.When this opcode is evaluated before a return statement, we check if the return value is allocated on the returning method's stack; if so, it is heapified.This way, we do not unnecessarily heapify objects that were passed as a parameter or were allocated on the heap.
A field-store statement involves two objects: the base object on the left-hand side (say lhs) and the object on the right-hand side (say rhs) being stored into a field of the base object.Now if the lhs object has a longer lifetime (that is, it is allocated either on the heap or on a higher stack frame), then the rhs object is escaping.In case this rhs object was allocated on the stack and is found to be escaping due to this store statement, it would require heapification.We perform this check by adding the possibleHeapificationAtStore opcode at store statements.The evaluation of this opcode involves two components: first, a cheap address-comparison check; and second, a more expensive stack-walk check that is performed only when the address-comparison check fails.

Address
Comparisons and Stack Walks.The address-comparison check involves comparing the virtual addresses of the lhs and the rhs objects.During execution, we have access to the bounds of the stack region and can determine whether an object is present in the stack or in the heap.If an object is present in the stack, then we can further perform the address-comparison check.Note that the stack region in OpenJ9 is modelled to grow towards lower addresses and hence newly created stack frames have lower addresses than the previous stack frames.Thus, if two objects have been allocated on the stack, by comparing their addresses we can determine which object was allocated first.However, the existing stack allocation scheme allocates objects in an arbitrary order, within a single compiled-method stack frame.As a result, for the scenario when the lhs and the rhs objects were allocated in the same method, by comparing addresses we cannot always determine which object was allocated first 1 .In fact, their activation records would be cleared at the same time as they are in the same stack frame and hence, both objects have the same lifetime.When this happens, the rhs object does not escape as the lhs object also has the same lifetime.But in case the lhs object was allocated a lower address than the rhs object (due to the arbitrary nature of the existing stack-allocation scheme), we might imprecisely conclude that the rhs object escapes (a false positive).Further, this imprecise conclusion may cause more heapifications than needed.
In order to reduce false positives and make the heapification checks more precise, if we had access to the stack-frame bounds of the method in which an object was allocated, we could easily determine if two objects were allocated in the same method.But since maintaining that explicitly for all the methods during run-time would have a high overhead, we choose to use a stack-walk based approach instead, as discussed below.
A stack-walk check involves walking over the stack to determine if the rhs object was allocated in the same frame as the lhs object or in a deeper frame.A stack walk iterates over the stack frames from the most recently created frame to the deepest one, and compares the stack-frame bounds with the addresses of the rhs and the lhs objects.It stops iterating once it finds a stack frame that has either of these objects in its address range.If the lhs object was allocated in a deeper frame than the rhs object, then we infer that the rhs object will escape and has to be heapified.
Observe that a stack-walk check can precisely determine if an object escapes; however, it involves a call into the JVM from JITed code as well as iteration over stack frames, and thus has a high overhead.Hence, we use the stack-walk check only when the address-comparison check fails.

3.2.3
The Heapification-Check Algorithm.Figure 4 summarizes the above discussion on heapification checks for store statements (of the form . = , where both and are reference variables).Say the object pointed-to by is the "lhs" and the one pointed-to by is the "rhs".Correspondingly, say lhsReg and rhsReg contain the addresses of the lhs and the rhs objects, respectively.Also, let stackBaseReg and stackEndReg store the starting and the ending addresses of the stack region of the memory, respectively.The algorithm essentially has three consequences.We do not need any heapification either when the rhs object is outside stack bounds (i.e., already on the heap) or when the rhs object has a longer lifetime compared to the lhs object (conditions at lines 1 and 8).We definitely need heapification when we are sure of an escape, which is the when the lhs object is on Fig. 4. The heapification check performed at store statements ( .= ).lhsReg and rhsReg store the addresses of the base object (pointed-to by ) and that of the object being stored (pointed-to by ), respectively.
the heap (condition at line 5).We may or may not need a heapification when both the lhs and the rhs objects are inside the stack bounds but the address comparison fails (line 10), whereby we need to perform a stack walk to determine if the rhs object escapes.
We have implemented the stack-walking routine as a helper method in the JVM, called heapi-fyObjectIfRequired (code skipped for brevity).Essentially, from the JIT-compiled code where the address comparison is performed, we call the heapifyObjectIfRequired method with the lhs and the rhs objects of the store as parameters.The method performs a stack walk to find out if the lhs object was allocated on a older stack frame than the rhs object; and if so, then it heapifies starting from the rhs object.
The heapification routine is responsible for moving the object from the stack to the heap and updating all references to the object to point to the new heap location.This is done by allocating a new object on the heap and copying the values of the fields of the escaping object from the stack to the new object.Further, a different stack walk is performed in the JVM to iterate slot-wise through all stack frames and update references to the escaping object to point to the new heap location.
It is worth noting that the escape of one object can cause the escape of other objects reachable via its fields.This brings up a need for heapifying objects recursively, which we perform by iterating over all the objects referenced by fields of the object that was heapified.If such references are found on the stack, they are also (recursively) heapified and the references to these objects are updated to point to the new heap location.During recursive heapification, we also take care to not heapify the same object multiple times using a hashmap that consists of book-keeping information about addresses of the objects that have been heapified in the current heapification invocation.
Example.After dynamic classloading in Figure 2, our heapification check (inserted before the bytecode corresponding to the new store at line 14) would detect an incorrect stack allocation for the object 7 , move it to the heap, and recursively heapify the reachable object 9 .Further, it would update the reference variable p in doo to point to the new 7 on the heap (see Figure 3).Case 1: has always been on the heap.points to and dereferences it successfully.Case 2: was allocated on stack and the reference was created before it escaped.Our approach inserts a heapification check (possibly followed by recursive heapification) in the interpreter and as part of the code generated by the JIT compiler, at all the statements that could make an object escape.Thus, for , the heapification routine performs a complete stack walk and makes all the existing references point to the copied object, irrespective of whether the activation of corresponds to interpretation or JIT compilation 2 .Hence, successfully finds the copied object on the heap.Also, our heapification routine is atomic; that is, it ensures that the state of the object cannot be changed unless it has been heapified completely.
Case 3: was allocated on stack and the reference was created after it escaped.The recursive heapification routine would have already heapified and hence any new references (including ) point to the object on the heap.

Design Choices
1. Implementation over existing escape analysis.Recall that we use static-analysis results on top of the existing escape analysis in the JIT compiler (Section 3), though in principle we could have replaced the existing escape analysis altogether.However, we found that the existing analysis was able to handle certain scenarios better than a static analysis; for example, at a call site involving virtual dispatch, sometimes the JIT can speculatively inline one of the targets whereas a static analysis would conservatively take a meet among the escape statuses of the objects passed to possibly multiple methods.Furthermore, the existing analysis is quite fast (due to its imprecise nature) and does not affect compilation time noticeably.

Avoiding repeated heapifications.
A tiered JIT-based runtime like the OpenJ9 VM may recompile a method multiple times due to the availability of better profiles or deoptimization caused by incorrect speculation.Similarly, if a dynamic property causes repeated heapifications for the objects allocated in a method, we could decide to skip the usage of static-analysis results for that method and recompile the method with the baseline escape analysis.
3. Usage of .resfiles.We have currently used separate text files to store static-analysis results before reading them in the VM.We can simply place these results as attributes in Java class files as well.

EVALUATION
The purpose of our evaluation is twofold, first to show that a static+dynamic scheme like ours can be used to improve the amount of stack allocation in an otherwise resource-constrained environment; and second to demonstrate the impact of our novel approaches in imparting efficiency while maintaining robustness of the statically computed results.
The enhancement in stack allocation can be measured in multiple ways: (i) by counting the number of additional allocation sites that our approach allows to be marked for stack allocationwe compute this by simply incrementing a counter during JIT compilation; (ii) by counting the actual number of additional objects allocated on the run-time stack -we do this by inserting a dynamic counter through the codegen phase of the compiler, which gets incremented every time an object is allocated on the stack during program execution; and (iii) by measuring the actual amount of additional memory that gets allocated on the stack -we compute this by adding up the sizes of all the objects allocated on the stack during program execution, again by instrumenting the codegen phase of the compiler.We also present insights from a fine-grained analysis of the features that lead to some of the highest stack allocations.These results are discussed in Section 6.2.
The impact of additional stack allocation itself on a program's execution can also be measured, albeit in more non-trivial ways.We compute the most direct measure -the run-time performance -in a steady state environment for all the benchmarks under consideration, and show that we sometimes improve but never degrade the performance in default settings.However, reckoning that stack allocation assumes more importance in constrained memory environments, we additionally demonstrate reduction in GC cycles (and consequently, performance improvements) by lowering the amount of heap memory supplied to the JVM.These results are discussed in Section 6.3.
Finally, in Section 6.4, in order to establish that even the offline cost of our approach is reasonable, we report the times taken by the Soot-based static analysis as well as the size overheads of the .resfiles that are to be supplied additionally to the VM during program execution.

Experimental Setup
We now describe the benchmarks, the machine, and the comparison modes used for our evaluation.
Benchmarks and machine.We have used benchmarks from the DaCapo suites 23.10-chopin and 9.12-MRI [4], and SPECjvm 2008 [20].The benchmarks skipped from these suites are those that cannot be analyzed with the used version of Soot and TamiFlex (error reports available on GitHub).We have performed our experiments on a 12th Gen Intel(R) Core(TM) i7-12700 system with 20 cores and 16 GB RAM, running Ubuntu 22.04.1 LTS.Our static analysis has been written using Soot 3.1 [24], with reflection logs for resolving reflective calls generated using TamiFlex 2.0.3 [6].Our modifications to the VM (both interpreter and JIT) have been made over the latest version of Eclipse OpenJ9 [11] built with JDK8 (OpenJ9 commit b4cc246, OMR commit 162e6f7, JCL commit 2a5e268); we chose this JDK version because Soot+TamiFlex do not yet fully support newer Java bytecodes (such as invokedynamic), requiring our static analysis to handle the call sites involving them conservatively.However, in order to be able to test our scheme with the newer (and larger) benchmarks from the latest DaCapo suite (23.11-chopin) that cannot be executed with JDK8, we have also ported our implementation to the latest OpenJ9 built with JDK21 (OpenJ9 commit 7c701a7, OMR commit 87019fe, JCL commit 07e5992) and report the results therein.
In all our experiments, we run each benchmark long enough to account for warmup: for DaCapo 9.12-MRI we take the number of required warmup iterations to achieve a steady state from a prior work that used these benchmarks [16], for DaCapo 23.10-chopin we ran each benchmark for 100 warmup iterations, and for SPECjvm benchmarks we use the existing warmup built into its harness.Also, we disabled shared-class cache (a feature in OpenJ9 that allows loading of previously compiled code in subsequent runs) so that our experiments are independent of previous executions.
Comparison modes.For our experiments involving OpenJ9, we have two primary modes: one is the baseline VM (called BASE) and another is the VM with our changes enabled (called OPT).The mode OPT supports optimistic stack allocation using the partial stack order in .resfiles, with code generated to perform heapification checks and dynamic heapification for objects determined to be escaping at run-time.For measuring the impact of stack ordering, we also have a version MOPT that has everything from OPT minus the support for ordering objects on stack frames.

Enhancement in Stack Allocation
In this section, we compare the stack allocation performed by BASE and OPT, and present insights on the static-analysis features that lead to improvements.

Static and Dynamic Number of Allocations.
For both the schemes BASE and OPT, the first and the second columns in Table 1 show the number of allocation sites marked for stack allocation during JIT compilation ("Static Counts") and the actual number of objects allocated on stack during program execution ("Dynamic Counts"), respectively.The parentheses in each of the columns contain the percentages of objects identified under each category, relative to the total number of static and dynamic objects, respectively.As can be seen, BASE by default is able to identify and stack allocate very few objects for most benchmarks, sunflow being the only exception where the dynamic counts go up to 20.08%.This is because the existing JIT analysis is imprecise and can work well only in scenarios where most of the allocated objects are either local to the method being compiled or passed to methods that easily get inlined.In contrast, our scheme OPT is able to improve the stack allocation significantly.The static number of object allocation sites identified for stack allocation, over all the benchmarks, increases by about 98% (from 514 to 1017), and the dynamic number of object allocations on the stack increases by about 71% (from 1473M to 2524M).Note that the improvement in dynamic percentage of stack allocations could be much higher than that in the percentage of allocation sites identified statically; for example, for h2, OPT is able to stack allocate 13.92% of objects on stack even with the static allocation sites being 3.87%.This is because of objects created inside loops, which are identified as stack allocatable even if they are passed to other methods, due to the precise interprocedural nature of the results used by OPT.Few notable benchmarks where the improvement is high are the h2, pmd and signverify, wherein OPT increases the dynamic percentage of stack allocation from 0.92%, 7.20% and 0.86%, respectively, to 13.92%, 14.20% and 7.24% of the total number of objects.
In order to estimate the actual amount of heap memory savings with our scheme, we measured the amount of stack memory allocated by both BASE and OPT during execution and report them in the third columns for each scheme ("Stack Bytes").We observe that the number of bytes allocated on stack by OPT increases by about 54% compared to BASE (57782MB over 35176MB, over all the .These programs are indeed allocation-intensive, and OPT shifts the memory allocated from the heap to the stack by a good margin in all these benchmarks.

Simulating
Longer Runs of Benchmarks with Forced JIT Compilation.Noting the near-zero numbers for both BASE and OPT for many of the DaCapo 23.11-chopin benchmarks, we investigated their profiles and found that during their execution with 100 iterations very few methods got JIT-compiled at levels that perform escape analysis and stack allocation in OpenJ9.Observe that some of these benchmarks (such as kafka, which is a real-time streaming system) are indeed supposed to be run for a really long duration of time, which might lead to warmer JIT compilations subsequently; we verified this by running some of these benchmarks for thousands of iterations and noticing that stack allocations started to increase.In order to approximate such runs, and to estimate the maximum improvements our scheme could bring therein, we lowered the JIT compilation threshold of OpenJ9 to one invocation and recomputed the number of stack allocations; Table 2 shows the benchmarks for which either BASE or OPT scheme showed a significant improvement compared to the "default-mode" runs reported in Table 1.As we can see, under this configuration, even BASE performs a few stack allocations for all the benchmarks.Importantly, we see much better improvements with OPT for the benchmarks jme and kafka (where the number of objects allocated on stack increase to 35.6% and 4.37%, respectively).Similarly, we found noticeably higher improvements for the benchmarks fop and h2 in this configuration.With these observations, we believe that our scheme could be used to make case for a stack-allocation pass even at lower levels of compilation, or devices with even lower JIT budget, as it has the potential to improve the number of stack allocations without performing any escape analysis during JIT compilation.

Analyzing Allocation
Sites that Lead to High Number of Allocations.We conducted another experiment to understand which kinds of allocation sites get stack-allocated with our scheme but not with the existing JIT analysis.For this, we inserted a counter in the generated code to measure the number of dynamic objects created by each allocation site marked for stack allocation, sorted the results in decreasing order, and mapped them back manually to the corresponding source code.We found two important features of our static analysis that lead to an improved precision.First of all, as OPT's escape analysis is performed statically, it analyzes all the methods interprocedurally; whereas BASE performs escape analysis only for methods that are compiled at hot+ levels of compilation.Furthermore, as the peek depth of BASE is quite limited (usually just zero or one, unless all the callees are compiled at high levels of compilation), it marks objects passed as arguments beyond the peek depth as escaping.Even when BASE peeks a callee, it is limited in terms of the analysis information (e.g., use-def information) that it builds for peeked methods as it does not have the necessary compile-time budget to perform these kinds of analyses in the quest for precision, whereas the OPT does not have these kinds of constraints.For example, in the benchmark h2, we found a class RowList whose instances (>2 million in number) are used to store rows from a database.The benchmark code uses rows as the receiver for a call rows.add, which further calls a chain of methods starting with the same receiver.BASE fails to perform interprocedural analysis of the nested calls and marks the object pointed-to by rows as escaping (not the case with OPT).Note that it is possible to make BASE more intricate and capture such objects, which gets evident upon noticing the improvements with forced compilation in Table 2 for h2 (as well as others), but this increases the time spent in JIT compilation (successfully avoided by OPT).
Secondly, in order to achieve more precision than peeking, BASE requires the target of a call to be inlined into the caller (which is not always possible during JIT compilation).Whereas OPT achieves the resultant precision without the requirement of inlining such calls, due to context sensitivity in the static analysis.For example, consider the following code snippet in which a method m is called from two different callers c1 and c2, such that c1 passes an escaping object as argument and c2 a non-escaping one.The method m stores a new object into the field f of its parameter p, and later the method c2 stores another new object into the field of after the call to m.Now, BASE would be able to identify the object as non-escaping only when it is able to inline m into c2, whereas OPT analyzes the calls to m context sensitively and marks for stack allocation, without requiring m to be inlined during JIT compilation.We found several instances of this scenario in the benchmark pmd and believe it would be prevalent even in others.
Overall, we assert that our static+dynamic scheme shows a marked improvement in stack allocation compared to the baseline, and we further measure its impact on performance in Section 6.3.

Impact of Additional Stack Allocation
In this section, note that the performance metric reported by the DaCapo harness is time (in msec) and that by SPECjvm is ops/sec; we have normalized the latter to enforce a lower-the-better trend.

Impact on Performance (Default Memory
). Figure 7 shows the performance box plots over ten iterations in steady state for the three modes outlined in Section 6.1: BASE, MOPT and OPT.We have skipped showing flat results for benchmarks that show a negligible difference across the various modes.For the nine benchmarks in Figure 7, we can sometimes see a small increase in the time taken from BASE to MOPT.This is due to the overhead of heapification checks performed as part of evaluating the potential escape-causing statements as discussed in Section 3.2.On the plus side, with our stack-ordering enabled implementation OPT, we see that the time taken either matches that of BASE (compiler, luindex, pmd) or even improves noticeably (fop, graphchi, h2, lusearch, rsa).Further, though we occasionally see a (very) slight increase in time with OPT, we later show (in Section 6.3.3) that improvements with OPT are much more pronounced once we reduce the currently very high amounts of heap memory supplied to the JVM, for the benchmarks with good stack allocation.Thus, it can be asserted that our proposed scheme is capable of allowing much higher stack allocation without causing performance overheads, and that the performance improves too when the additional stack allocation starts showing benefits (possibly in terms of faster variable accesses and reduced GC cycles; we measure the latter next).6.3.2Impact on Garbage Collection.One of the relevant ways of measuring the impact of stack allocation is to compare the number of garbage collection (GC) cycles between BASE and OPT.To perform this experiment, we vary the amount of maximum heap memory made available to the JVM process (using the -Xmx argument) and compute the number of GC cycles (using the -verbose:gc argument of OpenJ9) for both BASE and OPT.More the allocation on stack (and conversely, less the allocation on heap), we expect to see fewer GC cycles.Further, lesser the heap memory supplied to the VM, more pronounced should be the impact of allocating bytes on stack rather than on the heap.Now consider the results shown for six benchmarks (chosen based on the object-allocation profile and improvements observed in Section 6.2) in Figure 8.Here, we calculated the minimum heap size (say X) required to execute each benchmark and then computed the number of GC cycles by setting the available heap memory to X, 2X (moderate heap) and 3X (generous heap), as well as the default heap size (DEF), which is 4 GB for our machine.
We can observe that lower heap sizes lead to more number of GC cycles, in general (due to the increased memory pressure).Importantly, we can observe fewer GC cycles with OPT (particularly for compiler, graphchi and h2).This is because once we allocate more objects on stack, we directly reduce the memory pressure on heap, which subsequently translates to reduction in GC cycles.to happen in a loop, and the detection to happen three frames down the run-time stack by inserting an escape-causing store statement therein.Our observations for this small-program experiment were interesting: We found that a million heapifications caused our program to slowdown by one second (original run time 7 seconds); 10 million heapifications caused a slowdown of 10 seconds (original run time 30 seconds), and so on.There was no noticeable slowdown until a few tens of thousands heapifications.We can infer from this result that the cost of heapification for larger programs, where it may not happen as frequently as our artificial program, may be quite tolerable compared to the benefits it may present.Hence we believe our scheme, coupled with efficient heapification checks, is indeed effective in scenarios where a dynamic feature may cause optimistic stack allocation to fail occasionally, and is thus practical to be employed in standard runtimes.

Offline Cost
Finally, observe that the advantages of our scheme -both in terms of additional stack allocation and efficient heapification checks -stem from a precise (flow-, field-and 1-level context-sensitive) static escape analysis; in fact, one could apply any of the recent advances in static analysis, for example object and type sensitivity [15,19], and keep getting better.The advantage of performing analysis of Java Bytecodes statically is that the amount of time spent in gaining additional precision does not affect the execution time of the program.For completion, we now report the times taken, and the sizes of the .resfiles generated, by our static analysis implementation; see Table 3.
The second column shows the time taken by our static analysis (including stack ordering), for each benchmark.As can be seen, the static-analysis times are tolerable (on average, 100.1 seconds) and generally increase with the increase in the size of the benchmark (see the .classsize column).Further, the .resfiles containing the static analysis results are small -on average 0.3 MB and ≈6.0% of the size of the class files -and do not cause any significant storage and reading overhead.
To conclude this section, we note that our scheme is capable of bringing out the benefits of a precise escape analysis to a production Java Virtual Machine, which, previously, could allocate a much lower number of objects on stack even with a sophisticated escape analysis infrastructure in the JIT compiler.We believe this paper is just an example of the power of a static+dynamic approach (coupled with a fallback mechanism) for optimizing programs in constrained managed runtimes, and that a strategy like ours could be applied to even more analyses and optimizations.Finally, our implementation for optimistic stack allocation and dynamic heapification is open source and ready to be integrated into the Eclipse OpenJ9 VM.

RELATED WORK
To the best of our knowledge, ours is the first work that uses static analysis to perform stack allocation in a VM while supporting a dynamic repair mechanism involving heapification and stack ordering.Existing works involving JVMs do perform escape analysis and resultant optimizations of various kinds though, and we discuss them in this section.One of the first extensive proposals to perform escape analysis for Java was implemented in the Jalapeño VM [1,25], which introduced the points-to escape graph notation for denoting points-to relationships as well as performing reachability-based escape analysis.Another popular abstraction for performing escape analysis, proposed by Choi et al. [8], is that of connection graphs, which do not maintain points-to relationships directly but allow one to perform reachability checks faster.A partially interprocedural version of this approach is used by the C2 JIT compiler of the HotSpot VM [18] to perform synchronization elision and scalar replacement.In contrast, the approach proposed by us in this paper for improving the partially interprocedural escape analysis of Eclipse OpenJ9 [11] is context as well as flow sensitive, and uses points-to graphs as they additionally allow us to determine stack orders (which will not be possible with connection graphs).Further, our escape analysis is performed statically, which avoids overheads during JIT compilation.
Kotzmann and Mössenböck [12,13] proposed an escape analysis that works in presence of dynamic classloading for the C1 compiler of the HotSpot JVM.Essentially, they reallocate objects replaced by scalars if the VM deoptimizes to the interpreter.The more recent GraalVM builds up on this idea and performs scalarization of objects while rematerializing them in branches where they escape [21].As some of the stack-allocated objects can also be replaced with scalars, it would be interesting to complement our implementation with scalar replacement.Interestingly, the current VM-only versions of these works require additional bookkeeping in order to materialize objects, which can possibly be reduced if (part of) the analysis was performed statically like our approach.We mark this as an interesting future extension.
The closest related works we found are those by Corry [10] and Cleereman et al. [9].The first one proposes an address check at write barriers to detect objects that live longer than the stack frame being destroyed, but performs stack allocation primarily around loops and by maintaining a separate region for object allocation on stack.The second one proposes to address dynamic classloading that may cause previously non-escaping objects to escape, by modifying the classloader and maintaining special dirty bits on the stack frame, but has no evaluation.Our approach, on the other hand, covers dynamic features including dynamic classloading, and does not require modifications to the classloader, stack management, or to any other traditional routines of the JVM; further, our work is supported by an extensive evaluation over a production JVM.
A recent work proposed a framework [22] for performing expensive analysis statically and using its results during JIT compilation, in context of libraries that may differ between static and dynamic compilation, and another work formalized the idea of staging program analyses across static and JIT compilation [3].Similar to them, we use bytecode indices to represent static-analysis results that are usable in a JVM.However, adapting their approaches for stack allocation would pose the exact problem that we solve in this paper: timely invalidating the results in presence of dynamic features such as hot-code replacement, dynamic classloading, and even the possibility of different libraries.Nevertheless, we are encouraged by this trend and believe static+dynamic analysis has many further applications that are yet to be explored.
An alternative approach of optimizing memory, popular particularly in the functional programming community [23], is to divide the run-time memory into inferred regions that can lead to faster access and cleanup of objects in the current region (a region may include multiple stack frames).Region inference is more expensive than escape analysis and is not employed by current object-oriented runtimes.However, we feel approaches like ours can be used to improve the efficacy of region-based memory allocation and open up the possibility of experimenting with interesting stack and heap allocation strategies, for popular OO runtimes as well.

CONCLUSION
Static and dynamic program analyses have historically been at two ends of a spectrum: researchers keep coming up with advancements to enhance the precision of static-analysis abstractions, and practitioners keep designing novel engineering techniques to keep language runtimes efficient.However, seldom do the ends meet, which is evident by the choice of imprecise analyses employed by JIT compilers to balance the tradeoff with efficiency.
Our work picked up an important OO optimization -that of allocating method-local objects on the stack frames of their allocating methods -and showed that using a static escape analysis to optimistically allocate identified objects on stack could improve the precision without thwarting the efficiency.Further, in order to ensure functional correctness in case the static-analysis results do not correspond to the run-time environment, it proposed the ideas of (i) dynamically checking incorrect stack allocations before they could cause an issue; (ii) repairing the memory layout by heapifying escaping objects and correcting their references; and (iii) doing so efficiently by ordering the objects on stack in a statically identified manner.
Our evaluation on a production JVM, across its interpreter and compiler infrastructure, showed that the proposed scheme is capable of bringing the benefits of a precise escape analysis for improving performance in low-memory systems, by reducing memory pressure and correspondingly, the garbage-collection overheads.Further, it was a significant innovation challenge not just to design a scheme that involved multiple levels of abstractions and coding environments, but also to implement it and reach a stage where it is ready to be integrated into an industry virtual machine.
Overall, we feel that sound and efficient static+dynamic approaches like ours open up possibilities to speed up and improve the precision of various analyses and optimizations, not only for Java Virtual Machines but also for similar languages and runtimes such as C# and the .NET framework, as well as for more dynamic languages such as JavaScript, R and Python.

Fig. 2 .
Fig.2.Motivating example to explain various aspects of our run-time scheme.

Fig. 7 .
Fig. 7. Performance comparison (default heap size); lower the be er.For DaCapo benchmarks the y-axis is time (in msec) and for SPECjvm benchmarks it is the normalized reciprocal of ops/sec.

Fig. 9 .
Fig.9.Performance comparison (minimum heap size for each benchmark); lower the be er.For DaCapo benchmarks the y-axis is time (in msec) and for SPECjvm it is the normalized reciprocal of ops/sec.

1
Procedure HeapificationCheckAtStore(lhsReg, rhsReg) 2 if rhsReg < stackBaseReg OR rhsReg > stackEndReg then No heapification required./* The rhs object is outside stack bounds */ stackBaseReg OR lhsReg > stackEndReg then /* The lhs object is outside stack bounds, hence the rhs object escapes */ 3 5 if lhsReg < 9 No heapification required.10 else /* The lhs object has been allocated in either the same frame or a deeper frame as compared to the rhs object */ 11 Perform stack-walk and heapify if needed.

Table 1 .
Stack allocation statistics for various benchmarks with BASE and OPT schemes.The first six benchmarks (biojava-zxing) are from DaCapo 23.10-chopin, the next eight (avrora-sunflow) are from DaCapo 9.12-MRI, and the last eight (compiler-signverify) are from SPECjvm 2008.

Table 2 .
Stack allocation statistics with forced compilation of all the methods, for the benchmarks where a longer run leads to noticeable improvements.Though for several SPECjvm benchmarks the absolute numbers are less than 1 MB, the order of increase is significant for all the benchmarks.Few examples worth pointing out are graphchi, h2 and pmd, where the bytes allocated on stack increase by orders of magnitude (9184.6MB, 10801 MB and 2465 MB, respectively with OPT, over 0 MB, 523 MB and 1310 MB with BASE)

Table 3 .
Time taken by our static analysis along with the .resfile overhead.