TrackFM: Far-out Compiler Support for a Far Memory World

Large memory workloads with favorable locality of reference can benefit by extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2× while retaining their programmer transparency, and that TrackFM can perform similarly to a state-of-the-art library-based system (within 10%). The application is merely recompiled to reap these benefits.


Introduction
Applications benefit from deep, hierarchical memories that match the program's available data locality to a memory tier with appropriate performance characteristics. For example, Lagar-Cavilla et al. found that applications access an average of 32% of their pages in Google's warehouse-scale system [20], and most pages are accessed only infrequently. This infrequent access presents an opportunity for a cheaper, slower tier of memory that sits between DRAM and disk. One example of such a far memory tier is remote memory, alternatively referred to as disaggregated memory [1]. In the remote memory model, DRAM on a remote server connected to the local machine with a high-performance interconnect serves as swap space. Remote memory systems accommodate memory-constrained applications by allowing workloads to scale across machines rather than requiring overprovisioning with expensive, large-memory hardware. This reduces ownership costs [40] and mitigates application crashes from unmet memory demands.
Remote memory can be implemented in hardware or software. This paper focuses on software-based remote memory, for which there are two primary techniques: kernel-based and library-based. The kernel-based approach modifies the OS paging subsystem [3,13,44], achieving programmer transparency: the application developer gets the advantages of remote memory for free; even unmodified binaries can benefit. Fastswap is a notable example that uses a modified Linux swap subsystem to leverage memory on a remote server via RDMA [3]. The programmer transparency of the kernel-based approach comes at a cost, however. For example, page fault overheads in the kernel impose a performance penalty on applications relative to using only local memory [35]. The hardware page fault cost creates a fundamental limitation on performance. Additionally, the architected page size of the hardware can poorly match the granularity of application objects, resulting in "I/O amplification," where more data is transferred than necessary. Specialized hardware can improve this situation by reducing the granularity of memory faults [7], but such capabilities are currently limited to research prototypes [32].
The library-based approach to far memory is an important alternative: developers use modified (or custom) libraries that include data structures designed to leverage remote memory, at granularities appropriate for the application, and entirely in user space. Application-integrated far memory (AIFM) [35] is the exemplar of this approach. AIFM builds on the Shenango runtime's [34] high-performance user-level tasking and networking to hide remote object fetch latencies using prefetching, concurrent fetch requests, caching, and automatic memory evacuation. AIFM can thus achieve considerably higher performance than Fastswap, especially for fine-grained objects. The library-based approach, however, trades programmer transparency for performance, since the application must be reimplemented to leverage remote memory. Implementations like AIFM do attempt to insulate application developers to some extent. In the best case, developers need only make minimal changes to their code to leverage remote versions of data structures (e.g., a remote HashMap). However, if the AIFM libraries do not provide appropriate data structures, developers must design their own.
The tension between transparency and performance in the kernel-based and library-based approaches creates an opportunity for a third alternative: compiler-based. We argue that modern compiler analysis and transformation techniques make it possible to simultaneously achieve programmer transparency and performance. To support this argument, we design, implement, and evaluate TrackFM, a compiler and runtime framework that achieves full transparency using semantics recovered and exploited by state-of-the-art compiler middle-end analyses and transformations, and achieves high performance by using the heavily optimized AIFM runtime as a backend. No specialized hardware or modifications to the OS are required. Using a mix of micro- and macro-benchmarks, we demonstrate that the compiler has sufficient knowledge to allow TrackFM to achieve near performance parity with AIFM (within 10%) while maintaining the programmer transparency of Fastswap.
We summarize our contributions as follows:
• We introduce the compiler-based approach to software-based far memory, which provides a path to simultaneously achieving programmer transparency and performance.
• We demonstrate how to use modern compiler analysis and transformation techniques to automatically transform existing applications to support far memory.
• We introduce new compiler analysis and transformation passes that improve performance for the target applications.
• We present the design and implementation of TrackFM, a new compiler-based far memory system.
• We report on an extensive performance evaluation using numerous microbenchmarks and applications.

Listing 1. Simple loop using AIFM's remote array.
TrackFM is freely available online.

TrackFM Design
Our goal is to use the compiler to approach the performance of library-based far memory solutions by automatically transforming existing applications, eliminating the need for programmer modifications. We aim to reuse the AIFM far memory runtime and automate its integration into the application. As an illustrative example, consider a for loop that computes the sum over an array of integers. To make this array remotable in the library-based solution (AIFM), the programmer must use the remote array type provided by the AIFM libraries. The programmer must then change their code manually, as shown in Listing 1. The highlighted lines indicate programmer changes. Although these changes are minimal, they require understanding of AIFM's semantics; namely, a scope object must be provided so that AIFM does not evacuate in-use local memory. Moreover, modifying applications with large code bases to run on AIFM may not be practical.
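To make the required manual changes concrete, the following is a hedged sketch of the AIFM-style pattern just described; the class and method names here are illustrative stand-ins, not AIFM's real API. The programmer wraps the array in a remote type and opens a scope object so the runtime will not evacuate in-use local memory.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for AIFM's remote array and DerefScope.
namespace aifm_sketch {
struct DerefScope {};                       // pins localized objects while live

template <typename T, uint64_t N>
struct RemoteArray {                        // stand-in for a remote array type
  T backing[N];                             // (locally backed in this sketch)
  T &at(DerefScope &, uint64_t i) { return backing[i]; }
};
}  // namespace aifm_sketch

// The sum loop from the example, with the programmer-added changes:
// constructing a scope and accessing elements through it.
long sum(aifm_sketch::RemoteArray<int, 4> &arr) {
  long total = 0;
  aifm_sketch::DerefScope scope;            // programmer-added change
  for (uint64_t i = 0; i < 4; i++)
    total += arr.at(scope, i);              // programmer-added change
  return total;
}
```

Even in this tiny loop, the programmer must know when a scope is required; TrackFM's goal is to have the compiler inject the equivalent logic automatically.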
We aim to transform unmodified C/C++ applications to use remote memory automatically. Figure 1 shows our overall design. Our compiler toolchain takes the unmodified C/C++ source code for an application and, using an LLVM-based, middle-end analysis and transformation pipeline, remotes certain memory allocations via AIFM. It also injects a thin runtime layer into the application that interfaces with AIFM. The toolchain produces a modified binary that runs on a far memory cluster. Our transformations take place at the IR level. The primary obstacle to automating the integration with AIFM is the semantic gap between the application developer's high-level knowledge of data structures and what the compiler sees at the granularity of memory accesses. AIFM works at the level of objects, contiguous chunks of remotable memory, and what constitutes an object is determined by the application developer. For example, when AIFM's object size is set to 256B, a remote 1KB array will be represented by four chunked AIFM objects. A remote linked list, on the other hand, might use an AIFM object size of 64B so that each object constitutes a single linked list node. Unlike AIFM, TrackFM works on unmodified code, so it must automatically determine the mapping of memory allocations to AIFM objects using low-level information (i.e., by drawing boundaries around chunks of contiguous memory allocations).
In kernel-based approaches, any page can be swapped to a remote node, while in AIFM, candidates for remoting are determined by which data structures the programmer declares using the AIFM data types. Our design strikes a middle ground, where any heap-allocated data can be swapped out (but not at the granularity of pages). Whether these heap-allocated regions actually are swapped out depends on temporal access patterns; hot regions will be kept local, while cold ones will be evacuated to the remote node. The TrackFM runtime tracks this "hotness" via AIFM's existing object access interposition mechanisms.
AIFM has several programmer-directed parameters that affect its performance, for example, the degree of concurrency, object size, and prefetching strategy. We will see how the compiler's choices for these parameters impact performance in Section 4. Since our compiler framework requires source code, programs that use external libraries present a challenge. The naïve route is to ignore external libraries; memory that they allocate will not be remotable. However, TrackFM needs to transform pointers to automatically remote them, and those transformed pointers can easily escape to library code, which does not know how to handle them. A library may then incorrectly attempt to access remote memory not yet localized by the TrackFM runtime. The alternatives are to (1) have programmers run external libraries through the TrackFM compiler or (2) allow only pre-transformed versions of the libraries, provided by us. In this paper, we explore both options, though the latter is more pragmatic.

Implementation
We first outline how TrackFM transforms applications to use far memory, then we describe how we incorporate the high-performance AIFM runtime with TrackFM. Finally, we describe our compiler transformations in detail, including how we manage the overheads they introduce. In this paper, we focus on realizing TrackFM in the context of C/C++ programs.

Far Memory Pointer Transformation
The first distinction that TrackFM must make is between remotable and local-only pointers. AIFM makes this distinction using far memory data structures. However, TrackFM cannot rely on user annotations since we target unmodified code.
Conceptually, all heap-allocated pointers must be managed by TrackFM, and all others (stack, global data, etc.) remain unchanged. However, as a pointer is just an address, we have no a priori way to tell them apart. TrackFM does this by overloading the higher-order bits of the address. In particular, it leverages x86 non-canonical addresses. The 60th bit of the address is used to flag a pointer as a TrackFM pointer. If this bit were set in any non-TrackFM pointer, the pointer would be invalid. To enforce this distinction, TrackFM provides a custom malloc implementation which replaces the default libc malloc. Our custom implementation always returns TrackFM (non-canonical) pointers.
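The tagging scheme above can be sketched as a few bit operations; this is a minimal illustration (helper names are ours, not TrackFM's), showing how setting bit 60 makes the address non-canonical on x86-64 so that any unguarded hardware dereference would fault.

```cpp
#include <cassert>
#include <cstdint>

// Bit 60 marks a pointer as TrackFM-managed (a non-canonical x86-64
// address); all other pointers (stack, globals, libc heap) lack it.
constexpr uint64_t kTrackFMBit = 1ULL << 60;

static inline uint64_t tag(uint64_t addr)    { return addr | kTrackFMBit; }
static inline bool is_trackfm(uint64_t addr) { return (addr & kTrackFMBit) != 0; }
static inline uint64_t untag(uint64_t addr)  { return addr & ~kTrackFMBit; }
```

A custom allocator that always returns `tag(...)`-style addresses gives the compiler an O(1) test to distinguish remotable from local-only pointers at runtime.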
Intuitively, a TrackFM pointer can refer to memory that is either on the local or the remote system. Thus, the program must be prevented from using the pointer directly. The compiler must provide an indirection layer that, when the pointer is accessed at runtime, localizes the memory and produces a standard pointer in the local address space. Thus, we must guard accesses to TrackFM pointers. These guards constitute compiler-injected code that ensures memory is localized before access; they comprise the lion's share of TrackFM's overheads, as we will see in Section 4. To properly guard pointers, the TrackFM compiler applies a series of analyses and transformations at the compiler's IR level (called passes), as shown in Figure 2. These passes are built on NOELLE [27], a novel analysis and transformation framework that expands LLVM [21] by introducing high-level and program-wide abstractions. We discuss each pass below.
Runtime Initialization. To make far memory transparent to programmers, this pass inserts hooks in the program's main function to initialize TrackFM's runtime system.
Pointer guards. In this pass, TrackFM searches for all LLVM IR-level load and store instructions that correspond to heap allocations (returned by malloc) and marks these instructions as eligible for guard transformation. The pass ignores accesses to stack and global objects by leveraging NOELLE's program dependence graph abstraction, which is powered by several high-accuracy memory alias analyses. Candidate heap pointers are later transformed by the guard transformation pass, described in Section 3.3.
Loop Chunking. We introduce a novel loop chunking analysis to reduce guard overheads introduced in loop bodies. Our loop chunking pass incorporates NOELLE's profiling facilities when available to further improve our optimization. We describe the relevant transformation in Section 3.4, and techniques to improve it in Section 4.2.
Libc Transformation. This pass transforms all memory allocation calls (mainly for heap allocation) in libc (e.g., malloc, realloc, free) into TrackFM-managed memory runtime calls. The TrackFM versions leverage AIFM's region-based allocator under the covers to allocate remotable memory. Custom heap allocators are not currently supported, but provided they simply replace libc malloc with their own managed heap, support would be trivial to add. We consider more complicated heap setups involving mmap() (e.g., using MAP_SHARED) out of scope for this paper.

Bridging AIFM with the Compiler
To integrate with AIFM, we use a lightly modified version of AIFM that includes hooks into the TrackFM runtime. We next discuss details of integrating TrackFM pointers with the AIFM runtime. In particular, we must transform contiguous heap allocations into AIFM objects, fixed-size chunks that can be in either the local or the remote state. We will see in Section 3.3 that significant complexity arises because a given heap allocation can comprise multiple AIFM objects, each of which may be in a different state (local or remote).
AIFM manages remotable memory at the level of individual data structures. Each of these data structures in the AIFM runtime is implemented as a C++ class which extends a base class that handles the underlying mechanisms of remote objects. We extend this base class with a unified abstract data structure (ADS) that the compiler uses to capture all remotable allocations for the application. With AIFM, programmers specify remote memory usage by leveraging one of these specialized data structures. With TrackFM, however, the compiler identifies all remotable allocations and attaches them to a single runtime-managed object pool. The ADS thus contains a pool of objects that represent the total far memory that an application can use. TrackFM interposes on an application's allocation sites and chunks the allocations into objects in the global pool at run-time.
Object size selection. In AIFM, the user/data structure developer annotates each data structure with an object size for a given application. Since TrackFM does not require programmer changes, it is currently constrained to choose a single object size at compile time for the entire application. Unlike Fastswap, which is constrained by the page size, TrackFM supports object sizes smaller than a page, mitigating I/O amplification. While multiple object sizes are possible, they would increase the complexity of the runtime system and compiler transformations, so we leave this for future work. We note that it is likely the case that only a few fixed object sizes make sense, and that these are likely to be powers of two ranging from 64B (cache line size) to 4KB (base page size). Using object sizes smaller than a cache line would saturate the network with many small packets and would not take advantage of the network's bandwidth, which is geared toward larger packets. On the other hand, much larger object sizes would suffer from I/O amplification and defeat the purpose of sub-page granularity far memory. While the choice of object size is currently made by us, the small search space suggests that an autotuning approach is feasible. Furthermore, if we are correct that only the powers of two from 2^6 (cache line) to 2^12 (base page size) need to be considered, an exhaustive search involving recompilation and a short-term execution would only modestly extend the short compile times.
Allocating far memory. TrackFM only remotes heap allocations and maintains a simple non-canonical address space to service memory allocation calls by the application. All memory allocation call sites within libc are intercepted by TrackFM and will return TrackFM-managed pointers starting from the non-canonical address range (beginning at address 2^60). Because TrackFM rewrites pointers at the middle-end, even if a pointer is cast to an integer type (for example, to perform offset math), the resulting load/store will still be properly guarded, provided that the non-canonical bits of the address are preserved. Internally, TrackFM maps non-canonical pointers to objects in an ADS. The object corresponding to a TrackFM pointer can be derived by dividing the TrackFM pointer by the object size (a right shift for power-of-two sizes [35]), enabling lighter-weight guards. Large allocations are chunked into multiple objects, while smaller allocations are grouped into a single object.
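The pointer-to-object mapping just described reduces to simple arithmetic; the sketch below (constant names are ours) shows how the object index falls out of a subtraction and a right shift when the object size is a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout for illustration: the TrackFM heap starts at the
// non-canonical address 2^60 and uses 4 KB objects.
constexpr uint64_t kHeapBase    = 1ULL << 60;  // start of non-canonical range
constexpr uint64_t kObjectSize  = 4096;        // 4 KB objects
constexpr uint64_t kObjectShift = 12;          // log2(kObjectSize)

// Derive the object index for a TrackFM pointer: divide the offset
// into the heap by the object size, i.e., a right shift here.
static inline uint64_t object_id(uint64_t trackfm_ptr) {
  return (trackfm_ptr - kHeapBase) >> kObjectShift;
}
```

Because the shift is a single cheap instruction, the guard can locate an object's metadata without any table walk over allocation records.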
TrackFM object state table. Any particular allocation could be in a superposition, i.e., some of its constituent objects (chunks) could be local while others are remote. AIFM tracks the local/remote state of objects by maintaining two metadata representations (one for each state) internally. Determining this state in AIFM requires two memory references: one to find the object, and another to access its metadata. TrackFM eliminates one of these operations by maintaining an object state table, an optimization that caches object metadata in a contiguous lookup table, allowing us to perform a simple index calculation rather than an indirect memory reference to derive object metadata. This is possible because of the way TrackFM encodes object IDs in the non-canonical range of the pointer. We modified AIFM so that this table is kept coherent with the AIFM-managed object metadata. The object state table contains a metadata entry (8B) for each object in the system, where the total number of objects is determined by the total size of the remote heap. The overhead of the table can be computed similarly to a single-level page table. For example, if we have a 32 GB remote heap (as in many of our experiments), we would need 2^23 entries in the table (assuming each object is 4KB), thus consuming 64 MB for the full table. As shown in Figure 3, the compiler-inserted guard derives the object metadata from this table in order to determine whether or not the referenced object is localized.
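The table-sizing arithmetic in the example above can be checked directly; this small sketch just reproduces the worked numbers from the text (32 GB heap, 4 KB objects, 8 B entries).

```cpp
#include <cassert>
#include <cstdint>

// Sizing the object state table, as in the worked example:
// one 8 B entry per object, one object per 4 KB of remote heap.
constexpr uint64_t kRemoteHeap = 32ULL << 30;           // 32 GB remote heap
constexpr uint64_t kObjSize    = 4096;                  // 4 KB per object
constexpr uint64_t kEntrySize  = 8;                     // 8 B per table entry

constexpr uint64_t kNumEntries = kRemoteHeap / kObjSize;    // number of objects
constexpr uint64_t kTableBytes = kNumEntries * kEntrySize;  // total table size
```

As in a single-level page table, the table grows linearly with the remote heap: 2^23 entries here, for 64 MB of metadata.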

TrackFM Guards
As described above, TrackFM instruments application-derived LLVM bitcode with guards on every relevant load and store instruction referring to heap-allocated memory at the LLVM middle-end layer. These guards comprise compiler-injected instructions that ensure the memory is localized (brought into local memory) before being accessed. TrackFM guards localize an object by reverting the non-canonical address returned from the TrackFM allocator back into a canonical address before execution of the target load/store. Figure 4 depicts the guard. Figure 4a shows an abstract depiction of the injected code as a control flow graph, and Figure 4b shows the guard after it has been lowered to x86_64 assembly. We break down the TrackFM guard into three components: a custody check, a fast-path guard, and a slow-path guard. Note that on the fast path only one of those instructions is a data access (to the object state table) that can result in a cache miss. Figure 4b highlights the fast path through the guard with vertical orange lines on the left. Note that we can also enable optional debug instrumentation that indicates when guards take the fast or slow path, and which AIFM code path they trigger.
Custody check. TrackFM first checks whether the pointer is managed by TrackFM. Recall that this means only heap-allocated memory. If a pointer is not managed by TrackFM, we immediately jump to the target load/store. This path constitutes roughly four instructions. If the pointer passes the custody check (i.e., it is a TrackFM pointer), we perform a table lookup to derive the object state table entry corresponding to the AIFM object, and then load the object state of the TrackFM pointer. This path constitutes roughly six instructions.
Fast-path guard. We use AIFM's internal object metadata to determine if an object is safe to access, i.e., guaranteed to be local. Safety is satisfied if certain bits in AIFM's internal metadata representation are cleared. When safety is satisfied, the fast-path guard will be taken, constituting 14 instructions. Note that it appears that there is a time-of-check to time-of-use issue between the test instruction (line 6) and the actual target load/store (line e). That is, if the safety check passes and this application thread gets context switched out (or even if there is a race), an evacuator might run on another core and delocalize the object, rendering the pointer invalid for the final target load/store. This issue is prevented because AIFM's evacuator threads use a barrier that waits on all application threads to converge to a state where remotable pointers are "out-of-scope." While within the context of a TrackFM guard, the application thread is guaranteed not to be in this "out-of-scope" state, preventing the convergence necessary for the evacuator to proceed. This means that between line 5 and line e, the object cannot be evacuated.
Slow-path guard. If the object is unsafe to access, then we must call into the TrackFM runtime. TrackFM in turn calls into the AIFM runtime to dereference the object, which could involve a remote fetch. When TrackFM interfaces with AIFM here, it adheres to AIFM's internal DerefScope API (shown in Listing 1), and also triggers a periodic collection point to allow stale objects to be evacuated to the remote node. This runtime call in the slow path, which has a higher cost, ensures safety.
Once TrackFM ensures safety, it performs the target load/store. The slow-path guard comprises at least 144 instructions when the pointer's object is already localized. However, if the object is remote, the cost of the slow-path guard will be dwarfed by the remote fetch cost.
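The three-part guard structure described above (custody check, fast path, slow path) can be summarized in a simplified C++ sketch. The state-table layout and the runtime entry point here are our own stand-ins, not the actual TrackFM or AIFM code, and the sketch omits the evacuator-barrier interaction.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kTrackFMBit = 1ULL << 60;  // assumed TrackFM tag bit
constexpr uint64_t kObjShift   = 12;          // assumed 4 KB objects

enum : uint8_t { kLocal = 0, kRemote = 1 };
extern uint8_t object_state[];                // stand-in object state table

// Stand-in for the runtime call that localizes an object (may fetch
// remotely) and returns a canonical pointer.
uint64_t localize_slow_path(uint64_t p);

static inline uint64_t guard(uint64_t p) {
  if (!(p & kTrackFMBit))                     // custody check: not ours,
    return p;                                 // use the pointer directly
  uint64_t id = (p & ~kTrackFMBit) >> kObjShift;
  if (object_state[id] == kLocal)             // fast path: already local,
    return p & ~kTrackFMBit;                  // strip the tag and proceed
  return localize_slow_path(p);               // slow path: runtime call
}
```

The returned canonical address is what the original load/store finally dereferences; only the table lookup on the fast path touches memory.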

Managing Loop Overheads
Up to this point, we have focused on direct pointer accesses. However, there are many cases where pointers are accessed via an offset, a major example being array accesses. It is common for such accesses to occur in loops. Ideally, when iterating over a collection (e.g., an array) in a loop, we could localize the entire array at the beginning of the loop, bringing any remote elements local before accessing them. This optimization was commonly employed by compiler-based DSM frameworks [26,28,31]. However, because we build on AIFM, and a single collection might constitute multiple AIFM objects, the entire collection might be in a superposition (simultaneously local and remote). Moreover, the entire array may not fit in memory. This renders the DSM-style hoisting optimizations ineffective, and it means that all pointer accesses within a loop body must be guarded.
However, when many collection (array) elements fit within a single AIFM object, many of these guards are redundant. They are only necessary when we cross object boundaries in the loop. In AIFM, the iterator classes developed by the library developer for the remote data structures manage this overhead. With TrackFM, we leverage the compiler's knowledge of the loop to reduce this overhead by developing a loop chunking optimization for TrackFM pointers.
Figure 5 depicts such a situation with a contiguous array, where multiple array elements fit within an AIFM object. The naïve guard insertion strategy involves injecting guards at every element access. The slow-path guards (shown in red) will be taken at object boundaries, i.e., when i is a multiple of the object size, and fast-path guards (blue) will be taken on every other access. With our optimization, the compiler can determine the induction variable of a loop, including its step count and start value, so it knows that sequential accesses within the boundaries of an already fetched object do not require fast-path guards. This trades many fast-path guards for a slightly more expensive locality invariant guard at object boundaries that calls into the runtime to pin the object in local memory for the duration of accesses to the object (one loop chunk). Object boundary checks (yellow) are also inserted on every access to detect when the locality invariant guard should be taken. Note that this optimization is not just applicable to contiguous arrays; it applies more generally to loops that employ a loop-governing induction variable, which is common in practice (we will see this in Section 4.5).
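The shape of the transformed loop can be illustrated with a simplified sketch (ours, not TrackFM's generated code): instead of guarding every element, the loop pins one object at each chunk boundary and iterates guard-free within it.

```cpp
#include <cassert>
#include <cstdint>

// Assumed density for illustration: 4 KB object / 8 B element = 512.
constexpr uint64_t kElemsPerObj = 512;

long g_pins = 0;                               // counts locality-invariant guards
void pin_object(const long *) { g_pins++; }    // stand-in for the runtime call

// Chunked version of a simple sum loop: one pin per object chunk,
// unguarded accesses inside the chunk.
long chunked_sum(const long *a, uint64_t n) {
  long total = 0;
  for (uint64_t i = 0; i < n; i += kElemsPerObj) {
    pin_object(&a[i]);                         // locality invariant guard
    uint64_t end = (i + kElemsPerObj < n) ? i + kElemsPerObj : n;
    for (uint64_t j = i; j < end; j++)
      total += a[j];                           // no per-element guard
  }
  return total;
}
```

For a 1024-element array this executes two pins instead of 1024 per-element guards, which is the source of the speedups reported in Section 4.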
The analysis pass for the loop chunking optimization searches for spatially local memory accesses that occur in loops, typically a common location for hot code. Upon finding these accesses, TrackFM attempts to mitigate the overhead from guards in the loop body by chunking the original pointer into object-size chunks. To identify such memory accesses, TrackFM makes use of NOELLE's induction variable (IV) analysis. This analysis is unique in that it detects induction variables as patterns in the dependence graph, rather than building on variable analysis. This lets us capture significantly (∼3×) more induction variables than is traditionally possible. However, TrackFM can also be adapted to use other IV analyses should better techniques arise. Note that there is no correctness issue if the IV analysis misses induction variables; it just results in lost loop chunking opportunities. We plan to further generalize our loop analysis in the future, for example by adapting polyhedral methods [43] to NOELLE.
Our optimization is particularly effective for workloads that display high regularity. Prefetching plays an important role in such workloads. TrackFM can detect sequential access at compile time, so it uses prefetching alongside loop chunking to mitigate loop overheads. This has an increasing impact on performance as the number of pointers iterating over induction variables in a loop increases. This demonstrates a strength of the compiler-based approach to far memory: kernel-based approaches cannot take advantage of such loop-centric memory analysis; they must make post hoc inferences based on run-time page faults.
Improving Loop Chunking. Loop chunking is not always beneficial. In particular, when array elements are large (so that fewer of them fit within a fixed-size AIFM object), or when loops have a small iteration space, there are fewer fast-path guards in the first place. If we apply the loop chunking transformation in such cases, performance can actually drop relative to the standard guards. Intuitively, there is a break-even point at which sequential array access occurs at a fine enough granularity for the transformation to pay off. To help the compiler determine where that point is, we develop a simple cost model.
Cost Model. Let O be the size in bytes of a TrackFM object, and let E be the size in bytes of an element in a collection accessed in a loop. For example, for an 8-byte integer, E would be 8. We model the number of elements that fit within a single TrackFM object as the object density, D = O/E. We are interested in determining how densely elements must be packed before the compiler applies the loop chunking transformation. Intuitively, the denser an object, the more fast-path guards are involved, so the more advantageous the optimization will be. Conversely, if there are few elements per object, the transformation could be detrimental. With the naïve transformation, each loop will iterate over some number of objects, and each object must incur a fast-path guard for each element access except the first, which requires a slow-path guard. For each object there will thus be one slow-path guard and D − 1 fast-path guards. Slow-path guards have cost C_s and fast-path guards have cost C_f. We model the guard costs at the level of individual objects. We can then estimate the per-object guard cost of the naïve loop as:

C_naive = C_s + (D − 1) · C_f

Our loop chunking optimization replaces fast-path guards (14 instructions) with less expensive object boundary checks (3 instructions) that determine when an object boundary is crossed. The object boundary checks are shown as small, yellow circles in Figure 5.
Slow-path guards are replaced with slightly more expensive locality invariant guards (orange circles) at object-crossing boundaries, which involve a call into the runtime. We model the cost of the boundary checks as C_b and the locality invariant guards as C_i. The per-object guard cost of the transformed loop is then:

C_chunk = C_i + D · C_b

When a loop iterates over large elements, the relatively high cost of the invariant guard can offset the elimination of the fast-path guards. Thus, we must only apply the optimization when there is sufficient object density, i.e., when:

C_i + D · C_b < C_s + (D − 1) · C_f

Figure 6 shows the projected cost of a simple loop with a varying number of iterations for the baseline method and the loop chunking optimization. The chunking optimization becomes preferable once an object comprises as few as ∼730 elements. The curve on the plot shows empirical measurements of loop cost. Note that the projected break-even point matches the empirical data. Thus, if the compiler can determine D, we can make intelligent choices about when to apply the loop chunking transformation. To do this, we leverage
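The break-even reasoning above can be sketched numerically. In the sketch below, the fast-path guard cost (14 instructions) and boundary-check cost (3 instructions) come from the text; the slow-path and locality-invariant guard costs passed to the test are hypothetical placeholders, not measured TrackFM numbers, and the symbol names (c_s, c_f, c_i, c_b for slow-path, fast-path, invariant, and boundary-check costs) are ours.

```cpp
#include <cassert>

// Per-object guard cost of the naive loop:
// one slow-path guard plus D-1 fast-path guards per object.
double naive_cost(double D, double c_s, double c_f) {
  return c_s + (D - 1.0) * c_f;
}

// Per-object guard cost of the chunked loop:
// one locality-invariant guard plus D boundary checks per object.
double chunked_cost(double D, double c_i, double c_b) {
  return c_i + D * c_b;
}

// Apply loop chunking only when the density D makes it cheaper.
bool chunking_wins(double D, double c_s, double c_f, double c_i, double c_b) {
  return chunked_cost(D, c_i, c_b) < naive_cost(D, c_s, c_f);
}
```

Solving the inequality gives a density threshold D > (c_i − c_s + c_f) / (c_f − c_b), so the compiler only needs D and the four cost constants to make the decision.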

Evaluation
The TrackFM compiler must make choices about how it structures far memory objects and passes information to the runtime system. We evaluate the performance impact of these choices and the overheads of compiler-injected guards using microbenchmarks, studying the impact of different types of workloads and access patterns in a controlled setting. We then demonstrate that by making good choices, TrackFM can approach the performance of AIFM on application benchmarks while maintaining programmer transparency. We seek to answer the following questions in our evaluation:

For Fastswap, we use the latest version, ported to the 5.0 kernel. We use the most recent publicly available version of AIFM. TrackFM builds on LLVM version 9.0.0, with NOELLE v9.8.0. For C++ applications, we use libc++ version 9 provided by clang (we compile it directly with TrackFM). For large codebases, we use WLLVM to produce bitcode for the entire application before passing it to the TrackFM compiler.

Guard Overheads
The primary source of TrackFM's overhead is the compiler-inserted guard instructions at the bitcode level on heap-allocated loads and stores. Table 1 shows their costs in cycles relative to local load/store operations. The additional overhead for a fast-path guard relative to a local, unmodified load/store instruction (36 cycles) is 21 cycles. This will be the common case for applications that have locality of access. The uncached slow-path and fast-path guards are more expensive, but still cheaper than a page fault. The slow-path guard is similar in cost to a major page fault in Fastswap when an object is not present in local memory because both events trigger a remote fetch over the network. For reference, Table 2 compares slow-path guards to remote page fault costs in Fastswap (both when the page is local and when it is remote). Handling a page fault in the kernel incurs 2.9× the cost of handling a slow-path guard in TrackFM when the data is local. This changes when the object/page is remote due to Fastswap's fast RDMA backend, which outperforms our use of AIFM's TCP-based backend (from Shenango) when there is not sufficient concurrency. However, even with this high-performance networking layer, Fastswap still provides little benefit over our remote slow-path guard. This is due to Fastswap's page fault handling overheads (e.g., mapping and cgroups memory reclamation).
If we instrument every load and store to heap-allocated memory, what would the costs be? To provide initial intuition, we used TrackFM to automatically transform the STREAM benchmark [29], which has a 9GB working set. This transformation produces up to 56 million slow-path guards and ∼10 billion fast-path guards. Note that we must pay the cost of these guards even when objects are local. Neither kernel-based nor library-based approaches pay such costs for local objects (though AIFM does incur overhead for smart pointer indirection). Thus, it would seem that these guards present an insurmountable barrier to achieving good performance. However, as we will see in the next section, TrackFM can exploit regularity in the workload to dramatically reduce the number of guards.
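To give a concrete sense of what such a guard checks, the following is a minimal sketch in C. The tagged-pointer encoding (top bit meaning "object is local") and the names (`PRESENT_BIT`, `slowpath_fetch`, `guarded_load`) are illustrative assumptions, not TrackFM's actual representation; the point is only that the fast path is a bit test and branch before the real load, while the slow path calls into the runtime.

```c
#include <stdint.h>

#define PRESENT_BIT (1ULL << 63) /* assumed encoding: top bit set => object is local */

static long slowpath_fetches = 0;

/* Stub for the runtime's slow path: in a real far memory system this
 * would fetch the object over the network and install it locally.
 * Here we just strip the tag and count the invocation. */
static long *slowpath_fetch(uint64_t tagged) {
    slowpath_fetches++;
    return (long *)(uintptr_t)(tagged & ~PRESENT_BIT);
}

/* Guard inserted before a heap load: a bit test and branch on the
 * fast path, a call into the runtime on the slow path. */
static long guarded_load(uint64_t tagged) {
    if (tagged & PRESENT_BIT)
        return *(long *)(uintptr_t)(tagged & ~PRESENT_BIT); /* fast path */
    return *slowpath_fetch(tagged);                         /* slow path */
}
```

Because the fast path compiles down to a handful of instructions with a predictable branch, its cost stays low in workloads with good locality, which is the common case the guard design targets.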

Mitigating Guard Costs
To make compiler-assisted far memory feasible, there are two paths to increasing performance: (1) reduce guard costs, and (2) reduce the number of guards. We spent significant effort on (1), making the common-case fast-path guard involve a small number of instructions (only 14). In this section, we focus our discussion on the second path.
Loop chunking transformation. Loop chunking, described in Section 2, eliminates fast-path guards, a key factor for improving performance. To understand its impact, we first evaluate its effects on the STREAM benchmark, which involves sequential access to arrays of small elements (integers) and is simple to transform. The "Sum" test consists of a single memory access to an array element (sum += a2[i]) within a loop. "Copy" consists of two memory accesses (a1[i] = a2[i]) within the loop body. Figure 7 shows the speedup when using the optimized loop chunking transformation relative to the naïve transformation, where every loop element involves a fast-path guard. The total working set size for both examples is fixed at 12GB to aid in comparison. Note that the local memory constraint enforced on the application does not include the metadata used by AIFM/TrackFM.
We see that as the number of memory accesses within the loop increases (looking at the figures top to bottom), the speedup offered by loop chunking increases due to the larger number of fast-path guards that are eliminated. For example, for "Sum," we reduce the fast-path guard count from ∼1.6 billion to zero. Notice that the horizontal axis sweeps the amount of local memory available to the application, with memory pressure increasing to the left. These graphs tend to rise toward the right-hand side because in that regime the system is less network-bound, so the importance of eliminating guard overheads is amplified.
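The transformation itself can be sketched as follows. This is a simplified model, not TrackFM's generated code: we assume a fixed number of array elements per far-memory object (`OBJ_ELEMS`) and a counter standing in for guard execution, to show how chunking replaces a per-access fast-path guard with one guard per object boundary.

```c
#include <stddef.h>

#define OBJ_ELEMS 1024 /* assumed: array elements per far-memory object */

static long guards_executed = 0;

/* Stand-in for a guard at an object boundary: ensures the object
 * containing a[i .. i+OBJ_ELEMS) is local before the inner loop runs. */
static void guard_object(const long *chunk_base) {
    (void)chunk_base;
    guards_executed++;
}

/* Naive transformation: one fast-path guard per memory access. */
static long sum_naive(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        guards_executed++; /* fast-path guard on every load */
        sum += a[i];
    }
    return sum;
}

/* Chunked transformation: guard once per object, then run a
 * guard-free inner loop up to the object boundary. */
static long sum_chunked(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; ) {
        size_t end = i + OBJ_ELEMS < n ? i + OBJ_ELEMS : n;
        guard_object(&a[i]); /* one guard per object */
        for (; i < end; i++)
            sum += a[i];     /* no per-access guards */
    }
    return sum;
}
```

For a sequential scan of n elements, this reduces the guard count from n to n/OBJ_ELEMS while computing the same result, which is exactly the effect measured for "Sum" above.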
Improved Loop Chunking. To showcase how profiling can be coupled with our cost model from Section 3.4, we automatically transformed a k-means benchmark, which contains many loops for which it would be detrimental to apply the loop chunking transformation. We run k-means with 30 million points. The working set size is fixed at 1GB.
Figure 8 shows the results of applying the loop chunking optimization indiscriminately to all loops compared to applying it only to those loops identified as viable candidates by the TrackFM profiler, according to our cost model.
Both lines are normalized to the baseline (no loop chunking) to measure speedup. The figure shows that applying the loop chunking transformation indiscriminately produces poor results, suffering a 4× slowdown on average. This is because k-means has many nested loops with low object density; such nested loops amplify the cost of loop chunking. In this case, there were at least 512 array elements per AIFM object. The chunking optimization detects 103 array pointers, and after applying the cost model only 27 were optimized. Applying the cost model to the loop chunking pass here improves the situation considerably, resulting in a mean speedup of 2.5×.

Figure 9. Impact of object size on STL maps. Fine-grained memory accesses with little spatial locality can benefit from small object sizes.
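The kind of decision the cost model makes can be sketched as a simple comparison of estimated cycle counts. The constants below are illustrative placeholders (only the 21-cycle fast-path figure comes from Table 1), and the function is not TrackFM's actual model: chunking pays off only when the fast-path guards it removes outweigh the per-iteration bounds check and the boundary guards it adds.

```c
#include <stdbool.h>

/* Illustrative per-event cycle costs. FAST_GUARD_CYCLES is from
 * Table 1; the other two are made-up placeholders. */
#define FAST_GUARD_CYCLES      21
#define BOUNDS_CHECK_CYCLES     2
#define BOUNDARY_GUARD_CYCLES 150

/* Decide whether chunking a loop is profitable, given its expected
 * trip count and the number of loop elements sharing one far-memory
 * object (the "object density"). */
static bool should_chunk(long trip_count, long elems_per_object) {
    long naive   = trip_count * FAST_GUARD_CYCLES;
    long chunked = trip_count * BOUNDS_CHECK_CYCLES
                 + (trip_count / elems_per_object + 1) * BOUNDARY_GUARD_CYCLES;
    return chunked < naive;
}
```

Under this toy model, a long scan over dense objects is worth chunking, while a short inner loop, or a loop whose iterations each touch a different object, is not, which matches the behavior observed in k-means.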

AIFM Parameters
The TrackFM compiler must make two primary choices when integrating with AIFM: the object size and the prefetching strategy. This section explores the impact of those choices.
Object size. TrackFM currently chooses an object size at compile time, though this choice could in principle be informed by profiling. To evaluate the impact of this choice, we compare two microbenchmarks with different degrees of spatial locality and granularity of access.
The first microbenchmark involves accessing a hashmap, much like how a key-value store would operate. We use the unordered hashmap implementation from the C++ STL. Both keys and values are 4B integers. In this case, the entire C++ STL is transformed by the TrackFM compiler. The working set size is 2GB. We use a workload generator to access the hashmap (50 million lookups) according to a Zipfian distribution with skew 1.02. To generate the access trace, we store a sequence of keys sampled from the distribution in a separate 190MB array also allocated on the heap.
In this case, a small handful of the entries in the hashmap will constitute the majority of accesses, so there will be a high degree of temporal locality (but little spatial locality), and accesses occur at very small granularities (4B). The left side (Figure 9a) shows the impact of varying object size as we sweep the amount of local memory available, and the right side (Figure 9b) highlights the impact for a fixed proportion of the working set size available to local memory (25%). We measure the throughput (MOps/s) of the generated workload. In this case, a smaller object size is clearly preferable.
If we look again at STREAM, where the access pattern shows almost perfect spatial locality, we would expect to see different results. Here, we use the "Copy" benchmark from STREAM with a working set size of 9GB. In this case, we measure the far memory bandwidth (the default metric reported by STREAM). Though the granularity of access for this example is even smaller (integers), the high degree of spatial locality necessitates chunking elements into larger objects. In this case, 4KB is the better choice. Figures 9 and 10 highlight that proper selection of object size is critical to performance. While we currently make this choice offline, we envision using profiling to make this choice when application code is recompiled with TrackFM.
Prefetching. When much of the application's memory is remote, the costs of remote fetches can dominate execution time. To mitigate network costs in this regime, TrackFM must employ prefetching to exploit spatial locality. We again run an experiment on STREAM, this time with and without prefetching enabled. In this case, we use AIFM's existing stride prefetcher, and we prefetch pointers operating on induction variables as identified by TrackFM's loop chunking pass. Figure 11 shows the speedup of using prefetching relative to no prefetching as we sweep the amount of local memory available. The loop chunking optimization discussed previously is enabled in both cases. If we focus on the left-hand side of the figures (where remote costs dominate), we see a large impact (almost 5×) on overall performance. As more local memory becomes available, the cost of guards dominates, so the impact of prefetching diminishes. We validated this with experiments (not shown for space) demonstrating that the relative number of critical remote fetches, i.e., the number of loads/stores blocked by first having to fetch the object from remote memory when prefetching is disabled, is reduced dramatically with prefetching.
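The interaction between chunking and stride prefetching can be sketched as follows. This is a toy model, not AIFM's prefetcher: `prefetch_object` stands in for the runtime's asynchronous fetch, and the prefetch distance and object size are assumed constants, chosen only to show how a fetch for a future chunk is issued at each object boundary so the transfer overlaps with computation on the current chunk.

```c
#include <stddef.h>

#define OBJ_ELEMS     1024 /* assumed: array elements per far-memory object */
#define PREFETCH_DIST    4 /* assumed: objects fetched ahead of the loop */

static size_t prefetch_issued = 0;

/* Stand-in for an asynchronous runtime fetch of the object holding
 * address p; a real system would start a network transfer here. */
static void prefetch_object(const long *p) {
    (void)p;
    prefetch_issued++;
}

/* Chunked loop with stride prefetching: at each object boundary, kick
 * off a fetch for the object PREFETCH_DIST chunks ahead, then process
 * the current chunk while that transfer is (conceptually) in flight. */
static long sum_prefetch(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; ) {
        size_t end = i + OBJ_ELEMS < n ? i + OBJ_ELEMS : n;
        if (i + PREFETCH_DIST * OBJ_ELEMS < n)
            prefetch_object(&a[i + PREFETCH_DIST * OBJ_ELEMS]);
        for (; i < end; i++)
            sum += a[i];
    }
    return sum;
}
```

When remote fetch latency dominates, overlapping transfers this way hides most of the network cost, which is why the benefit is largest at the memory-constrained left side of the figures.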
Figure 12 shows the speedup relative to Fastswap on STREAM when we apply both chunking and prefetching. TrackFM performs ∼2.7× better than Fastswap for Sum, and ∼2.9× better for Copy. In this case, Fastswap is limited by its page fault costs and by its weaker ability to discern high-level knowledge about the access pattern. Note that AIFM could achieve similar (even slightly better) performance here, but would require programmer modifications.

Mitigating I/O Amplification
One of AIFM's major goals is to reduce I/O amplification, i.e., the unnecessary localization of unused memory, for workloads that access memory at fine granularity. Can TrackFM achieve the same goal? Figure 13a recreates our hashmap example, which will be sensitive to I/O amplification due to the small key/value pair sizes (4B). This time we show how overall performance is highly correlated with the amount of data transferred. We see how the smaller object size chosen by TrackFM significantly reduces the amount of data transferred over the network relative to Fastswap, which uses the standard 4KB page size. Fastswap transfers 43× the working set size for the hashmap, while TrackFM amplifies the working set by only 2.3× (the 64B object size chosen here is still larger than the key/value pairs). The net effect of reducing I/O amplification in this case is an average speedup of 12× relative to Fastswap. Though AIFM can achieve similar or higher speedups with programmer effort, this involves porting libc++ (457 KLOC) to AIFM, a non-trivial task.
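The arithmetic behind these amplification numbers is simple to state. The sketch below is illustrative, not TrackFM code: it computes the per-cold-access lower bound on amplification, i.e., the whole transfer granule must be moved to use only a few bytes of it. With an 8B key/value pair, a 4KB page moves 512× the useful data per cold access, while a 64B object moves only 8×; the measured 43× and 2.3× figures are aggregates over the whole run, where reuse of already-local granules reduces the effective amplification.

```c
/* Illustrative lower bound on I/O amplification for one cold access:
 * the entire transfer granule must be moved to use `useful_bytes`. */
static double amplification(double granule_bytes, double useful_bytes) {
    return granule_bytes / useful_bytes;
}
```

This per-access view explains why shrinking the transfer granule is the dominant lever for fine-grained workloads: the ratio scales directly with granule size.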
The array storing the access trace for the keys alone requires 190MB, and with local memory constrained to 5% of total application memory (only 128MB), we see high memory pressure, resulting in many object evacuations and swap-ins. Thus, we see an inflated execution time (∼200s) for the first point on the left of Figure 13a.

Application Benchmarks
How do injected guards, remote costs, and our optimizations translate to overall application performance? We explore this question with two application benchmarks. The first is a data analytics workload taken from Kaggle that analyzes New York City taxi trips. We adapted this benchmark from AIFM to validate our results against that paper [35]. The second application is memcached [12], a commonly used in-memory key-value store. We also evaluate several benchmarks from the widely used NAS suite [5].
Analytics Application. The analytics application has a working set size of 31GB. We compare the performance of the application automatically transformed with TrackFM to the same application running on Fastswap and AIFM. This analytics application builds on a custom C++ dataframe library, and while we can correctly transform this library, our loop optimizations will not work efficiently due to C++ semantics such as exception handling, which the existing loop analysis in NOELLE does not support. We concluded that extending NOELLE to support such C++-specific semantics would require engineering effort not justified by the research value added. Instead, we ported the original C++ dataframe library used in that paper to C, and the results reported for the analytics workload use the C dataframe library. Figure 14 shows that when the available local memory is constrained, TrackFM comes within 10% of AIFM's performance, reaching near parity. Fastswap's performance converges when remote costs stop dominating, when roughly 75% of the working set fits in local memory. To explain these results, we measured the number of major (remote) page faults in Fastswap and the number of slow-path guards injected by TrackFM. We see that the page fault count (page faults imply one-sided RDMA operations in Fastswap) is much higher than the TrackFM guard count; both event counts strongly correlate with overall performance. The analytics application consists of many column scan operations, which involve tight loops with almost no temporal locality but a high degree of spatial locality. TrackFM can exploit this to eliminate much of the guard costs, and also benefits from the AIFM runtime.
How impactful is our loop chunking optimization here? Figure 15 breaks down the performance similarly to Section 4.2, where we run the benchmark without loop chunking, with loop chunking applied to all loops, and with it applied only to candidate loops identified by our cost model. This application has several aggregation operations that involve loops iterating over small collections of table rows (low object density), so applying the model here clearly has benefits for reducing guard costs.

Table 3. NAS benchmarks (C++ versions) run on TrackFM.
Memcached. In-memory key-value stores represent another end of the access pattern spectrum. Here, access patterns tend to show much less spatial locality, and the granularity of access tends to be quite small, so there is significant sensitivity to I/O amplification. We use TrackFM to automatically transform memcached version 1.2.7 to run as a far memory application. We use key/value pair sizes based on the USR distribution [4]. The working set size for memcached is 12GB, and we constrain the local memory to 1GB. We use a workload generator to create get operations on a Zipfian-distributed set of 100M keys. We measure the overall throughput for all get operations. Figure 16 shows the results. TrackFM shows a ∼1.7× improvement over Fastswap when the skew parameter for the access distribution is between 1.01 and 1.04. As the access distribution becomes more skewed, we see an average speedup of 1.3× over Fastswap. As the skew parameter increases, Fastswap's performance converges due to increased temporal locality, which helps to amortize its page fault costs. While not shown in the figure, as we increase the amount of memory on the local node, Fastswap converges at an even smaller skew, since more hot keys in the working set can fit on the local node. In this regime, TrackFM's fast-path guards become expensive, as they are not amortized like page faults. As the access distribution becomes less skewed, however, TrackFM outperforms Fastswap due to reductions in I/O amplification. We verify this by measuring the total data transferred over the network. Figure 16c shows that Fastswap, limited by the architected page size, transfers 66× the working set size, much of which is unnecessary since the key-value pair sizes are small. In contrast, TrackFM benefits from small object sizes and transfers only 15× the working set.
NAS benchmarks. We use a reference C++ implementation of the NAS serial benchmark suite [23], and select a limited subset (details shown in Table 3) due to time constraints. Figure 17a shows TrackFM outperforming Fastswap for most benchmarks, where page faults are the limiting factor for Fastswap. FT is a notable outlier where TrackFM performs poorly. First, the FFT implementation in NAS has a particularly friendly access pattern for Fastswap involving good temporal reuse, allowing it to amortize its page fault costs. Further investigation revealed that TrackFM also injects an exceptionally large number of guards for FT. We found that the deeply nested, tight loop structure used in FT confounds our loop analysis, resulting in the high guard count. However, we found that this is mainly an artifact of the default analysis pipeline in NOELLE. By default, NOELLE sees unoptimized code from LLVM. In our case, however, it makes more sense to have NOELLE accept pre-optimized code to minimize the number of guards that are injected. For example, redundant code elimination or dead code elimination can reduce the number of loads and stores, and thus the number of guards. We verified this in Figure 17b, where we perform the chain of optimizations included in the "O1" set before the TrackFM passes (TFM/O1). This results in a 6× reduction in memory instructions for FT, and a 4× reduction for SP, dramatically reducing guard overheads. This experiment led us to change NOELLE's default optimization pipeline order for use with TrackFM.

Compilation Costs
TrackFM increases generated code size by an average of 2.4× relative to the original binary. This increase is roughly proportional to the number of memory instructions in the program, each of which is expanded into a guard by the standard transformation. TrackFM's compile time is under 6× that of standard LLVM, though we have not yet focused effort on reducing compilation overheads.

Discussion
Section 4 showed that the compiler-based approach holds promise. We now attempt to convey some hard-earned insights from our work, its limitations, and future prospects.
Lessons. We spent significant effort engineering the guards to be lightweight. This did pay off, but we were surprised to find that exploring ways to eliminate guards entirely was the more fruitful path, though this is somewhat obvious in retrospect. We were also surprised by how well kernel-based approaches perform when there is sufficient temporal locality. This is because page fault costs are quickly amortized when there is repeated access. Even in this scenario, however, they are still sensitive to I/O amplification. This suggests that a hybrid approach (compiler and kernel) holds promise.
Understanding the high-level semantics of access patterns (e.g., access over an array, a list, etc.) is critical for performance. We expect greater benefits when we can capture information about recursive data structures [25]. Finally, we found that in some cases, application code optimized for locality of reference can actually confound efforts by the compiler to derive fine-grained information about the access pattern. For example, memcached uses an optimized slab allocator that batches small allocations, grouping small objects into large chunks. This actually limited TrackFM's ability to mitigate I/O amplification; TrackFM could have transformed this application more effectively had it performed small allocations in the naïve way.
Hardware Support. The overhead of TrackFM's guards could be improved with new hardware extensions. In the limit, the hardware can interpose on remote accesses and track dirty objects on its own, for example by extending the cache coherence engine (as in Kona [7]). However, while this approach is attractive from the standpoint of transparency, it forgoes the benefits of the high-level knowledge available to the compiler. An extension more appropriate for TrackFM might involve hardware that the compiler could manage, e.g., a lightweight, sub-page triggering mechanism that vectors directly to user-space (in contrast to the existing userfaultfd mechanism in Linux [18]). This might, for example, look like a software/hardware stack built atop range translations [17] and user-level fault handling.13

Limitations and Future Work. The impact of AIFM's object size parameter is workload-dependent, so users must currently choose it. We believe it would be fairly simple to remove this engineering limitation by using autotuning, as discussed in Section 3.2.
Since TrackFM operates at the level of LLVM IR, information about application semantics (e.g., recursive data structures) is mostly lost. We plan to explore inter-procedural data structure analysis [22] to capture these semantics. There is also opportunity for languages whose memory semantics more closely match those of far memory, such as Rust, whose ownership model maps well to the notion of locality. High-level parallel languages, where ownership can fall out of language semantics [42], and partitioned global address space (PGAS) languages [9] could also map to compiler-based far memory.
Fetching remote data just to perform trivial computations is unwise. AIFM overcomes this by allowing library developers to manually offload such lightweight computations onto the remote node, thus employing near-data processing. We believe TrackFM could employ static analysis techniques, such as automated amortized resource analysis [15,30], to achieve the same goal. TrackFM could also benefit from a profiling stage that prunes the set of heap allocations available for remoting based on access frequency. For example, the MaPHeA framework leverages hardware performance monitoring to enable profile-guided optimization (PGO) that effectively places heap-allocated objects in heterogeneous memory [33]. Though this framework is built on gcc, we suspect incorporating a similar approach into the TrackFM middle-end transformations would be straightforward.

13 As in Intel's user-level interrupt vectoring introduced in the Sapphire Rapids microarchitecture [16].

Related Work
Prior work on far memory primarily falls along two lines: software-based and hardware-based. Hardware-based approaches center on the idea of removing the limitation of the architected page size [7,14,35]. On commodity machines, however, such specialized hardware is not yet an option. Prior work on improving software-based, programmer-transparent far memory focuses on overcoming the limitations of the kernel-based approach, either by using better prefetching strategies [2,6], by reducing page fault costs in the kernel [3], or by using high-performance networking [13]. Significant benefits are available when full programmer transparency is not a requirement, as shown by AIFM [35] and Carbink, which focuses on fault-tolerant far memory [47].
One way to improve on the kernel-based approach is to leverage a custom OS. DiLOS focuses on mitigating software overheads (especially of the paging subsystem) by building a LibOS specialized for disaggregated memory [44,45]. DiLOS, which builds on OSv [19], uses a custom, unified page table that incorporates remote page table entries in lieu of repurposing the traditional swap cache to track remote page state, thus reducing software overheads. This approach can actually outperform AIFM with sufficient prefetching, demonstrating that in some cases reducing page fault costs can counteract the negative effects of I/O amplification. However, even though DiLOS can run unmodified binaries (through POSIX compatibility), adopting a new OS can be a challenge. TrackFM, in contrast, runs on stock Linux without any changes.
Meta's production-scale far memory framework (TMO) leverages run-time information to transparently offload memory onto heterogeneous storage, and demonstrates that far memory pays off at scale [41].
Far memory systems share lineage with a large body of work on distributed shared memory (DSM), as these systems are similarly constrained by the architected page size. Thus, there is also work in this domain on avoiding page fault overheads. For example, Blizzard [37] and Shasta [36] work at sub-page granularity to mitigate false sharing. User-space approaches to DSM that leverage the compiler employ optimizations such as aggregation/hoisting of guards to reduce overheads [26,28,31]. However, these systems assume that an entire allocation is localized at once. In our system, chunks of a large allocation can be in independent states (local or remote), making hoisting optimizations more challenging. Prior approaches also assume that localized memory will not be evacuated again, which we must handle. Many of the optimizations applied in DSM systems relate to synchronization overheads and communication avoidance [8,11,24,46], which are not applicable to non-coherent far memory setups. TrackFM requires more careful analysis to reduce guard overheads, since the same assumptions made for user-space DSM systems do not apply.
While unrelated to far memory, we build on ideas from prior work on using the compiler to replace paging-based address translation, namely CARAT [38] and CARAT CAKE [39]. Table 4 compares TrackFM to the most closely related work.

Conclusion
We demonstrated that the compiler-based approach to far memory is a feasible path to automatically transform applications to leverage remote memory. We realized the compiler-based approach with a prototype system called TrackFM, and demonstrated how it can outperform the kernel-based approach by up to 2× by merely recompiling the application. Its performance comes within 10% of the best performing library-based approach, AIFM, but requires no modifications to application code. TrackFM simultaneously achieves programmer transparency and good performance by leveraging novel compiler analysis and transformation techniques, and by using the highly-optimized AIFM runtime as a backend.

A.7 Notes on Reusability
We provide several make scripts to automate building new applications with TrackFM. Instructions on how to use TrackFM for new applications are in the README at the top level of the TrackFM repository.

Figure 1. Users compile applications with TrackFM to run on a far memory cluster.

Figure 4. Left: control flow of a compiler-inserted guard check. Circles indicate conditional branches and squares indicate exit nodes. Each node is annotated with the number of x64 instructions executed. Right: guard lowered to x64 code. The vertical orange lines indicate the fast path (highlighted in blue on the left).

Figure 5. The loop chunking optimization eliminates fast-path guards within loops when object boundaries are not crossed. This trades a cheaper conditional branch inserted in every iteration (yellow) for a more expensive guard at object boundaries (orange).

Figure 6. Cost model to capture the point at which loop chunking becomes advantageous. The horizontal dotted line shows empirically when loop chunking benefits, and the vertical red line shows when the model predicts a beneficial outcome.

Figure 8. TrackFM can selectively apply the loop chunking optimization (as in k-means) to avoid collections with low object density.

Figure 10. Impact of object size on STREAM. Access patterns with high spatial locality benefit from the choice of a larger object size.

Figure 11. Speedup of prefetching coupled with loop chunking vs. loop chunking alone. The combination helps TrackFM extract more performance from workloads with spatial locality.

Figure 12. Speedup on STREAM relative to Fastswap with prefetching and loop chunking enabled. TrackFM's memory analysis helps to best exploit AIFM's high-performance prefetching.

Figure 13. Applications that access memory at small granularities suffer when limited by the architected page size.
Figure 14.

Figure 15. Applying the loop chunking optimization to low-density objects in the analytics application reduces performance.

Figure 16. Key-value stores with small object sizes and little spatial locality suffer from I/O amplification in Fastswap.
Figure 17.
Figure 3. The object state table caches AIFM object metadata (DS ID (8b), P, S, Obj. Size (16b), Obj. ID (38b)), reproduced from the AIFM paper. A single memory allocation can span multiple objects.

Table 1. TrackFM fast-path vs. slow-path guard costs when an object is local. Costs are reported in median cycles over 1000 trials.

Table 2. Comparison of primitive overheads for TrackFM and Fastswap. Costs are reported in median cycles over 1000 trials.

Table 4. Comparison of TrackFM with prior work.