Affinity Alloc: Taming Not-So Near-Data Computing

To mitigate the data movement bottleneck on large multicore systems, the near-data computing (NDC) paradigm offloads computation to where the data resides on-chip. The benefit of NDC heavily depends on spatial affinity, where all relevant data are in the same location, e.g., the same cache bank. However, existing NDC works lack a general and systematic solution: they either ignore the problem and abort NDC when there is no spatial affinity, or rely on error-prone manual data placement. Our insight is that the essential affinity relationship, i.e., that data A should be close to data B, is orthogonal to microarchitecture details and input sizes. By co-optimizing the data structure and capturing this general affinity information in the data allocation interface, the allocator can automatically optimize for data affinity and load balance to make NDC computations truly near data. With this insight, we propose affinity alloc, a general framework to optimize data layout for near-data computing. It comprises an extended allocator runtime, co-optimized data structures, and lightweight extensions to the OS and microarchitecture. Evaluated on parallel workloads across broad domains, affinity alloc achieves 2.26× speedup and 1.76× energy efficiency over a state-of-the-art near-data computing technique, with 72% traffic reduction.


INTRODUCTION
As systems scale aggressively in the number of cores and memory channels, data movement has increasingly become the bottleneck for the von Neumann architecture. To mitigate this, architects have proposed various near-data computing (NDC) techniques that offload computation to the memory hierarchy, e.g., the last-level cache (LLC) [10,37,75,93,96], on-chip network router [81], memory [5,6,41,43,44,56,58,60,89], storage [57,103], or multiple levels [36,62]. By not bringing the data all the way to the core, near-data computing can achieve multiplicative performance and energy efficiency improvements, and is the key to continued efficient scaling for future systems.
However, simply pushing computing into the memory hierarchy does not guarantee that computation is now closer to the data, especially when the computation accesses more than one contiguous piece of data. Without a suitable data layout, the required operands may be scattered far away from each other. Fig 1 demonstrates this: Fig 1(a) depicts a conventional system; Fig 1(b) shows an NDC vector addition, where arrays not aligned in memory cause extra communication to collect operands; and Fig 1(c) shows similar overheads for the indirect accesses that dominate graph processing workloads when accessing neighboring vertices. Naïvely offloading computation near data may yield no data movement reduction, or may even hurt performance, so an intelligent data layout decision is essential to fully realize the potential of near-data computing. Despite its importance, prior near-data computing work either relies on manual coarse-grained data partitioning on reserved scratchpad space using customized APIs [5,20,30,44,81], or requires domain-specific preprocessing (e.g., graph partitioning) [40,41]. Other work is simply oblivious to the data layout, and falls back to the conventional computing paradigm when NDC is not profitable [6,43,62,75,88,96]. They all fall short of providing a general and systematic solution to guided and efficient data layout.
Challenges: We provide the first general and programmable framework that automatically optimizes data layout for near-data computing. This is challenging, as a hypothetical optimal data layout would require coordination across the entire system stack: to support customized data placement in the microarchitecture, to manage virtual-to-physical address translation, to expose network topology to the software, etc. Clearly, such a complex approach is not ideal.
This calls for general yet concise abstractions at each level of the system to efficiently convey the information required for intelligent data placement decisions. For generality, the interface should be expressive enough to specify broad data layout requirements, from simple strided layouts to complex fine-grained pointer-based alignment.
For simplicity, the interface should only convey the minimal essential information across layers to maintain portability. This works in both directions: the software should be agnostic to the microarchitecture, while the hardware should be oblivious to the actual data structures. The interface should be compatible with general programming languages and be expressive enough to enable advanced layout optimizations for near-data computing.
Insight I: To tackle these challenges, our first insight is that data placement can and should be optimized with data allocation. This is possible because most data layout requirements are known at allocation time [66], e.g., when allocating a linked list node, the previous node is already allocated, and if the new node can be placed closer to it, we can significantly reduce data movement when chasing the pointer. Also, picking the optimal data layout at allocation time saves the overhead of remapping later. Lastly, it incurs marginal programming complexity if the allocator can be reused as the new data placement interface. However, existing data allocators are either unaware of data placement (e.g., malloc), or are imperative and opaque (e.g., numa_alloc_onnode), still leaving the placement decision to the programmer. We need a better allocator.
Insight II: Secondly, instead of directly dictating the data placement, the new allocator interface should capture the essential data alignment constraints for efficient near-data computing. Such constraints are general enough to describe complex data affinity relationships, e.g., the new linked list node should be close to the previous one. Also, they are determined by algorithms and data structures, but orthogonal to the microarchitecture details. This is crucial to maintain transparency and portability, freeing programmers from the burden of manual placement for each hardware generation.
Insight III: Perhaps most importantly, exposing a new allocator interface unlocks a variety of new opportunities to co-optimize the data structure for data affinity in NDC scenarios. For example, in graph algorithms, a global queue can be replaced by a spatially distributed queue to avoid remote accesses when pushing a new vertex into the frontier. Another example is using linked lists to replace the index array B[i] for indirect accesses A[B[i]]. Conventionally, traversing a linked list requires costly pointer chasing and is not as efficient as an array. However, it provides the flexibility to place the index closer to the destination data A[B[i]], and may yield higher performance in NDC. Such opportunities are impossible without the new allocator considering data placement.
Our Approach: To summarize, we name our approach affinity alloc, as it systematically captures and optimizes data affinity for near-data computing. It contains a carefully designed allocator interface to capture the affinity information, a runtime library to lower the alignment constraints to an efficient data layout based on the underlying hardware details, and a lightweight yet general microarchitectural scheme to control the data layout. This design enables significantly more flexibility than manual data placement: instead of fixing data structure locations, we only describe how data structure elements should be kept close together. More importantly, it enables co-optimization between data structures and data layout to make NDC computations truly near the data.
In this work, we apply affinity alloc to optimize data placement for near on-chip SRAM computing, i.e., at the last-level cache (LLC). The LLC level is promising because its capacity continues to scale in modern CPUs (768MB on AMD EPYC 7773X [1]), and many algorithms can be tiled for locality in the LLC. However, because affinity alloc addresses the fundamental data placement problem, its principles and implementation can be generalized to other near-data computing levels and techniques, e.g., near the memory controller, in an HMC die, or near storage.
Contributions: Evaluated on parallel workloads with a cycle-level simulator, affinity alloc achieves 2.26× speedup and 1.76× energy efficiency over a state-of-the-art near-data computing technique with 72% NoC traffic reduction, and 7.53× speedup and 4.69× energy efficiency over a wide OOO CPU. We also show that it is critical to codesign the data structure to optimize for data affinity in near-data computing. Specifically, our main contributions are:
• A general allocation interface to capture data alignment constraints for efficient near-data computing.
• A full-system implementation of affinity alloc, with a lightweight runtime library and architectural extensions.
• Software co-optimizations that leverage the new interface to fully realize the potential of near-data computing.
• A detailed evaluation of how the new interface helps optimize the data layout and improves near-data computing.
Paper Organization: §2 introduces the baseline NDC. §3 discusses the data layout challenges and overviews our approach. §4 covers the basic interface and extensions to support affine layouts, while §5 extends to irregular data layouts. Methodology and evaluation are in §6 and §7. Further discussion and related work are in §8 and §9.

BACKGROUND ON NEAR-DATA BASELINE

In this work, we leverage near-stream computing (NSC) [96] as the state-of-the-art baseline near-data computing framework. Here we give background on this framework and point out opportunities for affinity-aware allocation.

Basic Near-Stream Computing
Stream Definition: NSC leverages "streams" as the basic unit for near-data computing, which has been widely adopted in general-purpose computing [82,95,97] and reconfigurable accelerators [28,61,71,98,99]. Streams are defined by the long-term access pattern to the data structure, e.g., an affine pattern or pointer-chasing, and may contain NDC instructions. They are independently scheduled either at the core or near data at L3 banks: the core-side stream engine (SE_core) decides whether to execute a stream at the core (e.g., when the stream is short or has high reuse in the private cache) or offload it to the LLC. If offloading, it sends a configure packet to the L3 stream engine (SE_L3), which starts to access the L3 bank and performs NDC.
Affine Stream: Fig 1(a) shows a multicore system running vector addition. Tiles are connected by a mesh network, and each tile contains a core, private L1/L2 caches, and a shared L3 cache bank. One major overhead here is the unnecessary traffic to fetch and write back the arrays, which have no reuse at all. Such overhead only grows more severe as the system scales up and the data grows.
To mitigate such overhead, in Fig 2(a) the near-stream computing compiler recognizes that there are three affine streams: two load streams s_a = A[0:N], s_b = B[0:N], and one store stream s_c = C[0:N]. It also extracts and associates the computation (i.e., the addition) with the store stream s_c. This forms a stream dependence graph, in which edges represent the element-wise dependence between streams. In Fig 1(b), all streams are offloaded to the shared L3 banks where the data resides and automatically migrate to the next bank following the access pattern. Streams s_a and s_b directly forward their data to stream s_c. Stream s_c performs SIMD ops on a spare thread of the remote core and then writes directly to L3. An ideal data layout would colocate corresponding elements of the three arrays in the same bank to eliminate the data forwarding traffic (green and blue arrows in Fig 1(b)), which is the goal of this work.

Pointer-Chasing Stream: Fig 2(b) shows a linked list traversal. The pointer-chasing stream s_p = s_p.nxt can be offloaded to compare against the target t. It also checks the loop condition based on the next pointer and the comparison result. If evaluated to false, the stream is terminated, and the final value of hit is returned to the core. An ideal allocator would place neighboring nodes in the same or close banks to reduce the pointer-chasing distance.

Indirect Stream: Indirect accesses like A[B[i]] can also be offloaded. Fig 2(c) shows a push-based BFS kernel. The inner loop contains an indirect atomic access P[Edges[i]] to update the neighboring vertex's parent. Fig 1(c) shows the indirect stream offloaded along with the base stream: it reads the edge array Edges[], generates indirect addresses, and sends out indirect requests to target L3 banks. This eliminates the round-trip to the core for address generation. An affinity-aware allocator would place B[i] closer to the pointed-to A[B[i]] to reduce indirect traffic.

Near-Stream Computing Details
For completeness, here we include other details of NSC that are not required to understand affinity alloc. Readers should feel free to skip this subsection.

Synchronization:
The programmer annotates the loop with #pragma s_sync_free, indicating that there is no aliasing between core and stream, and that no sequential semantics are needed. This enables the compiler to eliminate the original loop. Synchronization between the core and offloaded streams now relies on a coarse-grained flow control scheme, i.e., one message contains credits for a few iterations. Similarly, context switch is possible by stopping issuing credits and waiting until all streams have reached the same point. Streams' progress is saved in the architectural state.
Predication: In Fig 2(c), if the vertex v has not been visited, i.e., the compare-and-swap (CAS) operation succeeded, it is pushed into the queue for future processing. This push operation is broken into two streams: an atomic stream s_t to increment the tail pointer of the queue, and a store stream s_q to write v into the queue. Both streams are predicated by the atomic CAS stream s_x, and will be skipped when s_x returns false.
Nested Stream: Finally, the inner loop streams in Fig 2(c) only take parameters from outer loop streams or loop-invariant values. Hence, the compiler can nest the inner loop streams into the outer loop streams. Now, every iteration of the outer loop stream can configure an instance of the inner loop stream. Predication to skip the inner loop for certain outer loop iterations is supported. More importantly, this unlocks more parallelism by allowing multiple instances of inner loop streams to be executed concurrently, as no sequential semantics are required here for correctness. To reduce overheads, synchronization between SE_core and SE_L3 is coarse-grained, i.e., one message for multiple iterations.
Both SE_core and SE_L3 contain ALUs to handle simple scalar operations, e.g., addition, multiplication, comparison, etc. More complex computations are outlined into a separate function and lowered into the native ISA by the compiler (x86 in this work). The stream configuration contains the function pointer, and the stream computing manager (SCM) assigns these functions to lightweight spare simultaneous multithreading (SMT) threads. Since there is no memory access nor control flow in near-stream computation, it can skip the LSQ and branch prediction. Near-stream computation can also be executed by special hardware, e.g., FPGAs [62], but this is beyond the scope of this work. For context switch, offloaded streams are terminated with progress recorded in architectural states. When switching back, streams resume execution in SE_core.

MOTIVATION AND OVERVIEW
Here we first motivate affinity alloc by examining the critical affine and irregular layout challenges in near-data computing. Then we overview how affinity alloc tackles these two challenges.

Affine Data Layout
We first consider a simple vector addition: C[i] = A[i] + B[i]. As shown in Fig 1(b) and Fig 3(a), when offloaded to the L3 cache, s_a and s_b forward the data to s_c, which writes back the added result. Intuitively, the placement of arrays A[], B[] and C[] in the shared L3 banks directly affects the data forwarding traffic and performance. Fig 3(a) shows a naïve affine data layout for the vector addition. For simplicity, we assume A[] and B[] are aligned in the shared L3 cache. However, since A[] and B[] are not aligned with C[], we have to forward both operands through the network, leading to not-so-near-data computing. Such oblivious data layouts may even lead to pathological cases. For example, in Fig 3(b), C[i] is mapped two banks behind A[i] and B[i], causing a bisection bottleneck in the network and significantly reducing the effective bandwidth.

Therefore, an intelligent near-data computing system should be aware of the data affinity requirement and colocate all three arrays, as shown in Fig 3(c). This eliminates the data forwarding traffic and fully unlocks the potential of near-data computing.
To quantify the impact of affine data layout, Fig 4 shows the performance and network traffic of vector addition with various data layouts, normalized to the in-core computing baseline (no offloading). We use an 8×8 mesh NoC and control the data layout to vary the distance between C[i] and the corresponding A[i] and B[i]. Although near-data computing always outperforms the baseline, its performance is very sensitive to the data layout (from 1.1× to 7.2×), as the layout dictates how much data traffic is needed to forward the operands. A random data layout (i.e., each virtual page is mapped to a random physical page) avoids the pathological behavior, but only achieves 42% of the performance of the aligned layout.
Challenges: Even for this simple case, optimizing the data layout already requires coordination across the whole system stack: to convey the data alignment requirement from the application, to translate virtual addresses in the OS, to control the physical cache line mapping in L3 banks, etc.

Irregular Data Layout
The analogous data layout problems for irregular data structures are even more complicated to solve. Fig 5(a) shows the baseline placement for a graph, using a compressed sparse row (CSR) format. We assume each cache line can hold two vertices (blue) or edges (green), and L3 banks are interleaved at cache line granularity. Many graph workloads (e.g., BFS, SSSP) scan edges and update pointed vertices. When offloaded in NSC, it takes 19 hops for indirect accesses to the vertices (green arrows) and 3 hops for stream migration (black arrows). However, as shown in Fig 5(b), if we can place the edges closer to the pointed-to vertices, we can significantly reduce the indirect access traffic to only 3 hops, at the cost of a slightly longer migration distance.

To quantify such benefit, Fig 6 shows the speedup and traffic reduction if we can break the edge list in the CSR format into chunks of various sizes and freely map them to the L3 bank with minimal indirect traffic (subject to a max 2% load imbalance between L3 banks, by moving chunks with the least traffic reduction to the least occupied bank). Smaller chunk sizes enable more fine-grained control of the data layout. With a 64B chunk (a cache line), irregular data layout optimization yields 60% traffic reduction and 2.14× speedup. An ideal configuration without indirect traffic achieves 4.1× speedup. This demonstrates the potential of having an optimal data layout for irregular data structures, including other pointer-based data structures, e.g., linked lists and trees. By optimizing the data layout, the overhead of irregular accesses can be significantly reduced.
Challenges: Although promising, irregular data layout is even more challenging, as it requires fine-grained cache line layout and load balancing to ensure bank-level parallelism.

Affinity Alloc Approach Overview
To exploit these opportunities, we propose affinity alloc, a systematic data placement solution that optimizes data affinity during allocation for near-data computing. Fig 7 overviews the approach across the different system levels.
Instead of an imperative interface that exposes microarchitectural details and leaves the placement to the programmer (e.g., libnuma), an affinity alloc application only needs to convey the affinity information through the declarative allocator API. For example, in Fig 7, when allocating a tree node, the pointer to the parent node is passed in so that the allocator can try to allocate the new node to the same bank as the parent node. Such affinity information is general enough to capture the essential relationship: that these pieces of data are used together and should be colocated.
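As a minimal sketch of this usage, consider the tree-node scenario from Fig 7, expressed with the irregular malloc_aff interface defined later in §5 (Fig 10); the TreeNode type and the alloc_child helper are our own illustrations, not the paper's API surface:

    // Assumed irregular allocation API (defined in Fig 10, Section 5).
    void *malloc_aff(unsigned size, int num_aff_addrs, void **aff_addrs);

    struct TreeNode {
      int key;
      TreeNode *left = nullptr, *right = nullptr;
    };

    // Allocate a child node with an affinity hint that it should be
    // colocated with its parent's L3 bank. The runtime may still spill
    // it to a nearby bank for load balance (Fig 7, node n7).
    TreeNode *alloc_child(TreeNode *parent, int key) {
      void *aff = parent; // affinity address: be close to the parent node
      TreeNode *n = (TreeNode *)malloc_aff(sizeof(TreeNode), 1, &aff);
      n->key = key;
      return n;
    }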
To coordinate affinity information across all system levels, affinity alloc is designed by the divide-and-conquer principle: each layer tackles a simpler subproblem, only minimal information is exchanged between layers, and each layer is almost transparent to the others. Specifically:
• Application: We choose to enhance the allocator with affinity information (either an affine pattern for affine layouts or a list of affinity addresses for irregular layouts). This significantly reduces the programming complexity, as affinity information can be straightforwardly extracted from the data structure, e.g., the parent node in the binary search tree. Also, since affinity information is purely determined by the algorithm and data structure but orthogonal to the underlying microarchitecture, portability is maintained by linking a platform-optimized runtime.
• Runtime: Similarly in Fig 7, the runtime is unaware of the data structure, but simply takes the affinity information and the underlying network topology to determine the interleaving and the bank to allocate from. It also tracks the load balance to avoid creating a hot spot in the system. For example, the node n2 is colocated with its parent n5 for affinity, while n7 is spilled to bank 1 for load balancing (see the bottom of Fig 7). To allocate, the runtime maintains a free list that is aware of the L3 banks and may request more space from the OS.
• OS: The OS simply manages a pool for each interleaving size. Interleave pools are reserved in the virtual address space when starting a program, and are backed by contiguous physical addresses, similar to a segment, when accessed. The OS also passes the topology information to the runtime, but is oblivious to the data structure or the load balance.
• Microarchitecture: It supports customizable interleaving for physical addresses within interleave pools, but is unaware of any program-specific details.
Data Structure Co-Optimization: Affinity alloc also enables novel data structure co-optimizations to harness the new opportunities from managing data affinity. One example, in the context of iterative graph processing, is a spatially distributed work queue that leverages the affine layout. Compared to a global queue, it reduces the overhead of managing the frontier in BFS and SSSP, as vertices can be pushed to the aligned local sub-queue with no remote accesses. This is possible in accelerators [2,26,47,48,72], but difficult for general-purpose processors without control over affinity.
Also, by supporting fine-grained irregular data layout, we can use a linked list to replace the array holding all edges in the compressed sparse row (CSR) format. This provides the flexibility to colocate edges with the outgoing vertices, reducing the indirect traffic. To our knowledge, this optimization has not been explored even for accelerators, because of the lack of fine-grained affinity control.
More generally, data structures for near-data computing face significantly different tradeoffs. While contiguous arrays often have the benefit of simple prefetching on general architectures, affinity-based allocation and near-data computing offer significant advantages to pointer-based structures. Thus, affinity alloc opens new opportunities for codesign in the near-data computing era, which would otherwise be impossible or impractical to program.
Affinity Alloc Overview: Overall, affinity alloc adopts a clean layered design: the application specifies the affinity information, the runtime performs the affinity-aware allocation with load balancing, the OS manages the pools with different interleaving sizes, and the microarchitecture simply customizes the interleaving for each pool. With these lightweight extensions and data structure co-optimization (see §5), affinity alloc provides a general and systematic solution to make NDC computations truly near data.

AFFINE DATA LAYOUT
In this section, we take a bottom-up view: first, how to efficiently support a customizable mapping from the virtual address space to L3 bank locations in the microarchitecture and OS; then, how the application and runtime leverage it to optimize for data affinity.

Mapping Virtual Addresses to L3 Banks
One obstacle to NDC data affinity optimization is that the mapping from virtual addresses to shared L3 banks is hidden from user space, or even from the OS. First, address translation is managed by the OS. Also, modern CPUs usually employ complex hash functions to map a physical address to an L3 bank [32] to exploit bank-level parallelism and avoid hot spots. Therefore, we need to expose the mapping from virtual addresses to L3 banks to the software.
Interleave Pool: As shown in Fig 7, we introduce interleave pools. Each interleave pool is a reserved segment in the virtual address space, and addresses within an interleave pool are guaranteed to be mapped to L3 banks with the specified interleaving. For example, 64B cache lines within the 64B interleave pool are linearly mapped to L3 banks one by one. Given a pool with interleaving g and starting virtual address v_start, we can compute the L3 bank for a given virtual address v within the pool:

bank(v) = ⌊(v − v_start) / g⌋ mod N_bank    (Eq. 1)

Similar to the heap, interleave pools are managed by the OS, and the runtime can request an expansion (similar to how mmap or brk is used to expand the heap). We provide a pool for each power-of-two interleaving from 64B (one cache line) to 4kB (one page; see below for larger interleavings), i.e., 7 interleave pools per process.
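As a minimal sketch, the query in Eq. 1 reduces to a shift and a mask when the interleaving and bank count are powers of two; the variable names below are ours:

    #include <cstdint>

    // L3 bank lookup for a virtual address v inside an interleave pool
    // starting at v_start with power-of-two interleaving g (bytes),
    // on a system with num_banks L3 banks (Eq. 1).
    uint32_t l3_bank(uint64_t v, uint64_t v_start, uint64_t g, uint32_t num_banks) {
      uint64_t chunk = (v - v_start) / g; // a right shift for power-of-two g
      return (uint32_t)(chunk % num_banks);
    }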
Physical Address: Each interleave pool is mapped to contiguous physical pages. To ensure this, when the OS handles a page fault on an unmapped interleave pool virtual address v, it will allocate physical pages from the start of that interleave pool until v, and may copy data and remap pages to make sufficient space (similar to how direct segment [11] or RMM [54] supports contiguous virtual-to-physical mapping). To complete the picture, the microarchitecture is extended with an interleave override table (IOT, Table 1) at each L2 and L3 cache controller.

Other Interleavings: Interleavings below the cache line size (64B) are not supported, as they would spread a single cache line across multiple L3 banks; this requires extra metadata to track sub-line coherence states and is beyond this work. Large interleavings beyond a page size (4kB) but aligned to page boundaries (e.g., 8kB, 12kB) are supported by mapping virtual pages to 4kB-interleaved physical pages at the desired L3 bank. Finally, interleavings that are not powers of two help reduce the padding overhead, and can be supported at the cost of a more complicated division instead of a right shift in Eq. 1 when querying the IOT. This is left as future work.
Other Interleave Patterns: The mapping from virtual addresses to L3 banks (i.e., Eq. 1) is a simple 1D linear pattern. More complicated interleave patterns can also be supported, e.g., a 2D pattern that fills L3 banks in the order of quadrants, or a two-level wrap-around that first wraps a few times within each row before moving to the next row. These more sophisticated interleave patterns can be supported by either changing how L3 banks are numbered or enhancing Eq. 1, and can provide more flexibility for the runtime to optimize the data layout. However, we find that a simple 1D linear pattern is expressive enough to achieve optimal spatial affinity for the affine workloads we studied.

Affine Layout Optimizations
With the OS and microarchitectural extensions exposing the mapping from virtual addresses to L3 banks, it is already possible for the application to customize the data layout. However, instead of leaving this burden to the programmer, we provide a runtime that automatically optimizes the data layout and requires only abstracted affinity information from the application.

Affine Affinity Alloc API:
Fig 8(a) shows the API to allocate an array with affinity information wrapped in the AffineArray struct. Besides the size of each element (elem_size) and the number of elements (num_elem), it also contains parameters that define the affinity relationship between arrays (orange box in Fig 8(a)).

Inter-Array Affine Affinity: Fig 8(b) shows how the API is used to optimize inter-array affine affinity. First, array A[N] is allocated with all default parameters, and the runtime simply picks the default interleaving, which is the cache line size (8B in Fig 8(b)). When allocating array B[N], we specify that B[i] aligns with A[i] by setting align_to to A. More generally, the affinity relationship between the allocating array B[N] and the aligned-to array A[N] is defined as:

B[i] aligns with A[(align_p × i) / align_q + align_x]    (Eq. 2)

Here align_p and align_q control the ratio between the aligned element indexes, and align_x adds the offset. Essentially, this is equivalent to defining an affine transformation j = (align_p / align_q) × i + align_x between the index spaces. These parameters can be straightforwardly determined from the access pattern, e.g., to align B[i] with A[4i+2], simply set align_p=4, align_q=1 and align_x=2.
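For illustration, a sketch of the allocation calls in Fig 8(b); the field order follows the initializers in Fig 9, but the exact AffineArray definition below is our assumption:

    #include <cstddef>

    struct AffineArray {
      size_t elem_size; // Size of each element.
      size_t num_elem;  // Number of elements.
      void *align_to;   // Aligned-to array (nullptr: no inter-array affinity).
      int align_p, align_q, align_x; // Eq. 2: B[i] ~ A[(p*i)/q + x].
      bool partition;   // Force even distribution across banks (see below).
    };
    void *malloc_aff(AffineArray desc); // Affine allocation entry point.

    constexpr size_t N = 1 << 20;
    // Fig 8(b): allocate A with defaults, then align B[i] with A[i].
    int *A = (int *)malloc_aff({sizeof(int), N, nullptr, 1, 1, 0, false});
    int *B = (int *)malloc_aff({sizeof(int), N, A, 1, 1, 0, false});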
The runtime records the metadata and selected layout of allocated arrays. When allocating a new array with inter-array affine affinity, it computes the interleaving of the new array by considering the ratio of element sizes and the interleaving of the aligned-to array. Specifically, the new array's interleaving is computed as:

g_new = g_align × (align_q / align_p) × (elem_new / elem_align)    (Eq. 3)

By factoring in the ratio of element sizes, the runtime chooses a 16B interleaving for the array double C[N] in Fig 8(b). From the perspective of L3 bank locality, this effectively converts the struct-of-arrays into an array-of-structs, with corresponding elements aligned within the same L3 bank to eliminate data forwarding traffic.
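A sketch of the runtime-side computation, under our reconstruction of Eq. 3 above (and assuming the aligned-to array in Fig 8(b) holds 4B elements, as the 8B-to-16B example suggests):

    #include <cstddef>

    // Compute the interleaving (bytes) for a new array aligned to an
    // existing one (Eq. 3). g_align: the aligned-to array's interleaving.
    size_t new_interleave(size_t g_align, size_t elem_new, size_t elem_align,
                          int align_p, int align_q) {
      // E.g., Fig 8(b): 8B * (8B double / 4B elem) = 16B for double C[N].
      // The caller then rounds this to a valid power-of-two interleaving,
      // falling back to the baseline allocator if no good fit exists.
      return g_align * (size_t)align_q * elem_new / ((size_t)align_p * elem_align);
    }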
Once the interleaving is determined, the runtime allocates from the corresponding interleave pool and ensures that the start bank is offset by align_x × elem_align / g_align. Notice that in certain cases the alignment is not perfect, i.e., when align_x × elem_align is not a multiple of g_align, or when we have to round the computed g_new to a valid interleaving supported by the system. However, such cases can be mitigated by padding the array and, in future work, by supporting interleavings that are not powers of two (as discussed above). Currently, in these cases, the runtime simply falls back to the baseline allocator without hurting the performance.
Freeing Data: Data allocated by malloc_aff() is freed with free_aff(void*) (omitted in Fig 8(a)). Since the runtime records the metadata for allocated arrays, it can put the space back into the free list like a normal allocator.

Intra-Array Affine Affinity:
We also support affinity within a single array. In Fig 8(c), we access the columns of the 2D array A[M,N] and hence want to optimize for affinity between rows. This can be done by setting align_to to nullptr and align_x to N. The runtime picks a valid interleaving that minimizes the Manhattan distance between A[i] and A[i+N]. For example, in Fig 8(c), one row of array A[M,N] is mapped to one row of the mesh topology, and the Manhattan distance is one hop to the bank below. When N is small, the runtime could also pick an interleaving that fits one or multiple rows into a single bank to further reduce the distance. Array B[M,N] is handled with inter-array affine affinity.
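Continuing the struct sketch above, the Fig 8(c) case could be expressed as follows (M, N and the field layout remain our assumptions):

    // Fig 8(c): optimize affinity between A[i] and A[i+N], i.e., between
    // adjacent rows of a row-major M x N array traversed by column:
    // no aligned-to array (nullptr), index offset align_x = N.
    double *A = (double *)malloc_aff({sizeof(double), (size_t)M * N, nullptr, 1, 1, (int)N, false});
    // B[M,N] is then handled with plain inter-array affinity to A.
    double *B = (double *)malloc_aff({sizeof(double), (size_t)M * N, A, 1, 1, 0, false});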

Distribute Partitions:
We deliberately design the interface to only specify the general affinity relationship, and delegate to the runtime the selection of a proper interleaving across platforms. However, the programmer may want a very coarse-grained interleaving, especially when distributing a partitioned array across banks. Since align_p/q/x can only specify the affinity information but not the interleaving, we add a partition flag to force an interleaving that evenly distributes the array across all banks. Fig 9 shows a common use case in graph processing, where the vertex array V[N] is partitioned among banks by setting partition to true.

Use Case: Spatially Distributed Queue: Another more sophisticated use case of affinity alloc is to implement a spatially distributed queue. In the push-based BFS in Fig 2(c), the updated vertex v is pushed into a global queue for future processing. However, the tail of the global queue and the writing position are not colocated with the vertex, requiring indirect traffic to push into the global queue. Instead, in Fig 9 we allocate a spatially distributed queue, with one sub-queue per partition. The tail pointer and data storage of each sub-queue are aligned with the vertex partition, and when pushing a vertex v, it is pushed to the local sub-queue with no indirect traffic (see the sketch below). Affinity alloc supports a mismatch between the number of partitions P and L3 banks B (i.e., P ≠ B), but having them equal yields better load balancing and higher performance. Priority queues, e.g., MultiQueues [79], can also be implemented as one queue per bank; heap rearrangement involves pointer-chasing, which is supported by NSC. This software optimization is not possible without affinity alloc's control over data alignment.
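A sketch of the push path this layout enables, with our own naming; it assumes each sub-queue holds N/P entries, matching the Fig 9 allocations of V, Q and T:

    #include <cstdint>

    // Push vertex v into the spatially distributed queue. Because Q and T
    // are bank-aligned with V (Fig 9), the sub-queue for v's partition
    // lives in the same L3 bank as V[v]: the push needs no remote traffic.
    void queue_push(int v, int *Q, int64_t *T, int N, int P) {
      int part = v / (N / P);                                    // v's partition.
      int64_t slot = __atomic_fetch_add(&T[part], 1, __ATOMIC_RELAXED);
      Q[(int64_t)part * (N / P) + slot] = v;                     // Local sub-queue write.
    }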

IRREGULAR DATA LAYOUT

Support Irregular Layout
While affine access patterns are relatively simple to optimize, irregular access patterns such as indirect and pointer-chasing accesses are data-dependent and notorious for low spatial locality. However, with a small extension to the API, we show that affinity alloc can optimize the data layout for irregular data structures without extra modification to the OS or microarchitecture.
Irregular Layout API: Fig 10 shows the irregular affinity allocation API and a function that appends a new node to a linked list using affinity alloc. In addition to the allocation size, the API can also take a list of affinity addresses that the newly allocated data should be close to; in the linked list example, it is the previous node prev. Affinity addresses should be within some interleave pool so that the runtime can infer the mapped L3 bank. This simple yet powerful API conveys sufficient information to the runtime to optimize for irregular affinity while remaining oblivious to the actual allocated data structure. We limit the maximal number of affinity addresses per allocation to 32, and the application can sample a subset if there are more affinity addresses.
Irregular Allocation: To allocate, the runtime rounds up the allocation size to a valid interleaving size. This usually incurs no overhead, as irregular data structures often use allocation sizes that are powers of two and aligned to cache line granularity to avoid false sharing. The runtime also maintains a free list for every valid interleaving size and every bank (see the sketch below). After selecting the bank to allocate from, based on the affinity addresses and load balance (see §5.2), the runtime allocates from the free list of that bank, and may ask the OS to expand the specific pool if it runs out of space. To free an object allocated with the irregular layout API, we reuse the same interface free_aff(void*). The runtime distinguishes irregular layout objects from affine arrays by checking whether the address matches an allocated affine array. The interleaving of the object can be directly inferred from the interleave pool it belongs to. Since irregular layout objects are allocated at interleave granularity, the runtime knows the size of the object and can free the space by adding it back to the free list. Unlike conventional allocators, the runtime maintains no metadata for irregular layout objects, avoiding space overheads for fine-grained allocations.
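A sketch of the bookkeeping this implies, with our own naming; one free list per valid interleaving size and per bank:

    #include <vector>

    // Per-(interleaving, bank) free lists for irregular allocation.
    // Objects are rounded up to a valid interleaving size, so freeing
    // needs no per-object metadata: the pool implies the object size.
    struct IrregularFreeLists {
      // freelists[interleave_idx][bank] -> LIFO of free chunks.
      std::vector<std::vector<std::vector<void *>>> freelists;

      void *alloc(int ilv_idx, int bank) {
        auto &fl = freelists[ilv_idx][bank];
        if (fl.empty()) return nullptr; // Caller expands the pool via the OS.
        void *p = fl.back();
        fl.pop_back();
        return p;
      }
      void free_chunk(int ilv_idx, int bank, void *p) {
        freelists[ilv_idx][bank].push_back(p);
      }
    };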
All modifications to support irregular data layout are limited to the application and runtime. The OS and microarchitecture only need to handle coarse-grained interleave pools and physical address ranges.

Bank Select Policy
Simply optimizing for data affinity may result in a pathologically unbalanced layout. For example, in the bottom left of Fig 10, the whole linked list is allocated to a single bank, leading to low bank-level parallelism and a high capacity miss rate. Therefore, we design the bank select policy to consider both data affinity and load balance. Specifically, the runtime computes a score for each bank b:

score_b = avg_hops_b + λ × allocs_b    (Eq. 4)

Here avg_hops_b is the average number of hops from bank b to the provided affinity addresses, and allocs_b is the number of irregular allocations to that bank. λ is a weight coefficient that controls how much the runtime should emphasize load balancing. The bank with the minimal score is selected. This score function is inspired by the one used by AB-NDP [89] to optimize task scheduling, while here we extend it for data allocation. We evaluate the sensitivity to λ in §7.
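A sketch of the policy as the runtime might implement it (our naming; the per-bank average hops would come from the mesh coordinates of the affinity addresses' banks):

    #include <cstdint>
    #include <vector>

    // Select the bank minimizing score_b = avg_hops_b + lambda * allocs_b (Eq. 4).
    // hops[b]: average Manhattan distance from bank b to the affinity addresses.
    // allocs[b]: number of irregular allocations already placed in bank b.
    int select_bank(const std::vector<double> &hops,
                    const std::vector<uint64_t> &allocs, double lambda) {
      int best = 0;
      double best_score = hops[0] + lambda * allocs[0];
      for (int b = 1; b < (int)hops.size(); ++b) {
        double score = hops[b] + lambda * allocs[b];
        if (score < best_score) { best_score = score; best = b; }
      }
      return best;
    }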

Data Structure Co-Optimization
Supporting irregular affinity allows the runtime to optimize the data layout for a variety of data structures, provided that they offer sufficient flexibility for data placement. This covers many important pointer-based data structures, e.g., linked lists and trees. Such data structures can benefit from affinity alloc without changing the data organization itself, simply by adopting the new allocator API.
Similarly, our approach opens up many new codesign opportunities for coarse-grained data structures that are not flexible enough to directly benefit from affinity alloc, e.g., the edge index array in the compressed sparse row (CSR) format. We introduce a linked CSR format (Fig 11), in which the edges are stored in a linked list, and we can place each edge list node closer to the pointed-to vertices by specifying the affinity addresses. This is how we achieve the optimizations discussed in Fig 5. It comes at the cost of extra pointer-chasing between nodes, which is usually much more expensive than the linear accesses in the original CSR format. However, we argue that the tradeoffs in near-data computing are very different:
1. Pointer-chasing overheads are amortized by the indirect traffic reduction, since each node can hold multiple edges. For example, a 64B cache line can hold 14 4B edges after the 8B pointer.
2. Unlike conventional CPUs, where the run-ahead distance is limited by the size of the reorder buffer (ROB), in NDC the pointer-chasing task can be decoupled and run ahead of the edge-processing task, further hiding the latency.
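A sketch of a linked CSR node under these constraints; the layout follows the 64B line / 8B pointer / 4B edge arithmetic above, while the helper, the vertex property array, and the sentinel convention are our assumptions:

    #include <cstdint>

    // Assumed irregular allocation API (Fig 10, Section 5).
    void *malloc_aff(unsigned size, int num_aff_addrs, void **aff_addrs);

    // One 64B cache line: an 8B next pointer plus up to 14 4B edges
    // (unused slots could hold a sentinel such as -1).
    struct alignas(64) EdgeListNode {
      EdgeListNode *next;
      int32_t edges[14];
    };
    static_assert(sizeof(EdgeListNode) == 64, "one cache line per node");

    // Allocate an edge node near the vertices its edges point to, using
    // up to 14 pointed-to vertex properties as affinity addresses.
    EdgeListNode *alloc_edge_node(const int32_t *dsts, int n, float *vertex_prop) {
      void *aff[14];
      int cnt = n < 14 ? n : 14;
      for (int i = 0; i < cnt; ++i) aff[i] = &vertex_prop[dsts[i]];
      return (EdgeListNode *)malloc_aff(sizeof(EdgeListNode), cnt, aff);
    }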
Most importantly, co-optimizing the data structure with affinity alloc unlocks the benefit of the fine-grained irregular layout at a low cost (O(|E|) to scan the edges once). Such co-optimization is the key to unlocking the full potential of near-data computing and can be applied to other domains and near-data computing systems.

METHODOLOGY

Compiler and Runtime:
We extend the open-source LLVM-based near-stream computing compiler [96] to support predication on streams and dynamic loop bounds (§2). Programs are compiled to x86 extended with near-stream computing instructions. We implement the affinity alloc runtime in C++ and manually replace the original malloc and free calls with the affinity alloc API.
Simulator: We use gem5 v20.0+ [63] for execution-driven, cycle-level simulation, extended with partial AVX-512 support. The caches are extended with NSC support and the interleave override table (IOT) to customize the interleaving between L3 banks. We emulate the syscall to expand interleave pools in gem5. We leverage McPAT [59] to estimate energy and area with a 22nm process.
Parameters and Configurations: Table 2 lists the system parameters. The only extension to the baseline near-stream computing system is the IOT to support customized L3 interleavings for interleave pools. The baseline OOO cores use advanced L1 and L2 prefetchers [8], but no computation is offloaded (labelled as In-Core in §7). For near-memory computing, Near-L3 offloads streams and the associated computation to SE_L3, but is oblivious to data affinity. For affinity alloc, we simulate the modified binary with affinity information conveyed to the runtime.
Benchmarks: We evaluate 10 OpenMP workloads compiled with -O3 and AVX-512, covering various affine and irregular data layouts (see the table below). For graph workloads, In-Core and Near-L3 use the original CSR format, while affinity alloc adopts the new linked CSR representation. For the pointer-chasing workloads, we randomly generate and insert the nodes into the binary tree without balancing it. Both link_list and hash_join search through linked lists, but link_list has much longer lists (512 nodes), while the buckets in hash_join are much smaller (≤8). The table below summarizes the input data size and the major data layout pattern for each benchmark. Some benchmarks have alternate implementations, i.e., push- and pull-based versions for page_rank and bfs. For page_rank, we added the push version besides the original pull-based implementation in the GAP suite [13], and select the best implementation for each configuration (the pull version for In-Core and the push version for Near-L3 and affinity alloc). For bfs, the state-of-the-art implementation dynamically switches between pushing and pulling based on runtime heuristics [12]. We discuss the tradeoffs between pushing and pulling and the heuristics we used in §7.

Benchmark        Layout      Parameters
pathfinder [22]  Affine      1.5M entries, 8 iters
srad [22]        Affine      1k×2k, 8 iters
hotspot [22]     Affine      2k×1k, 8 iters
hotspot3D [22]   Affine      256×1k×8, 8 iters
bfs [13]         Linked CSR  Kronecker generated,
pr_push [13]     Linked CSR    128k nodes, 4M edges,
sssp [13]        Linked CSR    A/B/C: 0.57/0.19/0.19,
pr_pull [13]     Linked CSR    sssp weight in [1, 11)

EVALUATION
We first evaluate affinity alloc on a variety of workloads, bank selection policies and input sizes to demonstrate the performance and energy efficiency benefits due to improved data affinity. We then perform a detailed study on how key graph processing workloads benefit from codesigning the data structure with affinity alloc.

General Evaluation

Overall Performance: Fig 12 shows the overall performance for all benchmarks. The speedup and energy efficiency are normalized to Near-L3, while the NoC traffic is normalized to In-Core, where no computation is offloaded to the L3 cache. Overall, affinity alloc achieves 7.53× speedup and 4.69× energy efficiency over In-Core, and 2.26×/1.76× over Near-L3. The benefit comes from the reduced NoC traffic for various messages: the data traffic to forward operands in affine workloads (e.g., stencil1d), the control traffic to perform indirect remote accesses in graph workloads, as well as the stream migration traffic to chase pointers in pointer-based data structures. Overall, affinity alloc reduces the network traffic by 72% and 87% over Near-L3 and In-Core respectively, with 34% NoC utilization.

For the microarchitecture, affinity alloc only introduces a small interleave override table (IOT). Estimated with CACTI 7 [9], the IOT takes 32kB (512B per bank) and accounts for 0.07mm², less than 0.1‰ of the whole chip.
Bank Selection Policy: Fig 13 shows the speedup and NoC traffic when affinity alloc employs different bank selection policies for the irregular data layout, normalized to Rnd, which randomly selects the bank to allocate. Lnr selects the bank in a round-robin fashion, while Min-Hop always picks the bank with the least distance to the affinity addresses (same as setting λ = 0 in Eq. 4). We also evaluate the hybrid policy that considers both affinity information and load balance with various λ, labeled as Hybrid-λ. A higher λ forces the policy to favor the less occupied bank to balance the load.
As expected, Rnd and Lnr are oblivious to the affinity information and achieve similar performance. Lnr only outperforms Rnd by 25% on link_list: since we allocate the nodes one by one, Lnr places each node in the next bank, reducing the pointer-chasing distance (about 60% traffic reduction). However, this is not optimal compared to colocating neighboring nodes in the same bank, which eliminates the need to migrate; moreover, linear allocation only helps when the allocation order happens to match the traversal order.

On the other hand, Min-Hop optimizes the data affinity and achieves significant speedup and traffic reduction on most benchmarks. However, since it does not consider the load balance, it may produce a pathological data layout. For example, in bin_tree it allocates the entire tree to a single bank. Although it successfully eliminates the migration traffic (much less offload traffic in Fig 13), it dramatically increases the miss rate of that L3 bank and results in a huge slowdown.
The hybrid policy Hybrid-λ avoids such pathological cases by allocating to less occupied banks to balance the load. It also achieves better bank-level parallelism and improves performance over Min-Hop. To see this, Fig 14 shows a timeline of the number of atomic streams per L3 bank in bfs_push for Rnd, Min-Hop and Hybrid-5. We show the distribution by plotting the number of atomic streams from the least to the most occupied bank; for example, the 25% line indicates that 75% of banks have higher occupancy. Rnd has higher stream occupancy, as it takes much longer for each stream to finish the indirect atomic access. Hybrid-5 achieves better load balancing than Min-Hop, as shown by its higher 25% line. Overall, Hybrid-5 achieves the highest performance with slightly more traffic, and is chosen as the default policy.
Large Input Size: Fig 15 shows the speedup and L3 miss rate of affine workloads when scaling up the input size. Since this work focuses on near-cache computing, the benefits of affinity alloc drop significantly when the working set cannot fit in the cache (>75% L3 miss rate for 8× input size). Fig 16 shows the same evaluation on graph workloads. We scale up the graph by increasing the number of vertices while keeping the average vertex degree the same. Due to the irregular access pattern, we can get some reuse on the vertex properties, leading to a <20% L3 miss rate; therefore, affinity alloc still yields some performance improvement for the 8× graph. When |V| = 2^18, the graph can still fit in the L3 cache for pr_push and bfs, but not for sssp, due to the extra edge weights. The implication is that the already common optimization of tiling and partitioning for the on-chip cache becomes even more important. Also, as the on-chip cache continues to scale up (768MB on AMD EPYC 7773X [1]), the number of tiles required can be reduced (hence fewer overheads). This is orthogonal to this work. When there is no reuse at all on the chip, future work could apply affinity alloc to align data in DRAM to benefit NDC techniques near the memory controller or inside DRAM.

Graph Processing
Graph processing contains heavy indirect accesses and benefits from the improved data affinity provided by affinity alloc. Here we evaluate codesigning the algorithm in NDC scenarios, as well as sensitivity to graph structures.
Pushing vs. Pulling: The graph processing algorithms page_rank and bfs have both push-based and pull-based implementations, with different trade-offs: the pushing (i.e., top-down) approach propagates updates to outgoing neighbors and is implemented with atomic accesses, while the pulling (i.e., bottom-up) approach queries incoming neighbors and involves reduction. Near-data computing naturally supports remote atomic accesses, but suffers from indirect reduction, which requires collecting operands distributed among LLC banks. On the other hand, general-purpose processors can perform efficient reduction using registers, but suffer from many coherence misses when contention on atomic accesses is high. Overall, we observe that near-data computing usually favors the push-based implementation, while in-core computing works better with the pull-based one. In our evaluation, this is the default choice for page_rank, in which all edges are active and processed in each iteration. However, in bfs, each iteration has different characteristics and may benefit from per-iteration choices between pushing and pulling [12]. As expected, for In-Core, pushing works well for the first and last few iterations, as there are few active nodes and therefore fewer coherence misses compared to the middle iterations. Iterations in the middle (Iter2, Iter3 and Iter4 of In-Core in Fig 18) favor pulling, as it avoids the overheads of coherence misses on contended vertices. More generally, the number of scout edges represents the number of pushing operations in the next iteration, and the default bfs implementation in the GAP suite [13] switches to pulling if the ratio of scout edges exceeds a threshold.
This trade-off is different in near-data computing, as it is much cheaper to perform in-place atomic operations in L3 without the overheads of coherence misses. Affinity alloc improves the spatial locality and further reduces the overheads of remote atomic accesses. Therefore, near-data computing chooses pushing for more iterations. For example, in Aff-Alloc only Iter3 uses pulling in Fig 18: it suffers from excessive failed compare-and-exchange operations on visited vertices and has a much lower active node ratio compared to the scout edge ratio in the previous iteration in Fig 17. We adopt this insight and extend the default switching policy to estimate the chance of failed atomic operations by taking into account the ratio of visited vertices for Aff-Alloc (a sketch follows):
• Push → Pull: Visited Node > 40% and Scout Edge > 6%.
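A sketch of the extended direction-switching check, with the thresholds from above; the function and parameter names are ours, and the counters are assumed available per iteration:

    // Decide whether to switch BFS from pushing to pulling (Aff-Alloc policy).
    // visited_ratio: fraction of vertices already visited.
    // scout_edge_ratio: scout edges / total edges, as in GAP's heuristic [12].
    bool switch_push_to_pull(double visited_ratio, double scout_edge_ratio) {
      // Pull only when failed CAS operations on visited vertices would
      // dominate: Visited Node > 40% and Scout Edge > 6%.
      return visited_ratio > 0.40 && scout_edge_ratio > 0.06;
    }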
We find this policy robust across all evaluated graphs. This study and the linked CSR format show that NDC poses many different trade-offs that require software and data structure codesign. To quantify this, Fig 19 shows the speedup of affinity alloc on various synthesized power-law graphs, normalized to Rnd. We fix the total number of edges but change the average node degree d. Affinity alloc actually achieves higher speedup on high-degree graphs (1.5× when d = 4 and 2.4× when d = 128). This is because the edge list is sorted by outgoing vertex id (as is common practice), and the longer the edge list, the more likely that the outgoing vertices of edges within one cache line are mapped to the same or neighboring banks. We believe affinity alloc provides a new angle to co-optimize NDC and data structures.
Real-World Graphs: We also evaluate affinity alloc on real-world social network graphs; Table 4 lists the detailed information. These power-law graphs have a high average degree and are hard to partition. Fig 20 shows the speedup and traffic reduction of affinity alloc on these graphs, normalized to Near-L3. Overall, affinity alloc successfully optimizes the fine-grained irregular data layout, and Hybrid-5 achieves 2.0× speedup over Near-L3. This clearly demonstrates the benefit of co-optimizing the data structure and affinity data layout for near-data computing.

DISCUSSION
Dynamic Data Structures: Although this work focuses on static data structures (i.e., unchanged after creation), it is an interesting direction to apply affinity alloc to dynamic data structures, especially pointer-based ones (e.g., trees, linked CSR). A particular example is dynamic graph processing [3,33,45,50,85], which queries evolving graphs. In this work we extend the static CSR format with pointers to provide the flexibility to support irregular layout optimization, which needs some preprocessing. However, some prior works already leverage pointer-based data structures similar to linked CSR to flexibly insert into and delete from the graph [46,74]; these could naturally benefit from the improved spatial locality of affinity alloc without extra preprocessing.
Generally, if the affinity requirement changes, e.g., when reinserting a tree node at a different location, the previous layout choice becomes suboptimal. If the runtime is aware of the data structure modification, e.g., via realloc(), the layout could be dynamically adjusted, or fall back to the default random layout if the dynamic remapping overhead is intolerable. This is left as future work.
Fragmentation: One major challenge for dynamic allocation is handling fragmentation. In principle, the major source of fragmentation is that freed space in an interleave pool is limited to allocations with the same interleaving requirement (the OS can still reclaim pages at both ends by shrinking the interleave pool). For example, consider three consecutively allocated arrays A[], B[] and C[] in the same interleave pool. The free space from releasing B[] can only be reused for data structures with the same interleaving, as interleave pools are backed by contiguous physical addresses. However, this fragmentation was not seen in our static application set. A software solution is to compact the pool. Another possibility is to dynamically break and merge interleave pools of the same interleaving. In the above example, the single interleave pool can be split into two, one for A[] and one for C[], and the free space in between can be reclaimed for other interleavings or normal allocations without the overhead of copying and compacting. This requires a larger interleave override table (IOT) in the microarchitecture, similar to prior works (e.g., RMM [54] has 32 range entries vs. 7 interleave pools in this work).

RELATED WORK
Multicore Caching and Dynamic Data Layout: Multicore caches are physically distributed, giving rise to non-uniform cache access (NUCA) [55]. Many dynamic NUCA (D-NUCA) designs have been proposed that change the data layout to reduce data movement [7, 14-18, 21, 23-25, 31, 39, 51, 65, 83, 90, 91]. Unlike affinity alloc, these designs do not offload computation near data; rather, they move frequently accessed data closer to the cores accessing it. Several limitations make D-NUCA schemes hard to apply to near-data computing. Early D-NUCAs treated the on-chip cache banks as a hierarchy, gradually migrating data closer to the cores that access it [7,16,21,25,39,51]. These designs require another layer of directories to locate data dynamically; as a result, most accesses still require an expensive global lookup, eliminating most of the benefit of adapting the data layout. Later D-NUCAs control data layout via the virtual memory system (i.e., page table and TLBs) so that no additional directory lookup is required [7, 16-18, 35, 83, 84, 90, 91]. These single-lookup D-NUCAs significantly reduce data movement, but can only control data layout at page granularity, which we have shown is insufficient (Fig 6). Hotpad [92] designs a scratchpad hierarchy for managed languages (e.g., Java), but does not optimize for data affinity among banks.
Whirlpool [66] is a D-NUCA that controls data layout via the memory allocator, similar to affinity alloc. Whirlpool uses the memory allocator to separate data into different "pools" and uses a different data layout for each pool, letting the cache separate data with different access patterns. By contrast, affinity alloc lets programmers express the affinity between related data and controls the layout so that related data is placed at the same location.
None of these works support single-lookup for fine-grained irregular affinity, nor do they explore the benefit of co-optimizing the software to enable flexible data placement.
The scope of near-data computing can be broadened beyond vertical memory-hierarchy offloading to techniques with only a horizontal dimension, i.e., those that can map tasks to different locations depending on locality. This includes works from the Swarm family of ordered-algorithm accelerators [2,48,49,76,87] that use task hints to map tasks near data [47,100]. Several prior multicore accelerators [5,72,73] and reconfigurable architectures [26,27,69,70] have this capability, and most vertical near-data architectures have a horizontal aspect. We focus on improving the effectiveness of horizontal near-data computing, but future work could also optimize vertically across levels.
Many of these works are oblivious to the data layout and take a best-effort approach, falling back to conventional execution when near-data computing is not profitable, e.g., [6,29,43,52,62,75,88,96]. Other techniques require manual data placement using imperative APIs, e.g., [5,20,30,34,44,81]. Hong et al. [40] organize linked list nodes into the same HMC vault, and Gearbox [58] performs hybrid partitioning for SpMV and SpMSpV. These techniques are limited to a specific domain or to affine workloads. Another line of work [52,88] leverages the compiler to reschedule computation to optimize the arrival window in NDC; however, it leaves the mapping between the address space and cache banks as future work. Kandemir [53] proposes loop transformations to reduce the spatial reuse distance for affine loops. Although this does not handle irregular accesses, it could be combined with affinity alloc to handle some tricky cases with less user intervention, e.g., transforming the loop to simplify the affinity requirements.
Affinity alloc is orthogonal to these techniques: it tackles the fundamental data layout problem in a systematic and programmable fashion. These near-data techniques could all benefit from an affinity alloc-like approach to improve data affinity. It is future work to extend affinity allocation to consider multiple memory hierarchy levels simultaneously.

CONCLUSION
This work systematically addresses the data layout problem in NDC by constructing a clean layered design across the system stack. The application only needs to specify the essential affinity information through the extended allocator interface, and the runtime automatically optimizes for data affinity and load balance. More importantly, affinity alloc opens up a new design space to co-optimize data structures with data affinity. This is a first but critical step toward revisiting many tradeoffs and realizing the full potential of the near-data computing paradigm, where computation is truly near the data.
Figure 1: Extra messages (in red) that could be eliminated by data affinity optimization.

Figure 2: Example Near-Stream Computing Programs

Figure 3: Affine Data Layout for Vec Add

Figure 4: Impact of Affine Data Layout on Vec Add

Figure 5: Irregular Data Layout for Graph Edge List

Figure 6: Impact of Irregular Data Layout
Fig 8(a) shows the API to allocate an array with affinity information wrapped in the AffineArray struct. Besides the size of each element (elem_size) and the number of elements (num_elem), it also contains parameters to define the affinity relationship between arrays (orange box in Fig 8(a)).
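A minimal C sketch of what the AffineArray descriptor could look like, based only on the fields named in the text (elem_size, num_elem, align_to, partition); any other detail, including the alloc_affine() entry point, is an assumption for illustration:

#include <stdbool.h>
#include <stddef.h>

// Sketch of the affine allocation descriptor from Fig 8(a).
typedef struct {
  size_t elem_size;  // Size of each element in bytes.
  size_t num_elem;   // Number of elements.
  void  *align_to;   // Base of an existing array to align with (NULL = none).
  bool   partition;  // Partition contiguously among banks (see Fig 9).
} AffineArray;

void *alloc_affine(const AffineArray *desc);  // hypothetical entry point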

Figure 8: Affine Data Layout Optimizations

Inter-Array Affine Affinity: Fig 8(b) shows how the API is used to optimize inter-array affine affinity. First, array A[N] is allocated with all default parameters, and the runtime simply picks the default interleaving, which is the cache line size (8B in Fig 8(b)). When allocating array B[N], we specify that B[i] aligns with A[i] by setting align_to to A. More generally, the affinity relationship between the allocating array B[N] and the aligned-to array A[N] is defined by a general affine mapping between their indices.
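A usage sketch for the inter-array case in Fig 8(b), reusing the hypothetical descriptor and alloc_affine() from the sketch above:

// Allocate A[N] with defaults, then B[N] aligned to it.
void alloc_aligned_pair(size_t N, double **A, double **B) {
  AffineArray da = { .elem_size = sizeof(double), .num_elem = N };
  *A = alloc_affine(&da);              // runtime picks default interleave
  AffineArray db = { .elem_size = sizeof(double), .num_elem = N,
                     .align_to = *A }; // B[i] lands in the same bank as A[i]
  *B = alloc_affine(&db);
}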

Figure 9: Distribute Partitions

Fig 9 shows a common use case in graph processing, where the vertex array V[N] is partitioned among banks by setting partition to true.

Use Case: Spatially Distributed Queue: Another, more sophisticated use case of affinity alloc is to implement a spatially distributed queue. In the push-based BFS in Fig 2(c), each updated vertex v is pushed into a global queue for future processing. However, the tail of the global queue and the writing position are not colocated with the vertex, requiring indirect traffic to push into the global queue. Instead, in Fig 9 we allocate a spatially distributed queue, with one sub-queue per partition. The tail pointer and data storage of each sub-queue can then be colocated with its partition. The irregular layout API, and an example that appends a linked-list node near its predecessor:

void* malloc_aff(uint size,         // Alloc size.
                 int num_aff_addrs, // Specify affinity addrs.
                 void** aff_addrs);

void linked_list_append(Node *prev, T v) {
  // Allocate new node near to prev.
  Node *n = malloc_aff(sizeof(Node), 1, (void**)&prev);
  n->v = v;
  n->nxt = prev->nxt;
  prev->nxt = n;
}
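Under the same assumptions, a sketch of how a per-partition sub-queue turns the push into a bank-local access; SubQueue, part_of(), and the capacity handling are all illustrative, not the paper's code:

#define SUBQ_CAP 1024  // illustrative capacity

typedef struct {
  int tail;
  int data[SUBQ_CAP];
} SubQueue;

// queues[p] is allocated in the same bank as partition p (e.g. via the
// affinity APIs above), so the push below never crosses the NoC.
extern SubQueue *queues;
extern int part_of(int v);  // hypothetical vertex -> partition map

void queue_push(int v) {
  SubQueue *q = &queues[part_of(v)];
  if (q->tail < SUBQ_CAP)
    q->data[q->tail++] = v;  // tail pointer and storage are bank-local
}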

Figure 11: Linked CSR Format

Free Data: To free an object allocated with the irregular layout API, we reuse the same interface, free_aff(void*). The runtime distinguishes irregular layout objects from affine arrays by checking whether the address matches an allocated affine array. The interleaving of the object can be directly inferred from the interleave pool it belongs to. Since irregular layout objects are allocated at interleave granularity, the runtime knows the size of the object and can free the space by adding it back to the free list. Unlike conventional allocators, the runtime maintains no metadata for irregular layout objects, avoiding space overheads for fine-grained allocations. All modifications to support the irregular data layout are limited to the application and the runtime; the OS and microarchitecture only need to handle coarse-grained interleave pools and physical address ranges.

Figure 12: Overall Performance and Traffic Reduction

7.1 General Evaluation

Overall Performance: Fig 12 shows the overall performance for all benchmarks. The speedup and energy efficiency are normalized to Near-L3, while the NoC traffic is normalized to In-Core, where no computation is offloaded to the L3 cache. Overall, affinity alloc achieves 7.53× speedup and 4.69× energy efficiency over In-Core, and 2.26×/1.76× over Near-L3. The benefit comes from the reduced NoC traffic for various messages: the data traffic to forward operands in affine workloads (e.g. stencil1d), the control traffic to perform indirect remote accesses in graph workloads, as well as the stream migration traffic to chase pointers in pointer-based data structures. Overall, affinity alloc reduces network traffic by 72% and 87% over Near-L3 and In-Core respectively, with 34% NoC utilization. For the microarchitecture, affinity alloc only introduces a small interleave override table (IOT). Estimated with CACTI 7 [9], the IOT takes 32kB (512B per bank) and accounts for 0.07 mm², less than 0.1‰ of the whole chip.

Figure 13: Sensitivity on Irregular Layout Policies

Figure 14: Distribution of Atomic Streams in BFS-Push

likely the case in real production scenarios, and when list nodes are inserted randomly, Lnr behaves the same as Rnd. On the other hand, Min-Hop optimizes for data affinity and achieves significant speedup and traffic reduction on most benchmarks. However, since it does not consider load balance, it may produce pathological data layouts. For example, in bin_tree it allocates the entire tree to a single bank. Although this successfully eliminates the migration traffic (much less offload traffic in Fig 13), it dramatically increases the miss rate of that L3 bank and results in a huge slowdown. The hybrid policy Hybrid-H avoids such pathological cases by allocating to less occupied banks to balance the load. It also achieves better bank-level parallelism and improves performance over Min-Hop. To see this, Fig 14 shows the timeline of the number of atomic streams per L3 bank in bfs_push for Rnd, Min-Hop, and Hybrid-5. We show the distribution by plotting the number of atomic streams from the least to the most occupied bank. For example, the 25% line indicates that 75% of banks have higher occupancy. Rnd has higher stream occupancy, as each stream takes much longer to finish its indirect atomic access. Hybrid-5 achieves better load balance.
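One plausible reading of the Hybrid-H policy, sketched for clarity: the text says it falls back to less occupied banks, so we assume H is a headroom threshold (e.g. Hybrid-5); the exact heuristic is not spelled out here and this is our guess, not the paper's algorithm:

// Pick the minimal-hop bank unless it is more than H allocations busier
// than the least occupied bank; in that case, rebalance.
int hybrid_pick(int min_hop_bank, const int *occ, int num_banks, int H) {
  int least = 0;
  for (int b = 1; b < num_banks; b++)
    if (occ[b] < occ[least]) least = b;
  return (occ[min_hop_bank] - occ[least] > H) ? least : min_hop_bank;
}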

Figure 15: Speedup of Affine Layout on Large Inputs

Figure 18: BFS Push vs. Pull Timeline

Fig 17 shows three key characteristics for iteration i: Visited Nodes, the total visited nodes after iteration i; Active Nodes, the nodes visited during iteration i; and Scout Edges, the outgoing edges from active nodes in iteration i. All three are normalized to the total number of nodes or outgoing edges in the graph. Fig 18 shows the timeline of bfs using push-only, pull-only, and a switching policy.
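The switching policy in Fig 18 resembles classic direction-optimizing BFS; a sketch of that well-known heuristic, where the alpha/beta thresholds are illustrative defaults and not the paper's parameters:

typedef enum { PUSH, PULL } Dir;

// Switch push->pull when the frontier's scout edges dominate the remaining
// unexplored edges; switch pull->push when the frontier shrinks again.
Dir next_dir(Dir cur, long scout_edges, long unexplored_edges,
             long active_nodes, long total_nodes) {
  const int alpha = 14, beta = 24;  // illustrative thresholds
  if (cur == PUSH && scout_edges > unexplored_edges / alpha) return PULL;
  if (cur == PULL && active_nodes < total_nodes / beta) return PUSH;
  return cur;
}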

Figure 20: Performance on Real World Graphs

Table 1: Interleave Override Table (IOT)

Each entry overrides the interleave for physical addresses within [start, end). The L2/L3 cache controllers as well as SE_core/SE_L3 query this table to determine which bank a cache line is mapped to, so that they can forward the request or offload/migrate the stream. Since this table is accessed frequently (on every L2 miss and L3 access), mapping each interleave pool to contiguous physical addresses ensures that only one IOT entry is required per interleave pool, reducing the pressure on the IOT size.
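A sketch of how the bank lookup might consult the IOT on each L2 miss or L3 access; the entry layout follows the [start, end) description above, while the defaults and bank math are assumptions:

#include <stdint.h>

#define NUM_L3_BANKS 64  // illustrative

typedef struct {
  uint64_t start, end;   // physical address range [start, end)
  uint32_t interleave;   // overriding interleave for this pool
} IOTEntry;

uint32_t bank_of(uint64_t paddr, const IOTEntry *iot, int n,
                 uint32_t default_interleave) {
  uint32_t ilv = default_interleave;   // e.g. one cache line
  for (int i = 0; i < n; i++)          // one entry per interleave pool
    if (paddr >= iot[i].start && paddr < iot[i].end) {
      ilv = iot[i].interleave;
      break;
    }
  return (uint32_t)((paddr / ilv) % NUM_L3_BANKS);
}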

Table 2: System and Arch Parameters (cy.: cycle)

Table 3: Workload Parameters

B[] in A[B[i]] can only be remapped at page granularity, with marginal performance gain (Fig 6 on page 5). In this work, we focus on co-designing graph representations to optimize data affinity. Fig 11 shows a toy undirected graph and the original compressed sparse row (CSR) format. In the CSR format, each vertex has an index pointing to its first edge. However, since the edges are stored in a single array, we can only optimize data affinity at a very coarse granularity, i.e. by partitioning the graph among banks with the affine layout API. Power-law graphs, however, are hard to partition, leaving many inter-partition edges. We need more flexibility in the data structure to optimize data affinity at a finer granularity. This motivates a Linked CSR format (Fig 11).
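A hedged sketch of what a Linked CSR edge chunk could look like: the edge array is broken into small chunks that can be individually placed (e.g. via malloc_aff) near the vertices they point to; the chunk size and field names are assumptions:

#define EDGES_PER_CHUNK 6  // illustrative; e.g. sized to one interleave unit

typedef struct EdgeChunk {
  int num_edges;                 // edges used in this chunk
  int dst[EDGES_PER_CHUNK];      // destination vertex ids
  struct EdgeChunk *next;        // next chunk of this vertex's edge list
} EdgeChunk;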

Table 4: Real World Graphs

Figure 19: Speedup vs. Avg. Node Degree

Sensitivity to Node Degree: One fundamental difference between affinity alloc and a conventional graph partitioning scheme is the optimization granularity. Conventional graph partitioning divides the graph into a few coarse-grained subgraphs and usually struggles on high-degree graphs. On the other hand, by co-optimizing the data structure, affinity alloc can optimize data affinity at cache line granularity and scales well with connectivity.