At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads

Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method oblivious to the memory subsystem to gauge the upper-bound in performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of experiments involving a broad set of proxy-applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and conclude an average boost of 9.56× for cache-sensitive HPC applications, on a per-chip basis. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.


INTRODUCTION
Historically, the reliable performance increase of von Neumann-based general-purpose processors (CPUs) was driven by two technological trends. The first, observed by Gordon E. Moore [76], is that the number of transistors in an integrated circuit doubles roughly every two years. The second, called Dennard's scaling [30], postulates that as transistors get smaller their power density stays constant. These trends synergized well, allowing computer architectures to continuously improve performance through, for example, aggressive pipelining and superscalar techniques without running into thermal limitations by, e.g., reducing the operating voltage. In the early 2000s, Dennard's scaling ended [51] and forced architects to shift their attention from improving instruction-level parallelism to exploiting on-chip multiple-instruction multiple-data parallelism [43]. This immediate remedy to the end of Dennard's scaling applies to this day in the form of processors such as Fujitsu A64FX [96], AMD Ryzen [105], or NVIDIA GPUs [79,86].
Unfortunately, Moore's law is approaching its end [107], and we are entering a post-Moore era [112], home to a diversity of architectures, such as quantum, neuromorphic, or reconfigurable computing [49]. Many of these prototypes hold promise but are still immature, focus on a niche use case, or incur long development cycles. However, there is one salient solution that is growing in maturity and which can facilitate performance improvements in the decades to come, even for the classic von Neumann CPUs we have come to rely upon: 3D integrated circuit (IC) stacking [14]. 3D ICs refer to the general technologies of vertically building integrated circuits, which can be done in multiple ways, such as by stacking multiple discrete dies and connecting them using coarse through-silicon vias (TSVs) or by growing the 3D integrated circuit monolithically on the wafer [100].
Recent advances in 3D integrated circuits have enabled many times higher capacity for on-chip memory (caches) than traditional systems (e.g., AMD V-Cache [40]). Intuition tells us that an increased cache size, resulting from 3D-stacking, will help alleviate the performance bottlenecks of key scientific applications. To demonstrate this, we conduct a pilot study where we execute one of the important proxy-apps from the DoE ExaScale Computing Project (ECP) suite, MiniFE [50] (cf. Section 3.3), on AMD EPYC Milan and Milan-X CPUs, two architecturally similar processors with vastly different L3 cache sizes [17]. Figure 1 summarizes the result of the pilot study, and we see that for a subset of problem sizes, in particular the 160×160×160 input, the three times larger L3 capacity of Milan-X yields up to 3.4x improvement over baseline Milan for this memory-bound application, which motivates us to further research 3D-stacked caches.
Fig. 2. A sample of representative server-grade CPUs of each generational micro-architecture in comparison to our study of LARC; Left: total on-chip last-level cache (in GiB); Right: per-core last-level cache (in MiB) for the same CPUs; The two LARC variants will be discussed in detail in Sec. 5.1.
3D integrated circuits have various benefits [52], including (i) shorter wire lengths in the interconnect leading to reduced power consumption, (ii) improved memory bandwidth through on-chip integration that can alleviate performance bottlenecks in memory-bound applications, (iii) higher package density yielding more compute and smaller system footprint, and (iv) possibly lower fabrication cost due to smaller die size (thus improved yield). All these are very desirable benefits in today's exascale (and future) High-Performance Computing (HPC) systems. But how far can 3D ICs (with a focus on increased on-chip cache) take us in HPC?
Contributions: We study our research questions from three different levels of abstraction: (i) we design a novel exploration framework that allows us to simulate HPC applications running on a hypothetical processor with an infinitely large L1D cache. We use this framework, which is orders of magnitude faster than cycle-accurate simulators, to estimate an upper bound for cache-based improvements; (ii) we model a hypothetical LARge Cache processor (LARC), which builds on the design of A64FX, with a last-level cache (LLC) composed of eight stacked SRAM dies under a 1.5 nm manufacturing assumption; (iii) we complement our study with a plethora of simulations of HPC proxy-applications and CPU micro-benchmarks; and lastly (iv) we find that over half (31 out of 52) of the simulated applications experience a ≥ 2x speedup on LARC's Core Memory Group (CMG), which occupies only one fourth the area of the baseline A64FX CMG. For applications that are responsive to larger cache capacity, this would translate to an average improvement of 9.56x (geometric mean) when we assume ideal scaling and compare at the full chip level.
The novelty in this paper lies in the purpose which LARC serves, and not the design of LARC itself. As Figure 2 shows, the capacity (and bandwidth; not shown) of the LLC has increased at a moderately gradual slope over the last two decades, with Milan-X being a noticeable outlier in per-core LLC. In contrast, we investigate the effect on HPC applications of an LLC that is an order of magnitude above the trend line depicted in Figure 2. On top of our provided baseline, further application-specific restructuring to utilize large caches [69] will result in even greater benefit.

CPUS EMPOWERED WITH HIGH-CAPACITY CACHE: THE FUTURE OF HPC?
The memory bandwidth of modern systems has been the bottleneck (the "memory wall" [71]) ever since CPU performance started to outgrow the bandwidth of memory subsystems in the early 1990s [70]. Today, this trend continues to shape the performance optimization landscape in high-performance computing [83,85]. Diverse memory technologies are emerging to overcome said data movement bottleneck, such as Processing-in-Memory (PIM) [12], 3D-stackable High-Bandwidth Memory (HBM) [74], deeper (and more complex) memory hierarchies [115], and, as the topic of the present paper, novel 3D-stacked caches [14,68,98].
In this study, our aspiration is to gauge the far end of processor technology and how it may evolve in six to eight years from now, circa 2028, when processors using 1.5 nm technology are expected to be available according to the IEEE IRDS Roadmap [53, Figure ES9]. More specifically, as 3D-stacked SRAM memory [120] becomes more common, what are the performance implications for common HPC workloads, and what new challenges lie ahead for the community? However, before attempting to understand what performance may look like six years from now, we must describe how the processor itself might change. In this section, we introduce, motivate, and reason about our design choices of what we envision as a hypothetical CPU that capitalizes on large capacity 3D-stacked cache, briefly called LARC (LARge Cache processor). Before looking at LARC, we must first set and analyze a baseline processor.

LARC's Baseline: The A64FX Processor
We choose to base our future CPU design on the A64FX [118]. Fujitsu's Arm-based A64FX is powering Supercomputer Fugaku [96], leader of the HPCG (TOP500 [104]; cf. Section 3.3) and Graph500 performance charts. A64FX is manufactured in 7 nm technology and has a total of 52 Arm cores (with Scalable Vector Extensions [103]) distributed across four compute clusters, called Core Memory Groups (CMGs). In each CMG, twelve cores are available to the user, and one core is exclusively used for management. Each core has a local 64 KiB instruction cache and a local 64 KiB data cache, and is capable of delivering 70.4 Gflop/s (IEEE-754 double-precision) performance, which accumulates to 845 Gflop/s per CMG (user cores) or 3.4 Tflop/s for the entire chip. Each CMG contains an 8 MiB L2 cache slice, delivering over 900 GB/s bandwidth to the CMG [118]. The combined L2 cache, which is the CPU's 32 MiB last level cache (LLC), is kept coherent through a ring interconnect that connects the four CMGs. Inside the CMG, a crossbar switch is used to connect the cores and the L2 slice. The L2 cache has 16-way set associativity, a line size of 256 bytes, and the bus width between the L1 and L2 cache is set to be 128 bytes (read) and 64 bytes (write).
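The peak-performance figures quoted above follow directly from the per-core throughput and the core counts. The short sketch below reproduces that arithmetic; the constant and function names are our own, chosen only for illustration.

```python
# Figures quoted in the text for A64FX.
GFLOPS_PER_CORE = 70.4      # IEEE-754 double precision, per core
USER_CORES_PER_CMG = 12     # the 13th core of each CMG is reserved for management
CMGS_PER_CHIP = 4


def peak_gflops(cores: int, gflops_per_core: float = GFLOPS_PER_CORE) -> float:
    """Peak double-precision throughput of `cores` identical cores."""
    return cores * gflops_per_core


cmg_gflops = peak_gflops(USER_CORES_PER_CMG)
chip_gflops = CMGS_PER_CHIP * cmg_gflops

print(f"per CMG : {cmg_gflops:.1f} Gflop/s")
print(f"per chip: {chip_gflops / 1000:.2f} Tflop/s")
```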
We emphasize that our aim is not to propose a successor of A64FX, nor are we particularly restricting our vision by the design constraints of A64FX (e.g., power budget). However, we build our design on A64FX because: (i) As mentioned above, A64FX represents the high end in performance for commercially available CPUs, so it is a logical starting point. (ii) A64FX is the only commercially available CPU, currently in continued production, with HBM. The expected bandwidth ratio between future HBM and future 3D-stacked caches is similar to the ratio between traditional DRAM and LLC bandwidths [80], which is what applications and performance models are accustomed to. (iii) The A64FX LLC design (particularly the L2 slices connected by a crossbar switch) happens to be convenient and thus requires minimal effort to extend in a simulated environment.
In conclusion, while we extend the A64FX architecture, our workflow itself can be generalized to cover any of the processors supported by CPU simulators (e.g., variants of gem5 [13] can simulate other architectures, including x86).

Floorplan Analysis for Fujitsu A64FX
In order to estimate the floorplan of the future LARC processor built on 1.5 nm technology, we first need the floorplan of the current A64FX processor built at 7 nm. We do know that the die size of A64FX is ≈ 400 mm² [96]. With the openly available die shots including processor core segments highlighted [82], we can estimate most of the A64FX floorplan, including the size of CMGs and processor cores, as shown in Figure 3. Overall, each CMG is ≈ 48 mm² in area, where an A64FX core occupies ≈ 2.25 mm². The remaining parts of the CMG consist of the L2 cache slice and controller as well as the interconnect for intra-CMG communication.
Fig. 3. Difference between A64FX's Core Memory Group (CMG) and a LARC CMG in various performance-governing parameters; Most notable (for our study) is the 48x increase in per-CMG L2 cache capacity; Note: despite appearing similar in the figure, the LARC CMG is, in fact, four times smaller.

From A64FX's to LARC's CMG Layout
Knowing the floorplan, we proceed to describe how we envision the CMG design with 1.5 nm technology. We scale the CMG by moving four generations, from 7 nm to 1.5 nm, and reduce the silicon footprint by around 8x (≈ 1.7x per generation) for the entire CMG [39]. The new CMG consumes as little as 6 mm² of silicon area. Next, we reclaim the area currently occupied by the L2 cache and controller and replace it with three additional CPU cores, yielding a total of 16. Further, in line with the projected year 2019→2028 growth in the number of cores [54, Table SA-1], we double the core count of the CMG to 32, which leads to it occupying ≈ 12 mm² of silicon area. We pessimistically leave the interconnect area unchanged and continue to use it as the primary means for communication. We call this new variant LARC's CMG. Finally, we assume the same die size, and hence, LARC would have 16 CMGs, each with 32 cores, in comparison to A64FX's 4 CMGs with 12+1 cores each. For LARC, we ignore the management core. However, our performance analysis will remain on the CMG level, instead of full chip, due to limitations we detail in Section 3.2.
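The scaling steps above can be traced numerically. The following sketch is only an illustration of that arithmetic: the variable names are ours, and deriving the CMG count from the silicon area that A64FX spends on its four CMGs (4 × 48 mm²) is our reading of the "same die size" assumption.

```python
A64FX_CMG_AREA_MM2 = 48.0   # estimated from die shots (7 nm)
A64FX_CMGS = 4
AREA_SHRINK = 8.0           # assumed footprint reduction from 7 nm to 1.5 nm
GENERATIONS = 4

# Per-generation shrink factor implied by an 8x reduction over four generations.
per_generation = AREA_SHRINK ** (1 / GENERATIONS)

# Shrunk CMG: 13 original cores plus 3 cores in the reclaimed L2 area.
shrunk_cmg_area = A64FX_CMG_AREA_MM2 / AREA_SHRINK
shrunk_cmg_cores = 13 + 3

# LARC CMG: core count doubled, area assumed to double with it.
larc_cmg_cores = 2 * shrunk_cmg_cores
larc_cmg_area = 2 * shrunk_cmg_area

# Assumption: LARC spends the same silicon area on CMGs as A64FX does.
cmg_area_budget = A64FX_CMGS * A64FX_CMG_AREA_MM2
larc_cmgs = int(cmg_area_budget // larc_cmg_area)

print(f"shrink per generation: {per_generation:.2f}x")
print(f"LARC CMG: {larc_cmg_cores} cores on {larc_cmg_area:.0f} mm^2")
print(f"LARC chip: {larc_cmgs} CMGs on {cmg_area_budget:.0f} mm^2")
```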

LARC's Vertically Stacked Cache
In the above design, we removed the L2 cache and controller from the CMG of LARC. We now assume that the L2 cache can be directly placed vertically on the CMG through 3D stacking [68]. We build our estimations based on experiments from Shiba et al. [98], who demonstrated the feasibility of stacking up to eight SRAM dies on top of a processor using a ThruChip Interface (TCI). The capacity and bandwidth of stacked memory is a function of several parameters: the number of channels available ($N_{ch}$), the per-channel capacity ($C_{cap}$ in KiB), their width ($W$ in bytes), the number of stacked dies ($N_{dies}$), and the operating frequency ($f_{clk}$ in GHz). Shiba et al. [98] estimated that at a 10 nm process technology, eight stacks would provide ≈ 512 MiB of aggregated SRAM capacity for a footprint of ≈ 121 mm². In their design, each stack has 128 channels of 512 KiB capacity. In our work, we conservatively assume an 8x scaling from 10 nm to 1.5 nm, and thus, at 12 mm² area (the size of one LARC CMG, roughly one tenth of their footprint), $N_{ch}$ on each die would be ≈ 102 (= 128 · 8 / 10).
We approximate $N_{ch}$ by a nearby number that is a sum of powers of two, viz., $N_{ch} = 96$. Thus, with eight stacked dies ($N_{dies} = 8$), our 3D SRAM cache has a total storage capacity of $N_{dies} \cdot N_{ch} \cdot C_{cap} = 384$ MiB per CMG. We estimate the bandwidth in a similar way. We know from previous studies [98] that 3D-stacked SRAM, built on 40 nm technology, can operate at 300 MHz. We conservatively expect the same SRAM to operate at ($f_{clk}$ =) 1 GHz when moving from 40 nm→1.5 nm. To account for the increased working set size of future applications, we assume a channel width ($W$) of 16 bytes, compared to the 4 byte width assumed in [98]. With this, the CMG bandwidth becomes $N_{ch} \cdot f_{clk} \cdot W = 1536$ GB/s. The read and write latency of their SRAM cache is 3 cycles, including the vertical data movement overhead [98].
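To make the capacity and bandwidth estimates easy to re-derive for other design points, the two products can be written as a small function. This is a minimal sketch; the function and parameter names are hypothetical and simply mirror the five parameters listed above (channels, per-channel capacity, channel width, stacked dies, and clock frequency).

```python
def stacked_sram(n_ch: int, cap_kib: int, width_bytes: int,
                 n_dies: int, f_clk_ghz: float) -> tuple[float, float]:
    """Return (capacity in MiB, bandwidth in GB/s) of a stacked SRAM cache.

    Capacity  = dies * channels * per-channel capacity.
    Bandwidth = channels * clock frequency * channel width.
    """
    capacity_mib = n_dies * n_ch * cap_kib / 1024
    bandwidth_gbs = n_ch * f_clk_ghz * width_bytes
    return capacity_mib, bandwidth_gbs


# Channel estimate used in the text: 128 channels, 8x density scaling,
# and a footprint of roughly one tenth of the reference design.
channels_scaled = 128 * 8 / 10

# LARC CMG design point: 96 channels of 512 KiB, 16 B wide, 8 dies, 1 GHz.
capacity_mib, bandwidth_gbs = stacked_sram(
    n_ch=96, cap_kib=512, width_bytes=16, n_dies=8, f_clk_ghz=1.0)

print(f"scaled channel estimate: {channels_scaled:.1f}")
print(f"capacity : {capacity_mib:.0f} MiB per CMG")
print(f"bandwidth: {bandwidth_gbs:.0f} GB/s per CMG")
```

Keeping the five parameters explicit makes it straightforward to explore alternatives, such as the 4 byte channel width of the reference design, without touching the formulas.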
While stacked DRAM caches theoretically provide higher capacity than stacked SRAM caches, they have limitations. For example, the latency of stacked DRAM is only 50% lower compared to DDR3 DRAM, and hence, they exacerbate miss latency; they require refresh operations, which consume energy and reduce availability; and due to their large size, stacked DRAM caches require special techniques for managing metadata and avoiding bandwidth bloat [23,74]. The tag size of a stacked DRAM may exceed the LLC capacity, and hence, the tags may need to be stored in the DRAM itself, which worsens hit latency. Set-associative designs and serial tag-data accesses further increase hit latency. Proposed architectural techniques and mitigation strategies, such as Loh-Hill cache [67], have yet to solve these problems. By contrast, 3D SRAM caches do not suffer from any of these issues. In fact, at iso-capacity, a 3D SRAM cache has even lower access latency than a 2D SRAM cache. Since stacked 3D SRAM caches have lower capacity than stacked DRAM, their metadata (e.g., tags) can be easily stored in SRAM itself, further reducing the access latency.
For our cache design, we assume a 256 B cache block, which avoids bandwidth bloat. Each tag takes 6 B and as such, the total tag array size for each CMG becomes 9 MiB. This tag array can be easily placed in the cache itself. We assume that tag and data accesses happen sequentially. The tags and data of a cache set are stored on a single die. Hence, on every access, only one die needs to be activated. Since this takes only a few cycles, the overall miss penalty remains small and comparable to that of A64FX's LLC.
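The tag array size follows from the number of cache blocks in the 384 MiB per-CMG cache. A brief sketch of the calculation, with illustrative variable names of our own, is:

```python
CACHE_MIB = 384      # stacked L2 capacity per CMG
BLOCK_BYTES = 256    # cache block (line) size
TAG_BYTES = 6        # storage per tag

MIB = 1024 * 1024

blocks = CACHE_MIB * MIB // BLOCK_BYTES     # number of cache blocks
tag_array_mib = blocks * TAG_BYTES / MIB    # total tag storage
tag_overhead = tag_array_mib / CACHE_MIB    # fraction of the cache capacity

print(f"{blocks} blocks, {tag_array_mib:.0f} MiB of tags "
      f"({tag_overhead:.1%} of the cache)")
```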
To show that our cache projections are realistic, we compare them with AMD's 3D V-Cache design. It uses a single stacked die for the L3 cache, providing 64 MiB capacity (in addition to the 32 MiB cache in the base die) at 7 nm [26,40] and only 3 to 4 cycles of extra latency compared to the non-stacked version [21]. It has an area of 36 mm² and a bandwidth of 2 TB/s. When stacking additional dies on top, and assuming an 8x scaling of the area by going from 7 nm to 1.5 nm, we speculate that the LLC capacity of this commercial processor could easily exceed that of our proposed LARC.

LARC's Core Memory Group (CMG)
At last, we detail our experimental CMG built on a hypothetical 1.5 nm technology: the LARC CMG. An illustration of this system is shown in Figure 3. Each CMG consists of 32 A64FX-like cores, which keep the L1 instruction and data caches at 64 KiB each, yielding a per-CMG performance of ≈ 2.3 Tflop/s (IEEE-754 double-precision). A 384 MiB L2 cache is stacked vertically on top of the CMG through eight SRAM layers.
We keep the HBM memory bandwidth per CMG at its current A64FX value of 256 GB/s to be able to quantify performance improvements from the proposed large capacity 3D cache in isolation from any improvements that would come from increased HBM bandwidth. Furthermore, we make no assumption on the technology scaling of blocks that contain hard-to-scale-down analog components (e.g., TofuD or PCIe IP blocks) and instead focus exclusively on scaling the CMG part of the System-on-Chip (i.e., processing cores, L1/L2 caches, and intra-chip interconnects).
While our study focuses on evaluating a single CMG, we conclude that a complete, hypothetical LARC CPU, with a die size similar to the current A64FX, would contain 512 processing cores, 6 GiB of stacked L2 cache, a peak L2 bandwidth of 24.6 TB/s, a peak HBM bandwidth of 4.1 TB/s, and a total of 36 Tflop/s of raw, double-precision compute. The A64FX processor has a peak HBM bandwidth of 1 TB/s, whereas our envisioned LARC CPU has 4× more CMGs and hence four times the peak HBM bandwidth. Thus, compared to A64FX, LARC has a higher effective bandwidth to external memory. Further changes to the HBM generation are beyond the scope of this study.
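The full-chip figures are the per-CMG values multiplied by the number of CMGs. The sketch below, using illustrative names of our own, recomputes them from the per-CMG parameters stated earlier.

```python
CMGS = 16
CORES_PER_CMG = 32
GFLOPS_PER_CORE = 70.4     # double precision, as for A64FX
L2_MIB_PER_CMG = 384
L2_GBS_PER_CMG = 1536
HBM_GBS_PER_CMG = 256

chip = {
    "cores": CMGS * CORES_PER_CMG,
    "l2_gib": CMGS * L2_MIB_PER_CMG / 1024,
    "l2_tbs": CMGS * L2_GBS_PER_CMG / 1000,
    "hbm_tbs": CMGS * HBM_GBS_PER_CMG / 1000,
    "tflops": CMGS * CORES_PER_CMG * GFLOPS_PER_CORE / 1000,
}

for name, value in chip.items():
    print(f"{name:8s} {value:.1f}")
```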

LARC's Power and Thermal Considerations
To estimate the power consumption of LARC, we analyze A64FX's current consumption and extrapolate to 1.5 nm by leveraging public technology roadmaps. A64FX's peak power, achieved while running DGEMM, is 122 W [117], of which 95 W correspond to core power and 15 W correspond to the memory interface (MIF); hence, we conclude 1.98 W/core and 3.75 W/MIF. Therefore, a LARC CMG with 32 cores in 7 nm would consume 67.1 W. TSMC projects that shrinking from 7 nm to 5 nm yields a power reduction of about 30% [99], i.e., 46.98 W for LARC's CMG in 5 nm. IRDS's roadmap [53, Figure ES9] indicates a further compounded power reduction (at iso-frequency) of 42% when moving from 5 nm to 1.5 nm, i.e., 27.37 W for LARC's CMG in 1.5 nm. As the full LARC chip is estimated to include 16 CMGs, we project a total power of 438 W (not including the L2 cache).
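The extrapolation chain can be followed step by step in a few lines. This is a rough sketch with names of our own; it applies the rounded reduction percentages exactly as quoted, so its 1.5 nm result lands within about half a percent of the 27.37 W stated in the text, which presumably relies on an unrounded reduction factor.

```python
CORE_POWER_W = 95.0     # A64FX core power under DGEMM
MIF_POWER_W = 15.0      # A64FX memory-interface power
USER_CORES = 48         # 4 CMGs x 12 user cores
MIFS = 4                # one memory interface per CMG

w_per_core = round(CORE_POWER_W / USER_CORES, 2)
w_per_mif = MIF_POWER_W / MIFS

# LARC CMG: 32 cores and one memory interface.
cmg_7nm_w = 32 * w_per_core + w_per_mif
cmg_5nm_w = cmg_7nm_w * (1 - 0.30)       # about 30% reduction, 7 nm to 5 nm
cmg_1p5nm_w = cmg_5nm_w * (1 - 0.42)     # about 42% reduction, 5 nm to 1.5 nm
chip_1p5nm_w = 16 * cmg_1p5nm_w          # 16 CMGs, excluding the L2 cache

print(f"{w_per_core:.2f} W/core, {w_per_mif:.2f} W/MIF")
print(f"CMG: {cmg_7nm_w:.2f} W (7 nm), {cmg_5nm_w:.2f} W (5 nm), "
      f"{cmg_1p5nm_w:.2f} W (1.5 nm)")
print(f"chip: {chip_1p5nm_w:.0f} W")
```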
Next, we estimate the power consumed by the principal part of this study: the 384 MiB L2 cache. A 4 MiB SRAM L2 cache in 7 nm consumes 64 mW of static power [44]. Assuming a similar (pessimistic) static power consumption at 1.5 nm and extrapolating to 384 MiB, we find that our cache would have a static power consumption of 6.14 W. Scaled to the full 16 CMGs of our hypothetical LARC, we arrive at a static power consumption of 98.3 W. This static power consumption of caches represents between 90% and 98% of the entire power consumption (at 350 K temperature, see e.g., [5,20]), where the remainder is the dynamic power consumption. If we assume a pessimistic 9:1 ratio between static and dynamic power, then this yields a total power consumption of 109.23 W for 6 GiB of chip-wide stacked L2 cache.
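The cache power estimate is a linear extrapolation followed by a static-to-dynamic split, which the following sketch reproduces. The names are illustrative only.

```python
STATIC_W_PER_4MIB = 0.064   # static power of a 4 MiB SRAM L2 cache in 7 nm
CACHE_MIB_PER_CMG = 384
CMGS = 16
STATIC_SHARE = 0.9          # pessimistic 9:1 static-to-dynamic ratio

static_cmg_w = STATIC_W_PER_4MIB * CACHE_MIB_PER_CMG / 4
static_chip_w = CMGS * static_cmg_w
total_chip_w = static_chip_w / STATIC_SHARE   # static plus dynamic power

print(f"static per CMG : {static_cmg_w:.2f} W")
print(f"static per chip: {static_chip_w:.1f} W")
print(f"total per chip : {total_chip_w:.2f} W")
```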
To conclude, a LARC processor (16 CMGs) would have to be designed for a thermal design power (TDP) of 547 W. While this expected TDP is more than that of the current A64FX, it is not unlike that of emerging architectures, such as NVIDIA's H100 [81] that consumes up to 700 W or the AMD Instinct MI250X GPU [3] at 560 W. We stress that our estimate of 547 W is the peak power draw achieved only during parallel DGEMM execution. Adjusting for Stream Triad, based on the breakdown in [117], we conclude a realistic, and considerably lower, power consumption of 420 W for bandwidth-bound applications running on the whole LARC chip.
Finally, while this L2 cache power estimation might appear pessimistic, there are ample opportunities to further reduce power consumption. To save static energy, all the un-accessed dies can be changed to a data-retentive, low-power (sleep) state. To deal with remaining thermal issues after stacking the cache layers underneath the cores instead of on top, one can additionally adopt simple direct-die cooling or advanced techniques [18,106], such as high-κ thermal compound [42], microfluid cooling [114], or thermal-aware floorplanning, task-scheduling, and data-placement optimizations. Specifically, microfluid cooling can handle power densities of 3.5 W/mm² and hot-spot power levels of over 20 W/mm² for 3D-stacked chips [1]. By contrast, our LARC CPU has a power density of 2.85 W/mm² at 192 mm² if we ignore adjunct components such as I/O die, PCIe, TofuD interface, etc., and around half the power density at 400 mm² if these components are included.

PROJECTING PERFORMANCE IMPROVEMENT IN SIMULATED ENVIRONMENTS
Analyzing LARC's feasibility is only the first step, and hence we have to demonstrate the effects of the proposed changes on real workloads to allow a meaningful cost-benefit analysis by CPU vendors. This section details two simulation approaches (one novel; one established) and discusses the HPC applications, which we evaluate extensively in Sections 4 and 5.

Simulating Unrestricted Locality with MCA
Designing and executing even initial studies (i.e., no complex memory models, etc.) with cycle-level gem5 simulations for realistic workloads takes substantial time with unknown outcome. Therefore, one would want to have a first-order approximation of a very large and fast cache. Regrettably, and to the best of our knowledge, existing approaches for fast first-order approximations generally do not support complex HPC applications, i.e., the existing tools neither handle multi-threading correctly nor do they have support for MPI applications [6]. Hence, we design a simulation approach, using Machine Code Analyzers (MCA), which can estimate the speedup for a given application orders of magnitude faster than gem5 (typically hours instead of months; cf. next section). This upper bound in expected performance improvement allows us to: (i) get a perspective on the best possible performance improvement if all reads and writes can be satisfied from the cache; and (ii) justify more accurate simulations and classify their results with respect to the baseline and the upper bound.
Machine Code Analyzers, such as llvm-mca [66], have been designed to study microarchitectures, improve compilers, and investigate resource pressure for application kernels. Usually, the input for these tools is a short Assembly sequence and they output, among other things, an expected throughput for a given CPU when the sequence is executed many times and all data is available in L1 data cache. For most real applications, the latter assumption is obviously incorrect; however, it is ideal to gauge an upper bound on performance when all the memory bottlenecks disappear.
Unfortunately, it is neither feasible to record all executed instructions in one long sequence, nor to analyze a full program sequence with llvm-mca. Hence, we break the program execution into basic blocks (at most tens or hundreds of instructions) and evaluate their throughput individually. For a given combination of a program and input (called workload hereafter), the basic blocks and their dependencies create a directed Control Flow Graph (CFG) [56] with one source (program start) and one sink (program termination). All intermediate nodes (representing basic blocks) of the graph can have multiple parent and dependent nodes, as well as self-references (e.g., basic blocks of for-loops). Knowing the "runtime" of each basic block and the number of invocations per basic block, we can estimate the runtime of the entire workload by summation of the parts.
We utilize the Software Development Emulator (SDE) [57] from Intel to record the basic blocks and their caller/callee dependencies for a workload with modest runtime overhead (typically on the order of 1000x slowdown). SDE also notes down the number of invocations per CFG edge for a workload, i.e., how often the program counter (PC) jumped from one specific basic block to another specific block. We developed a program which parses the output of Intel SDE and establishes an internal representation of the Control Flow Graph. The internal CFG nodes are then amended with Assembly extracted from the program's binary, since SDE's Assembly output is not compatible with Machine Code Analyzers. Our program subsequently executes a Machine Code Analyzer for each basic block, getting in return an estimated cycles-per-iteration metric (CPIter). We record the per-block CPIter at the directed CFG edge from caller to callee, which already holds the number of invocations of this edge, effectively creating a "weighted" graph. Figure 4 showcases the result, and it is easy to see that the summation over all edges in the CFG is equivalent to the estimated runtime of the entire workload (assuming all data is inside the L1 data cache).
The above outlined approach works for both sequential and parallel programs. Intel SDE can record the instruction execution and caller/callee dependencies for thread-parallel programs, e.g., pthreads, OpenMP, or TBB. Furthermore, we can attach SDE to individual MPI ranks to get the data for each of them. Therefore, we are able to estimate the runtime for MPI+X parallelized HPC applications by the following equation:
$$t_{app} := \frac{\max_{r \in \text{ranks}} \left( \max_{t \in \text{threads}_r} \left( \sum_{e \in \text{edges}_{r,t}} \text{CPIter}_e \cdot \#\text{calls}_e \right) \right)}{\text{processor frequency in Hz}} \qquad (1)$$
under the assumption that MPI ranks and threads do not share computational resources¹, where we sum up the number of cycles required for each block (i.e., CFG edges) considering only the "slowest" thread and rank, and divide by the CPU frequency to convert the total cycles into runtime.
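The runtime estimate described here, namely summing invocation-weighted cycle counts over all CFG edges of the slowest thread and rank and dividing by the clock frequency, can be sketched in a few lines. This is not our tool's implementation; the data layout (a nested mapping from rank to thread to edge list) and all names are assumptions made for illustration, and the edge values are made up.

```python
from typing import Dict, List, Tuple

# One CFG edge: (number of invocations, CPIter of the callee block).
Edge = Tuple[int, float]
# Hypothetical layout: rank id -> thread id -> list of CFG edges.
Profile = Dict[int, Dict[int, List[Edge]]]


def thread_cycles(edges: List[Edge]) -> float:
    """Total estimated cycles of one thread: sum over its weighted CFG edges."""
    return sum(calls * cpiter for calls, cpiter in edges)


def estimated_runtime(profile: Profile, freq_hz: float) -> float:
    """Runtime in seconds, determined by the slowest thread of the slowest rank."""
    slowest = max(
        max(thread_cycles(edges) for edges in threads.values())
        for threads in profile.values()
    )
    return slowest / freq_hz


# Made-up example with two ranks; rank 0 runs two threads.
profile: Profile = {
    0: {0: [(1000, 4.0), (10, 25.0)],   # 4250 cycles
        1: [(2000, 3.0)]},              # 6000 cycles (slowest)
    1: {0: [(500, 8.0)]},               # 4000 cycles
}
runtime_s = estimated_runtime(profile, freq_hz=2.0e9)
print(f"estimated runtime: {runtime_s * 1e6:.2f} us")
```

Taking the maximum rather than the sum across threads and ranks reflects the stated assumption that they do not share computational resources.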
The self-imposed restriction of Machine Code Analyzers is the limited accuracy compared to cycle-accurate simulators, due to their distinct design goal. To improve our CPIter estimate, we rely on four different MCAs, namely llvm-mca [66], Intel ACA (IACA) [55], uiCA [2], and OSACA [65], and take the median of the results. Another shortcoming of MCA tools is that most of them estimate the throughput of basic blocks in isolation while assuming looping behavior of the assembly block (PC jumps from last back to first instruction). Neither "block looping" nor an empty instruction pipeline (single iteration of the block) are realistic for some blocks. Hence, for non-looping basic blocks, we estimate the CPIter by feeding the MCA tool with the blocks of caller and callee, and the callee's CPIter is calculated by subtracting the cycle of retirement of its last instruction from the caller's last instruction retirement (instead of when the callee's first instructions are decoded, which can overlap with execution of caller instructions). Further, we correct some cycle estimates for specific instructions within our tool in post-processing, since we encountered a few unsupported or grossly mis-estimated instructions while validating our tool against benchmarks. We refer the reader to Section 4.1 for more details.
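Combining the four analyzers by their median makes the per-block estimate robust against a single tool that grossly mis-estimates a block. A minimal sketch follows; the function name, the handling of tools that return no estimate, and the CPIter values are all assumptions for illustration and not measured data.

```python
from statistics import median
from typing import Dict, Optional


def combined_cpiter(estimates: Dict[str, Optional[float]]) -> float:
    """Median CPIter over all analyzers that produced an estimate.

    A value of None marks a tool that could not analyze the block
    (an assumed convention, e.g., for unsupported instructions).
    """
    valid = [value for value in estimates.values() if value is not None]
    if not valid:
        raise ValueError("no analyzer produced an estimate for this block")
    return median(valid)


# Made-up estimates for one basic block; OSACA is an outlier here.
block_estimates = {"llvm-mca": 12.0, "IACA": 10.5, "uiCA": 11.0, "OSACA": 30.0}
cpiter = combined_cpiter(block_estimates)
print(f"combined CPIter: {cpiter}")
```

With an even number of estimates, `statistics.median` returns the mean of the two middle values, so the outlier in the example has no influence on the result.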

Cycle-level Accuracy: CPUs Simulated in gem5
While the MCAs can give a first-order approximation, we still require highly accurate predictions for our 3D-stacked, cache-rich CPU. Hence, we employ an open-source system architecture simulator, called gem5 [13]. It supports Arm, x86, and RISC-V CPUs to varying degrees of accuracy, and can be extended with memory models for higher simulation fidelity of the memory subsystem. We use gem5's "syscall emulation" mode to execute applications directly without booting a Linux kernel.
Fortunately, RIKEN released their gem5 version which was specially tailored for A64FX's co-design to support SVE, HBM2, and other advanced features [94]. Hence, it is well suited to simulate our LARC proposal in Section 2.4. This version of gem5 has been validated for A64FX [62], and can be used with production compilers from Fujitsu. However, while evaluating RIKEN's gem5, we noticed a few drawbacks, such as the lack of support for: (i) dynamically linked binaries; (ii) adequate memory management (freeing memory after an application's free() calls); (iii) simulating more than 16 CPU cores due to limits in the cache coherence protocol; (iv) multi-rank MPI-based programs; and (v) simulating more than one A64FX CMG.
We modify gem5 to remedy the first three problems. However, the last two problems remain intractable without major changes to the simulator's codebase, and hence we limit ourselves to single-CMG simulations (with one MPI rank). Relying on the assumption that most HPC codes are weak-scaled across multiple NUMA domains and compute nodes, we believe the single-rank approach still serves as a solid foundation for future performance projection. However, even single-rank MPI binaries require numerous unsupported system calls. To circumvent this problem, we extend and deploy an MPI stub library [101].

Relevant HPC (Proxy-)Apps and Benchmarks
Instead of relying on a narrow set of cherry-picked applications, we attempt to cover a broad spectrum of typical scientific/HPC workloads. We customize and extend a publicly available benchmarking framework² [34,35] with a few additional benchmarks and necessary features to perform the MCA- and gem5-based simulations. The benchmark complexity ranges from simple kernels to large code bases (on the order of 100,000s of lines of code) which are used by vendors for architecture comparisons and by HPC centers for hardware procurements [41]. Hereafter, we detail the list of 127 included workloads, summed up across all benchmark suites, which are sized to fit within a single node and which could be simulated with gem5 in a reasonable time (≤ six months).
Polyhedral Benchmark Suite. The PolyBench/C suite contains 30 single-threaded, scientific kernels which can be parameterized in memory occupancy (∈ [16 KiB, 120 MiB]) [90]. Unless stated otherwise, we use the largest configuration.
TOP500, STREAM, and Deep Learning Benchmarks. High Performance Linpack (HPL) [36] solves a dense system of linear equations $Ax = b$ of size 36,864 in our case. High Performance Conjugate Gradients (HPCG) [37] applies a conjugate gradient solver to a system of linear equations (with sparse matrix $A$). We choose 120³ for HPCG's global problem size. BabelStream [29] evaluates the memory subsystem of CPUs and accelerators, and we configure 2 GiB input vectors. Moreover, we implement a micro-benchmark, DLproxy, to isolate the single-precision GEMM operation (with matrix dimensions 1577088, 27, and 32) which is commonly found in 2D deep convolutional neural networks, such as 224×224 ImageNet classification workloads [111].
NASA Advanced Supercomputing Parallel Benchmarks. The NAS Parallel Benchmarks (NPB) [11,110] consist of nine kernels and proxy-apps which are common in computational fluid dynamics (CFD). The original MPI-only set has been expanded with ten OpenMP-only benchmarks [60] and we select the class B input size for all of them.
RIKEN's Fiber Mini-Apps and TAPP Kernels. To aid the co-design of Supercomputer Fugaku, RIKEN developed the Fiber proxy-application set [92], a benchmark suite representing the scientific priority areas of Japan. Additionally, RIKEN released scaled-down TAPP kernels [93] of their priority applications which are tailored for fast simulations with gem5 [62]. Our workloads are as follows: FFB [46] with the 3D-flow problem discretized into 50×50×50 sub-regions; FFVC [84] using 144×144×144 cuboids; MODYLAS [9] with the wat222 workload; mVMC [73] with the strong-scaling test reduced to 1/8th of the samples and 1/3rd of the lattice size; NICAM [108] with a single (not 11) simulated day; NTChem [78] with the H2O workload; QCD [16] with the class 2 input.

MCA-BASED SIMULATION RESULTS
Sections 4.1 and 4.2 are dedicated to our MCA-based estimation of the upper bound on performance improvement with abundant L1 cache. First, we evaluate the accuracy of this approach, and then apply the novel methodology to our benchmarking sets.

MCA-based Simulator Validation
During the development of our MCA-based simulator, we implemented numerous micro-benchmarks to fine-tune the CPI estimation capabilities while comparing the results to an Intel® Xeon® processor E5-2650v4 (formerly code named Broadwell). Our micro-benchmarks comprise MPI-/OpenMP-only, MPI+OpenMP, and single-threaded tests (exercising recursive functions, floating-point- or integer-intensive operations, L1-localised, or stream-like operations).
Needless to say, applying MCA-based simulations to full workloads or complex application kernels is still error-prone, since these tools are designed to analyze small assembly sequences without guaranteeing accurate absolute performance numbers. Regardless, we validate the current status of our tool using PolyBench/C with MINI inputs. In theory, these input sizes (≈ 16 KiB) should all fit into the 32 KiB L1D cache of the Broadwell. Hence, measuring the kernel execution time for these PolyBench tests should yield numbers close to the MCA-based runtime estimates. For the baseline measurements, we set all cores of the Broadwell to 2.2 GHz, set the uncore to 2.7 GHz, and disable turbo boost; we compile each workload with Intel's Parallel Studio XE and execute every test 100 times (since many only run for a few ms) to determine the fastest possible execution time. The difference between the real baseline results and our MCA-based estimates is visualized in Figure 5 as projected relative runtime difference.
The data shows that, on average, our MCA-based method slightly overestimates performance: the MCA approach predicts faster execution times than it should. Only seven out of 30 workloads are expected to run slower than what we observe on the real Broadwell (i.e., y-value ≤1). For eight of the PolyBench tests, our tool estimates the runtime to be over 2x faster than our measurements. Hence, we can conclude that for 73% of the micro-benchmarks (22 out of 30), the MCA-based method is reasonably accurate: within 2x slower to 2x faster. While a 2x discrepancy might appear high, we have to point out that our cross-validations using SST [95,113] and third-party gem5 models [7] for Intel CPUs yield similar inaccuracies, but our MCA-based method is substantially faster.
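The accuracy criterion used here (an estimate counts as reasonably accurate if it lies within a factor of two of the measurement) can be expressed as a small helper. The sketch below is illustrative only: the function name and the sample ratios are hypothetical, and only the counts 30 and 8 are taken from the text.

```python
def fraction_within_factor(ratios, factor=2.0):
    """Share of ratios r that satisfy 1/factor <= r <= factor.

    A ratio above 1 means the estimate is faster than the measurement,
    a ratio below 1 means it is slower.
    """
    if not ratios:
        raise ValueError("need at least one ratio")
    inside = [r for r in ratios if 1.0 / factor <= r <= factor]
    return len(inside) / len(ratios)


# Hypothetical ratios, NOT the values behind Figure 5.
sample = [0.8, 1.1, 1.4, 2.5, 3.0, 0.95]
print(f"{fraction_within_factor(sample):.2f} of the sample is within 2x")

# Count-based check of the share quoted in the text:
# 30 kernels, 8 of them estimated to be more than 2x faster.
within = 30 - 8
print(f"{within} of 30 kernels = {round(100 * within / 30)}%")
```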
Another indicator for the accuracy of our MCA approach can be drawn from DGEMM (the double-precision gemm benchmark in Figure 5). Theoretically, DGEMM performs close to peak and is not memory-bound for large matrices, and hence the measured runtime and MCA-based estimates are expected to match. Unfortunately, PolyBench's Gflop/s rate for gemm is far from peak (due to its hand-coded loop nest), and therefore we replace it with an Intel MKL-based implementation of equal matrix dimensions. For the PolyBench input sizes MINI, ..., EXTRALARGE in our MKL-based implementation, our MCA tool estimates a faster runtime by 6.4x, 75%, 11%, 1.9%, and 1.5%, respectively. This closely matches the achievable single-core Gflop/s of the E5-2650v4: for MINI with the MKL-based implementation, we measure only 2 Gflop/s, while for EXTRALARGE we peak out at the expected 32 Gflop/s. The low Gflop/s measurements for MINI (and SMALL) demonstrate that MKL is not yet compute-bound, which causes the 6.4x (and 75%) misprediction.

Speedup-potential with Unrestricted Locality
In this Section, we take on the entire benchmark suite from Section 3.3 with the MCA-based approach and evaluate their speedup potential when all data fits into L1.
The baseline measurements for the speedup estimates are conducted on a dual-socket Intel Broadwell E5-2650v4 system with 48 cores (2-way hyper-threading enabled, cores set to 2.2 GHz, turbo boost disabled). For all listed benchmarks, excluding SPEC CPU and OMP, we focus on the solver times only, i.e., we ignore data initialization and post-processing phases. Since most proxy-apps are parallelized with MPI and/or OpenMP, we perform an initial sweep of possible configurations of ranks and threads to determine the fastest time-to-solution (TTS) for our strong-scaling benchmarks, and the highest figure-of-merit (as reported by the benchmarks) for weak-scaling workloads. The highest-performing configuration is executed ten times to determine the TTS of the kernel as our reference point in Figure 6.
The same MPI/OMP configurations are then used for our MCA-based estimate. Under the assumption that some MPI-parallelized benchmarks experience imbalances, we randomly sample up to nine ranks (in addition to rank 0), execute the selected rank with Intel SDE (and the remaining ranks normally), and calculate the estimated runtime using Equation (1) and the 2.2 GHz processor frequency. The measured runtime is divided by the resulting runtime estimate to determine the upper-bound speedup potential per application if all its data fit into L1D, see Figure 6.
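The last step of this procedure can be sketched as follows. This is a simplified illustration rather than our tool: it assumes that the estimate reduces to a cycle count divided by the clock frequency (Equation (1) is not reproduced here), that the slowest sampled rank determines the runtime, and that speedup is taken as measured time over estimated time. All cycle counts and the measured time are hypothetical.

```python
FREQ_HZ = 2.2e9  # processor frequency used for the estimates


def estimated_runtime_s(cycles: float, freq_hz: float = FREQ_HZ) -> float:
    """Convert an estimated cycle count into seconds."""
    return cycles / freq_hz


def speedup_potential(measured_s: float, estimated_s: float) -> float:
    """Values above 1 mean the all-in-L1D estimate is faster than the measurement."""
    return measured_s / estimated_s


# Hypothetical cycle estimates for rank 0 and two sampled ranks.
rank_cycles = {0: 4.4e9, 3: 4.0e9, 7: 4.2e9}

# Assumption: with load imbalance, the slowest sampled rank bounds the runtime.
estimate_s = max(estimated_runtime_s(c) for c in rank_cycles.values())

measured_s = 5.0  # hypothetical measured time-to-solution
print(f"estimate: {estimate_s:.2f} s, "
      f"speedup potential: {speedup_potential(measured_s, estimate_s):.2f}x")
```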
For PolyBench/C workloads, we see similar speedup trends as for the smallest inputs, which we used in Section 4.1, although the expected speedup for EXTRALARGE increases to a peak of 8.4x for the ludcmp kernel. Only four kernels show no performance increase, presumably because they are compute-bound rather than bandwidth-bound: 2mm, 3mm, doitgen, and trisolv. Overall, the MCA-based approach estimates a geometric mean (GM) speedup of 2.9x from fitting all data into L1D.
RIKEN's TAPP kernels benefit the most from unrestricted locality. In particular, kernel 20 (SpMV), which represents one core function of the FFB application, shows a speedup of 20x. Altogether, we see a projection of (GM=)2.6x increased performance, but also two cases (kernels 5 and 9) where the MCA tool estimates a ≈ 50% slowdown. These two are from GENESIS [61] and NICAM, respectively, but as detailed in Section 4.1, some inaccuracy is expected as the trade-off for the faster simulation time.
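Since per-suite results are summarized by their geometric mean, a short sketch of that aggregation may be helpful. The speedup values below are hypothetical and chosen only to show how a single large outlier affects the geometric mean less than the arithmetic mean.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values, computed via logarithms."""
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-kernel speedups, NOT the values behind Figure 6.
speedups = [1.0, 1.0, 2.0, 16.0]

gm = geometric_mean(speedups)
am = sum(speedups) / len(speedups)
print(f"geometric mean: {gm:.2f}x, arithmetic mean: {am:.2f}x")
```

A slowdown of 0.5x and a speedup of 2x cancel out under the geometric mean, which is why it is the customary choice for aggregating ratios.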
Fig. 6. Projected speedup against a baseline dual-socket Intel Broadwell E5-2650v4 system while assuming all data fits into L1D with "optimistic" load-to-use latency; Top row, left to right: PolyBench, RIKEN TAPP kernels, NPB (OMP); Bottom row, left to right: NPB (MPI), TOP500 etc., ECP proxies, RIKEN Fiber apps, SPEC CPU[int/single] and CPU[float/OMP], SPEC OMP
NPB's OpenMP version of a conjugate gradient (CG) solver is another workload with a large theoretical performance gain of 13.1x. In total, we expect a (GM=)3x gain for all NAS Parallel Benchmarks; specifically, (GM=)4x for the OpenMP versions and (GM=)2.3x for the MPI versions. The potential gain for CG is not surprising, since these solvers are predominantly bound by memory bandwidth and are sensitive to memory latency [38]. High Performance Linpack is unsurprisingly not expected to gain any performance by placing all its data into L1 cache, as this benchmark is compute-bound. In fact, our MCA tool expected a small runtime decrease of 11%. By contrast, DLproxy, which uses MKL's SGEMM, would benefit from a large L1, since MKL cannot achieve peak Gflop/s for the tall/skinny matrix in this workload (cf. Section 3.3). XSBench and miniAMR show the highest gains for ECP's and RIKEN's proxy-apps, with values of 7.3x and 7.4x, respectively. This appears to be in line with expectations from the roofline characteristics of the benchmarks when measured on a similar compute node [33].
A deeper look at the roofline analysis in [33] reveals that there is no strong correlation between the position of an application on the roofline model and the expected performance gain from solely running out of L1D cache. We speculate that other, hidden bottlenecks are exposed by our MCA approach, such as data dependencies and lack of concurrency in the applications, which limit the expected speedup. Apart from noticeable outliers in the expected speedup, such as lbm, ilbdc, and especially swim, the potential from an enlarged L1D is rather slim for SPEC, and only a (GM=)1.9x runtime reduction can be expected across all 34 workloads.

GEM5-BASED SIMULATION RESULTS
In Section 5.1, we detail our choice for the simulated architectures in gem5. Structured similarly to the MCA-based simulations, Sections 5.2 and 5.3 highlight our validation of gem5 for our proposed CPU architectures and evaluate numerous benchmarks and proxy applications on said architectures, and we summarize the results in Section 5.4.

LARC CMG Models in gem5 and A64FX_S Baseline
As we discussed in Section 2.4, we envision one LARC CMG to have 32 cores, 384 MiB L2 cache, and 1.6 TB/s L2 bandwidth. Regretfully, gem5 (at least RIKEN's version) can only be configured with L2 cache sizes that are powers of two ($2^X$), and therefore we have to scale LARC's L2 cache size either up or down. Hence, we explore both as distinct options, one conservative and one technologically aggressive. Starting at a baseline, i.e., a simulated version of A64FX which we label as A64FX_S, and in order to materialize the properties of the LARC CMG (cf. Section 2.4), we modify three parameters in our gem5 model: (i) the number of cores in the system to match 32 (up from A64FX_S's baseline of 12); (ii) the size of the total L2 cache to match the capacity of the eight stacked layers (256/512 MiB, up from A64FX_S's L2 size of 8 MiB per CMG); and (iii) the number of L2 banks in LARC_A to control the bandwidth.
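The two cache sizes follow directly from the power-of-two restriction: 384 MiB lies between two admissible sizes, which become the conservative and the aggressive variant. A minimal sketch of that rounding, with an illustrative helper name, is:

```python
from typing import Tuple


def pow2_neighbors(size_mib: int) -> Tuple[int, int]:
    """Return the nearest powers of two at or below and at or above size_mib."""
    if size_mib <= 0:
        raise ValueError("size must be positive")
    lower = 1 << (size_mib.bit_length() - 1)
    upper = lower if lower == size_mib else lower << 1
    return lower, upper


envisioned_l2_mib = 384  # per-CMG L2 capacity envisioned for LARC
conservative, aggressive = pow2_neighbors(envisioned_l2_mib)
print(f"conservative: {conservative} MiB, aggressive: {aggressive} MiB")
```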
We introduce a fourth gem5 configuration, called A64FX_32, which simulates one baseline A64FX_S CMG but with 32 cores. These four configurations, A64FX_S → A64FX_32 → LARC_C → LARC_A, should allow us to determine the speedup gains from the larger core count and larger L2 cache individually. The core frequency is universally set to 2.2 GHz. Table 2 summarizes the four gem5 configurations.

gem5-based Simulation and Configuration Validation
We perform OpenMP tests to verify our gem5 simulator for up to 32 cores. For the L2 cache size and bandwidth changes, we employ a STREAM Triad benchmark, parameterized to avoid cache line conflicts among participating threads. Splitting the A64FX_S CMG L2 cache into 12 chunks (one per thread) yields a working size of 683 KiB per thread. Hence, the three 128 KiB vectors of the Triad operation will fit into the L2 cache. We increase the total vector size in proportion to the number of threads and test the achievable L2 bandwidth for LARC_C and LARC_A. Additionally, Figure 7a includes the baseline A64FX_S CMG scaled to 12 cores. The simulation shows that LARC_C's L2 bandwidth peaks out at 792 GB/s and LARC_A's bandwidth goes up to 1450 GB/s for this particular test case, which is, respectively, 1% and 9% lower than our estimates shown above. The baseline A64FX_S closely matches the bandwidth of the real A64FX CPU executing this test.
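The sizing and the bandwidth comparison above amount to simple arithmetic, sketched below. The expected bandwidths of roughly 800 GB/s and 1.6 TB/s are the design targets of the two LARC variants; the helper names are illustrative.

```python
KIB = 1024
MIB = 1024 ** 2


def per_thread_share_kib(l2_bytes: int, threads: int) -> float:
    """L2 capacity available to each thread when split evenly, in KiB."""
    return l2_bytes / threads / KIB


def shortfall_pct(measured_gbs: float, expected_gbs: float) -> float:
    """How far the measured bandwidth falls below the expectation, in percent."""
    return 100.0 * (expected_gbs - measured_gbs) / expected_gbs


share_kib = per_thread_share_kib(8 * MIB, 12)  # 8 MiB L2, 12 threads
triad_kib = 3 * 128                            # three 128 KiB Triad vectors

print(f"per-thread L2 share: {share_kib:.0f} KiB, Triad footprint: {triad_kib} KiB")
print(f"fits into the share: {triad_kib <= share_kib}")
print(f"conservative variant: {shortfall_pct(792, 800):.1f}% below target")
print(f"aggressive variant:   {shortfall_pct(1450, 1600):.1f}% below target")
```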
Another validation test we perform is setting the number of cores to the maximum (12 and 32, respectively) and scaling the vector size from 2 KiB per core to a total of 1 GiB for the three vectors. Figure 7b shows the results for this simulation. In the memory range of tens to hundreds of KiB, the Triad operation can be served from L1 cache, for which LARC_C and LARC_A show higher bandwidth. Their 2.7x higher core count results in 2.6x higher aggregated L1 bandwidth. For the memory sizes that fit into L2 cache, the Triad shows a behavior similar to Figure 7a. Past 8 MiB, the A64FX_S configuration shows the expected bandwidth drop to HBM2 level, while for LARC_C and LARC_A, the expected L2 cache bandwidth is maintained until 256 MiB and 512 MiB, respectively. This validates that our gem5 settings yield the expected LLC characteristics.
Lastly, to validate the LARC configuration and to see the effect of the changes when applied to more complex science kernels, we perform a sensitivity analysis of cache parameters with the RIKEN TAPP kernels. In Figure 8, we vary L2 cache access latency, size, and bandwidth in ranges beyond our LARC_C and LARC_A target architectures. This analysis will help us in adjusting our expectations when future LARC-like architectures deviate from our design parameters, e.g., by stacking fewer SRAM layers or having higher L2 access latency. In this parameter sweep, LARC_C is the baseline and we vary one parameter while keeping the others fixed. The top row of Figure 8 shows the latency sweep, where we choose 22 cycles as the best latency (which is 2× the data load latency from L1 for SVE instructions in A64FX). The worst case of 52 cycles is equidistant to our baseline in the opposite direction, and two additional latencies are selected in between. Similarly, we adjust the L2 size (middle row; simulating more or fewer SRAM stacks or a larger or smaller semiconductor process node) and the L2 bank bits in gem5, see bottom row of Figure 8. The latter indirectly controls the L2 bandwidth of the simulated architectures. The latency change has minimal impact, since HPC applications are typically not latency-bound. However, the L2 cache capacity and bandwidth can have a significant impact on performance, as expected, since they determine the amount of data that can be stored and accessed quickly. For some of the TAPP kernels, though, the performance is unaffected by these parameters, since these kernels are actually shrunk-down versions specifically designed for cycle-level architecture simulations, and therefore have a low memory footprint.
Fig. 9. [...] 6 for reference; TAPP kernels 3-6 (multiple Nbody kernels) and 18 (MatVecDotP) are limited to 12 threads, hence we omit A64FX_32; Missing (cf. Fig. 6) primarily due to gem5 issues or exceeding the simulation time limit. PolyBench results (single core) are also omitted due to limited speedup across all of them and no noteworthy outliers.
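The structure of the sensitivity analysis, varying one L2 parameter at a time around a fixed baseline, can be sketched as below. Only the latency endpoints (22 and 52 cycles), the baseline latency implied by their midpoint, and the 256 MiB baseline size come from the text; the bank-bit values, the remaining sweep points, and all parameter names are hypothetical and do not correspond to actual gem5 options.

```python
from typing import Dict, Iterator, List, Tuple


def one_factor_sweep(
    baseline: Dict[str, int], sweeps: Dict[str, List[int]]
) -> Iterator[Tuple[str, Dict[str, int]]]:
    """Yield (varied parameter, configuration) pairs, changing one parameter at a time."""
    for name, values in sweeps.items():
        for value in values:
            config = dict(baseline)  # copy so the baseline stays untouched
            config[name] = value
            yield name, config


BEST_LATENCY, WORST_LATENCY = 22, 52
# The worst case is equidistant to the baseline, so the baseline is the midpoint.
baseline_latency = (BEST_LATENCY + WORST_LATENCY) // 2

baseline = {
    "l2_latency_cycles": baseline_latency,
    "l2_size_mib": 256,   # conservative LARC variant
    "l2_bank_bits": 4,    # hypothetical value
}
sweeps = {
    "l2_latency_cycles": [BEST_LATENCY, baseline_latency, WORST_LATENCY],
    "l2_size_mib": [128, 256, 512],  # hypothetical sweep points
    "l2_bank_bits": [3, 4, 5],       # hypothetical sweep points
}

configs = list(one_factor_sweep(baseline, sweeps))
for varied, config in configs:
    print(varied, config)
```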

Speedup-potential with Restricted Locality
To further refine our projections gained by abundant cache, we proceed with the cycle-level simulations of the proxy-applications and benchmarks listed in Section 3.3.
We compile all benchmarks with Fujitsu's Software Technical Computing Suite (v4.6.1) targeting the real A64FX, and simulate the single-rank workloads in gem5 for our four configurations. Unfortunately, three of our MPI-based benchmarks require multi-rank MPI (MODYLAS, NICAM, and NTChem), and hence we omit them. Furthermore, we skip the MPI-only versions of NPB. Hereafter, we only report proxy applications and benchmarks which ran to completion within gem5 (i.e., gem5 crashes or simulated application crashes are excluded when infeasible to patch, and simulations exceeding the six-month time limit are ignored).
The per-configuration speedup is given relative to the baseline A64FX_S configuration. We exclude initialization and post-processing times, and measure only the main kernel runtime, except for the SPEC benchmarks as described in Section 4.2. These results are presented in Figure 9 and show the effects of the gradual expansion of simulated resources. The average (single CMG) speedups from LARC_C and LARC_A are ≈ 1.9x and ≈ 2.1x, respectively, with some applications reaching ≈ 4.4x for LARC_C and ≈ 4.6x for LARC_A.
As expected, most benchmarks benefit from the additional cores and cache capacity, most prominently MG-OMP, which gains a small speedup of ≈ 1.3x from the extra cores, a ≈ 2x speedup from the extra cache, and reaches a ≈ 4.6x speedup with 512 MiB cache and higher bandwidth. Comparable incremental improvements with all three architecture steps are observable in other workloads, such as TAPP kernels 7 (DifferOpVer) and 17 (MatVecSplit), which scale well on multiple cores and are memory-bound, since they benefit from both the additional cores and the cache capacity. TAPP kernels 19 and 20, XSBench, roms, and imagick (SPEC OMP) show similar gains in runtime, but the difference between LARC_C and LARC_A is smaller, implying that the problem size either fits into the 256 MiB L2 (e.g., XSBench) or the workload arrives at a point of diminishing returns from the 2x larger cache. TAPP kernels 8, 9, 12-15, and FT-OMP suffer a slowdown from cache contention in A64FX_32. LARC_C and LARC_A avoid the cache contention, resulting in speedups similar to the benchmarks discussed earlier. EP-OMP, CoMD, and other compute-bound benchmarks benefit only from the higher core count, with both LARCs providing a similar speedup as A64FX_32.
Expectedly, single-threaded workloads (all of PolyBench's benchmarks) show little to no improvement over A64FX_S, i.e., they do not benefit from more cores. However, these benchmarks also do not show a performance gain from a larger 3D-stacked L2 cache, even though their working set sizes exceed A64FX_S's 8 MiB L2 yet fit into LARC's larger cache. We only see a limited speedup of (GM=)4.3% across all of them and no noteworthy outliers, and hence omit them in Figure 9. We attribute other outliers, such as the slowdown of imagick (SPEC-CPU), to similar intrinsic properties of the benchmarks: our testing on a real A64FX reveals that imagick has a sweet spot at 8 OpenMP threads, and scales negatively thereafter; and the TAPP kernels 3-6 and 18 were customized for the 12-core A64FX CMG and cannot run effectively on 32 threads without a rewrite. Hence, we limit gem5 to 12 cores for these TAPP kernels, and we see that only the MatVecDotP kernel of the ADVENTURE application [4] benefits from a larger L2. Further proxy-applications and benchmarks missing from Figure 9, yet appearing in Figure 6, are the unfavorable result of persistent, repeatable simulator errors, sometimes occurring after months of simulation.
We should note that in some cases the benchmarks' implementation and the quality of the compiler may skew the results; for instance, BabelStream measures memory bandwidth on a 2 GiB buffer. Being unoptimized for A64FX, BabelStream's baseline underperforms in terms of per-core bandwidth (compared to the STREAM Triad tests in Figures 7a and 7b), which in turn results in a performance gain when the number of cores increases to 32.
Overall, the speedup on A64FX_32 can originate from the following reasons: (i) the program is compute-bound (a valid result); (ii) the workload exhibits both compute-bound and memory-bound tendencies in different components of a proxy-application (a valid result); (iii) the program is highly latency-bound, and hence the speedup can be the result of the larger aggregate L1 cache (a valid result); or (iv) a poor baseline resulting in a slightly misleading result.
We confirm the validity of attributing improvement to the high-capacity L2 by inspecting the L2 cache-miss rates of our gem5 simulations (with the miss rates of some selected examples listed in Table 3). The reduction in cache-miss rates reported in the table is consistent with the performance improvements we observe in Figure 9.

Summary of the Results
Our gem5 simulations indicate that more than half (31 out of 52) of the applications experienced a larger than two times speedup on LARC_A compared to the baseline A64FX_S CMG. For over two-thirds (24 out of 31) of these applications, the performance gains are directly attributed to the larger (3D-stacked) cache, i.e., with at least 10% gain by either of the two LARC configurations over the A64FX_32 variant. Most notably, out of all the RIKEN TAPP kernels that experienced meaningful speedup on LARC, a majority benefited from the expanded cache, rather than the increase in number of cores. This carries particular importance as these kernels are highly tuned for A64FX.
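The attribution rule used in this summary can be written down compactly. The sketch below is one possible reading of that rule, with all speedups taken relative to the baseline CMG; the function name, the category labels, and the three example inputs are hypothetical, while the thresholds (2x and 10%) and the application counts come from the text.

```python
def classify(a64fx32: float, larc_c: float, larc_a: float,
             large_gain: float = 2.0, cache_margin: float = 1.10) -> str:
    """Classify one application from its speedups over the baseline CMG.

    a64fx32: speedup with 32 cores but the baseline cache
    larc_c / larc_a: speedups of the conservative / aggressive LARC variant
    """
    if larc_a <= large_gain:
        return "below 2x"
    if max(larc_c, larc_a) >= cache_margin * a64fx32:
        return "cache-driven"
    return "core-driven"


# Hypothetical speedup triples, NOT taken from Figure 9.
print(classify(1.3, 2.0, 4.6))    # large gain, mostly from the cache
print(classify(2.6, 2.6, 2.65))   # large gain, already reached with more cores
print(classify(1.0, 1.0, 1.04))   # no large gain

# Shares quoted in the text.
print(f"{31 / 52:.1%} of applications exceed 2x")
print(f"{24 / 31:.1%} of those are attributed to the cache")
```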

DISCUSSION AND LIMITATIONS
In this study, we simulated a single LARC CMG in gem5, and its potential future effect on common HPC workloads.

The Prospect of LARC
In reality, if a LARC processor were introduced in 2028, it would contain 16 LARC CMGs, which correspond to the same silicon area as the current A64FX CPU, and it is important to understand what impact such a processor would have on the HPC community and its applications. Unfortunately, it is hard to give a conclusive answer to such a forward-looking question today. However, if we assume ideal scaling of both A64FX and LARC CMGs and compare at the full chip level, then a LARC system in 2028 could give between 4.91x (xz; SPEC CPU) and 18.57x (MG-OMP; NPB) performance improvements over the current A64FX processor, with an average improvement of (GM=)9.56x for applications that are responsive to larger cache capacity. For applications that do not yet benefit from a larger cache, future studies should (continue to) consider algorithmic improvements [69], as well as investigate the potential of allocating parts of the cache to vary compute capabilities, for example, processing-in-memory [12] or alternative compute modules, e.g., CGRAs [89].
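The chip-level projection can be illustrated with a short calculation. This sketch assumes that, under ideal scaling, the chip-level gain equals the per-CMG speedup multiplied by the ratio of CMG counts; the A64FX CMG count is derived from its 32 MiB of total L2 at 8 MiB per CMG, and the per-CMG input of 4.6x is the rounded MG-OMP value, so the result differs slightly from the unrounded 18.57x.

```python
A64FX_TOTAL_L2_MIB = 32
A64FX_L2_PER_CMG_MIB = 8
A64FX_CMGS = A64FX_TOTAL_L2_MIB // A64FX_L2_PER_CMG_MIB  # 4 CMGs per chip
LARC_CMGS = 16                                           # same silicon area


def chip_level_speedup(per_cmg_speedup: float) -> float:
    """Ideal-scaling projection: per-CMG speedup times the CMG-count ratio."""
    return per_cmg_speedup * LARC_CMGS / A64FX_CMGS


def per_cmg_speedup(chip_speedup: float) -> float:
    """Inverse of the projection, to recover the per-CMG value."""
    return chip_speedup * A64FX_CMGS / LARC_CMGS


print(f"CMG-count ratio: {LARC_CMGS // A64FX_CMGS}x")
print(f"4.6x per CMG  -> {chip_level_speedup(4.6):.1f}x per chip")
print(f"9.56x per chip -> {per_cmg_speedup(9.56):.2f}x per CMG")
```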

Considerations and Limitations
Our MCA-based estimation framework only gives a first-order approximation for a hypothetical CPU with sufficiently large L1 cache to host the entire data structures of a specific workload. This approach has some advantages and disadvantages and should be used with caution, but it also has capabilities which we have not yet detailed, such as estimating the runtime of the same binary/workload for different (ISA-compatible) x86 systems by simply replacing the MCA target architecture and altering the CPU clock frequency.
We emphasize that we run applications as they are, i.e., without any algorithmic optimizations for the larger last-level cache, in our MCA- and gem5-based simulators. This is also true of our motivating experiment shown in Figure 1. While the cache capacity of AMD's Milan-X CPU is about three times that of Milan, it is far from what we envision in 2028. Hence, our Milan-X results serve as a first-order indication of what SRAM, in its currently available state of the art, can offer.
Another notable aspect, which is outside the main scope of this extrapolation study, is the heat dissipation of CPU cores in the face of the 3D-stacked cache. It has been reported that AMD's Milan-X carefully stacks caches above areas of the chip that are not used for compute, i.e., mostly above caches [77]. Our assumption is that, by 2028, manufacturing technologies will have advanced enough to overcome this limitation. Yet, for interested readers, we provide further details on thermal and power estimates for our hypothetical LARC CPU in Section 2.6.

RELATED WORK
Stacked Memory and Caches: The size of the LLC has increased for the last 25 years [58], a trend anticipated to continue into the future. Yet, 2D ICs are becoming hard to exploit for additional performance, despite recent attempts by IBM [27,59]. However, 3D-stacking is becoming a promising alternative [52], as demonstrated by AMD's 3D V-Cache [40], Samsung's proposed 3D SRAM stacking solution [64] based on 7 nm TSVs, or the most recent study of 7 nm TCI-based 2- and 4-layer SRAM stacks by Shiba et al. [97]. Moreover, academics explored 3D-stacked DRAM cache [48,119], but these incur much higher latency and power consumption [74,98]. Non-volatile memory is considered as an LLC alternative, yet it suffers from similar latency issues [63]. Lastly, NVIDIA applied for a patent of an 8-layer memory stack fused with a processor die [28], theorizing a 50x improvement in bytes-to-flop ratio. However, what distinguishes our work from that of our peers is: (i) we focus on the real-world impact of future caches, several magnitudes larger than those found today.
Performance modeling tools and methodologies: Computer architecture research is often based on simulators, such as the Structural Simulation Toolkit (SST) [95] or CODES [24], for efficiently evaluating and optimizing HPC architectures and applications. The gem5 simulator, by Binkert et al. [13], is widely used by academia and vendors for micro- and full-system architecture emulation and simulation. It supports validated models for x86 [7] and Arm [62]. We refer the interested reader to www.gem5.org/publications/ for a comprehensive library of gem5-based research and derivative works. However, what distinguishes our work from that of our peers is: (ii) unlike prior work that utilizes (relatively) small kernels, our work operates on large-scale MPI/OpenMP-parallelized proxy-applications in order to quantify the impact of caches on realistic workloads. To our knowledge of reported research-driven gem5 simulations, this is the largest scale of cycle-accurate simulations conducted in terms of the aggregate number of instructions simulated ($6.08 \times 10^{13}$).
Other methods, such as MUSA by Grass et al. [45], are closer to our MCA-based approach, since MUSA uses PIN, which is the basis for Intel SDE (used in this study), but they focus on MPI analysis and multi-node workloads. We are not the first to utilize Machine Code Analyzers, see [2,65] and derivative works such as [8,19,22,25,72,91]. However, what distinguishes our work from that of our peers is: (iii) instead of estimating accurate performance of existing system architectures, our MCA-based approach tries to gauge the upper bound in obtainable performance, and exposes bottlenecks better than the roofline approach, for common HPC applications.

CONCLUSION
We aspire to understand the performance implications of emerging SRAM-based die-stacking on future HPC processors. We first designed a methodology to project the upper bound that an infinitely large cache would have on relevant HPC applications. We find that several well-known HPC applications and benchmarks have ample opportunities to exploit an increased cache capacity.
We further expand our study by proposing a hypothetical processor (called LARC) in 1.5 nm technology. This processor would have nearly 6 GiB of L2 cache memory, compared to our baseline A64FX_S CPU architecture with 32 MiB L2 cache. Next, we exercise a single LARC CMG with a plethora of HPC applications and benchmarks using the gem5 simulator and contrast the observed performance against the existing A64FX_S CMG. We find that the LARC CMG would (on average) be 1.9x faster than the corresponding A64FX_S CMG, albeit consuming 1/4th of the area. When area-normalized to the real A64FX CMG (by assuming optimistic ideal scaling), we can expect to see an average boost of 9.56x for cache-sensitive HPC applications by the end of this decade.
Finally, we expect that the larger caches will motivate and facilitate algorithmic advances that, in combination with the abundant cache, can potentially yield an order of magnitude gain in performance, as demonstrated by the tile low-rank (TLR) approximations [69]. These approaches, however, require a minimum size of the cache to reach their fullest potential. We firmly believe that the combination of high-bandwidth, large, 3D-stacked caches and algorithmic advances is the path forward for the next generation of HPC processors when attempting to break the "memory wall".

FAIR COMMITMENT BY THE AUTHORS
We developed a framework of scripts and git submodules to manage the R&D of LARC, to set up the benchmarking infrastructure, and to perform the simulations. After cloning our repository https://gitlab.com/domke/LARC (or downloading the artifacts from https://doi.org/10.5281/zenodo.6420658), one has access to all benchmarks (see Section 3.3), patches, scripts, and our collected data. Only minor modifications to the configuration files should be necessary, such as changing host names, paths to compilers, or downloading licensed third-party software, before testing on another system. If users deviate from our OS version (CentOS Linux release 7.9.2009, and intel_pstate=disable kernel parameter), then some additional changes might be required.

Fig. 3. Difference between A64FX's Core Memory Group (CMG) and a LARC CMG in various performance-governing parameters; Most notable (for our study) is the 48x increase in per-CMG L2 cache capacity; Note: despite appearing similar in the figure, the LARC CMG is, in fact, four times smaller.

Fig. 4. Illustration of our runtime estimation pipeline with the MCA-based tool for an accumulative kernel executed with  = 42; Dotted line: branch not taken; Solid line: kernel execution as recorded by SDE; Edges in directed CFG annotated by number of jumps between basic blocks; Details in Section 3.1

Table 1. System configurations for the benchmarked AMD EPYC 7763 Milan and 7773X Milan-X (for more details, see Zen 3 microarch.)

Table 2. Chip area and simulator configurations for gem5
The conservative option, called LARC_C, is limited to 256 MiB L2 cache at ∼ 800 GB/s, while the aggressive version, LARC_A, doubles both values, to 512 MiB and ∼ 1.6 TB/s, respectively.