SPEChpc 2021 Benchmarks on Ice Lake and Sapphire Rapids Infiniband Clusters: A Performance and Energy Case Study

In this work, fundamental performance, power, and energy characteristics of the full SPEChpc 2021 benchmark suite are assessed on two different clusters based on Intel Ice Lake and Sapphire Rapids CPUs using the MPI-only codes' variants. We use memory bandwidth, data volume, and scalability metrics in order to categorize the benchmarks and pinpoint relevant performance and scalability bottlenecks on the node and cluster levels. Common patterns such as memory bandwidth limitation, dominating communication and synchronization overhead, MPI serialization, superlinear scaling, and alignment issues could be identified, in isolation or in combination, showing that SPEChpc 2021 is representative of many HPC workloads. Power dissipation and energy measurements indicate that the modern Intel server CPUs have such a high idle power level that race-to-idle is the paramount strategy for energy to solution and energy-delay product minimization. On the chip level, only memory-bound code shows a clear advantage of Sapphire Rapids compared to Ice Lake in terms of energy to solution.


INTRODUCTION AND RELATED WORK
Modern HPC systems and programming models are becoming more complicated, heterogeneous, and diversified, making it difficult to evaluate performance and aim for performance portability.This necessitates a carefully crafted benchmark collection to design the hardware and software stacks of future large-scale systems.
Contributions.This work's primary contributions are as follows: We provide an overview of the full SPEChpc 2021 benchmark suite in MPI-only mode and provide performance and energy metrics on ccNUMA domain, node, and multi-node levels on two clusters with different generations of modern Intel server CPUs.We further pinpoint scalability and performance issues and identify their root causes, demonstrating the value of fundamental resource metrics like data volume and bandwidths.Finally, we show that analyzing power dissipation and energy consumption requires a clear distinction between memory-bound and non-memory-bound codes and that the minimization of energy-delay product and energy are dominated by chip idle power and code scaling characteristics.
Overview.This paper is organized as follows: We introduce the SPEChpc 2021 Benchmarks in Sect. 2 and describe our experimental setup and methodology in Sect.3. We then discuss SPEChpc 2021 parallel benchmark results: In Sect. 4 we focus on node-level performance, power, and energy using the tiny workloads.Similarly, Sect. 5 uses the small workloads for multi-node analysis.Finally, Sect.6 summarizes the paper and gives an outlook to future work.

SPECHPC 2021 BENCHMARKS
The SPEChpc 2021 collection, released in October 2021, covers a wide spectrum of science and engineering programs that are representative of HPC workloads and are portable across CPUs and accelerators.It aims to be the industry standard for assessing the efficiency of parallel computing workloads on heterogeneous platforms.The benchmarks are accessible on the SPEC website for non-profit use.
Implementation.Table 1 lists the names of all nine benchmarks, their input configurations for tiny and small workloads, their programming language, the lines of code, and the employed collective communication primitives.Table 2 briefly outlines the numerical information and the prospective application domain of benchmarks.The SPEChpc 2021 benchmarks use multiple programming languages (Fortran, C, and C++) and parallel programming models (MPI, MPI+OpenACC, MPI+OpenMP, and MPI+OpenMP with target offload) and are integrated with a benchmarking harness to ensure results correctness and sensible reporting.
Workload suites.To meet the need for different system sizes, the four suites "tiny", "small", "medium" and "large" allow running the benchmark suite from one to hundreds of nodes.The SPEC benchmarks concentrate on compute-intensive parallel performance.Each benchmark distributes the same workload over any number of active processes or threads (strong scaling).The {tiny, ¶ The automatic generation of the input conditions for all provided particles has been added to the source code for testing purposes.
⋆ The finest grid comprises boxes of size 323 grid points and * a total of 512 3 (tiny suite) and 1024 3 (medium suite) grid points.
§ Models: (1) Colliding Thermals, (2) Rising Thermals, (3) Mountain Gravity Waves, (4) Turbulence, (5) Density Current, (6) Injection small, medium, large} workloads utilize up to {0.06, 0.48, 4, 14.5} TB of memory and are, according to the documentation, designed to run on clusters using {1-256, 64-1024, 256-4096, 2048-32768} processes, respectively.SPEChpc 2021 benchmark set-up.This work focuses on the MPI versions of the SPEChpc benchmarks on CPU-only systems.Our goal is not to achieve best performance by choosing the appropriate programming model but rather to pinpoint peculiarities of the various benchmark codes.MPI+X hybrid-parallel will surely have their own set of issues and warrant their own investigation.Furthermore we expect some insights from MPI-only variants to be relevant on larger scales and hybrid programming models, such as the prime number problem or the cutting problem; see Sections 4 and 5.
We only include findings for the tiny and small workloads since the medium and large workloads are only supported by six out of the nine benchmarks.Single-node performance is examined first, using the tiny workloads; after that we turn to multi-node performance using the small workloads and up to 1664 MPI processes on both clusters.For additional information we refer to our detailed performance data artifact appendix 2 .

HARDWARE-SOFTWARE SETUP
The hardware and software environments employed for all experiments are shown in Table 3.Two Intel-based InfiniBand (HDR-100) clusters were at our disposal:    Trace Analyzer and Collector (ITAC) tool 4 was used for visualizing MPI event traces.Since RAPL measurements vary across nodes, all benchmarks were run on the same node for the node-level analysis of Sect. 4. The working sets of the tiny or small suites were at least ten times the size of the last-level cache of one node, which prevented it from fitting into the available cache 5 .Nevertheless, cache effects could be observed in multi-node scaling for some of the codes.Memory bandwidths were determined using the ratio of memory data volume to wall-clock time.Before performing the measurements, at least two warm-up time steps, including global synchronisation, were conducted to allow the MPI runtime to stabilize and eliminate first-call overhead.To account for variations in runtime, we repeated code executions several times and only statistically significant deviations were reported.

NODE-LEVEL ANALYSIS
The "tiny" workload suite was used for node-level analysis.We first examine each code's scalability, memory-boundness, vectorization, and the underlying causes of scaling issues.We then determine the impact of these findings on power and energy consumed by "hot" and "cold" codes and the relevance of the baseline (idle) power and energy-delay product.

Performance and speedup
In this section we show how the scalability, vectorization, process timeline, memory bandwidth, and data transfer volume can be used to find the underlying causes of non-ideal scalability.On both clusters, lbm and minisweep show reproducible fluctuations in scalability which makes the speedup across domains less meaningful.The superlinear scaling for weather on ClusterB is caused by cache effects as the Sapphire Rapids CPU has significantly more aggregate outer-level cache; this shows the fact that the non-memory-bound weather benchmark still contains some memory-intensive kernels.
4.1.2Performance.Adopting a Roofline-like view of hardwaresoftware interaction, performance metrics allow to compare application performance differences with hardware properties like peak performance and memory bandwidth.According to Table 3, comparing ClusterB with ClusterA the ratio of peak performance and memory bandwidth is 1.2 and 1.5 respectively.We therefore expect a node of ClusterB to be 1.2 to 1.5 times faster than a node of ClusterA, depending on whether the code is compute bound or memory bound.From Figure 1(b-c, e-f) we can read the following actual performance ratios: In a number of applications, the speedup of Sapphire Rapids over Ice Lake exceeds the expected ratio.This can be attributed to architectural enhancements introduced in Sapphire Rapids, such as the larger L2 and L3 caches and L3 bandwidth 6 .This applies to both memory-bound and non-memory-bound codes, as the latter can still be cache sensitive.4.1.3Vectorization.The vectorization (SIMD) analysis on both systems is shown in Figures 1(b-c) and (e-f) separately for memorybound and non-memory-bound codes.The chosen compiler flags allow the compiler to utilize AVX-512 instructions, and all nine benchmarks primarily use them on both CPUs.From the data we can extract the following vectorization ratios, which we define as the ratio of actual numerical work (flops) done with SIMD instructions to the overall numerical work: The percentage of vectorized work is similar on both systems.The memory-bound cloverleaf and pot3d codes and the most compute-intensive lbm show the highest vectorization ratio.However, the memory-bound tealeaf and the non-memory-bound soma code are poorly vectorized.Looking at the memory bandwidth of weather (see next section), it is probable that it might become fully memory bound if it could be efficiently vectorized.
4.1.4Bandwidth.Figure 2(a-b) presents node memory bandwidth measurements of all benchmarks on ClusterA and ClusterB.Five of the nine benchmarks (hpgmgfv, cloverleaf, tealeaf, pot3d, weather) draw a significant fraction of the available memory bandwidth of the node, with only the first four actually achieving the saturated memory bandwidth on a ccNUMA domain (75-78 GB/s for ClusterA and 58-62 GB/s for ClusterB).Among these four, the strongly saturated pot3d, cloverleaf, and tealeaf show strong saturation patterns, whereas hpgmgfv is only weakly saturating and becomes less memory-bound with more cores.weather's memory bandwidth behavior on a ccNUMA domain indicates the presence of memory-bound and non-memory-bound kernels, with the latter dominating the runtime on both systems.In memory-bound codes, the fact that L3 cache on ClusterA has a greater bandwidth than L2 (124 GB/s vs. 80 GB/s for pot3d) indicates that L3 is a victim cache (with memory prefetchers enabled) and sees additional traffic coming down from L2; see orange data points in Figure 2(c-d).The non-memory-bound lbm code includes a strongly memory-bound "propagate" kernel performing sparse memory accesses and a "collide" kernel with about 6600 floating-point operations per lattice site update and consequently high intensity.In contrast to minisweep, the performance scaling for lbm (see Figure 1) shows large fluctuations with clear upper and lower limits.In-cache effects can not be observed in memory or L3 data volume; see Figure 2(e-h).The drop in performance is not always accompanied by excess L2 data volume, indicating several overlapping effects.For instance, L2 data traffic peaks at {22, 23, 31 and 45} processes but poor performance is observed also with {47-50, 68-71} processes.Powers of two in both the x and y directions, such as 4096 and 16384, are particularly susceptible to these fluctuations.The 44 and 45 processes are not that dissimilar from a domain decomposition and layer condition point of view, but they differ greatly in terms of L2 data transfer and traffic.Further, in addition to drawing huge L2 data volume, runs at 22 and 23 processes also consume more L3 and memory bandwidth, which is an indication of cache thrashing effects.More L1 misses draw more data from L2 , but there is headroom in the L2 bandwidth (400 GB/s at 45 processes versus 100 GB/s at 44 processes; see Figure 2(d)), indicating that L2 is not the bottleneck but rather the L1 cache.Since TLBs are extremely sensitive to alignment problems, several parallel data streams in a Structure of Array (SoA) memory-layout for bm may cause problems, since many concurrent data streams hit different pages, leading to a shortage of TLB entries; L1 cache bank conflicts are also a possible culprit.Such issues are typically reflected in certain processes being slower if the local domain size is unfortunate.This shows, e.g., in the ITAC timeline for 71 processes on ClusterA (inset of Figure 2(h)), where performance is about 33% smaller than on 72 processes due to process 70 being significantly slower, leading to extra waiting times on the others.Upshot: The SPEChpc 2021 suite covers a wide range of scalability, vectorization, and memory-and non-memory-boundedness patterns on the node level.The speedup of Sapphire Rapids vs. Ice Lake is within the range expected from peak performance and bandwidth improvements together with larger LLCs.minisweep suffers an up to 75% performance hit due to MPI serialization, while lbm shows fluctuating performance with varying process count due to multiple data alignment issues.

CPU and DRAM Power
In Figure 3 we show CPU and DRAM power for all benchmarks on the node level.In (a) and (c) we choose a representation that allows to identify different scaling patterns and their impact on power dissipation: CPU and DRAM power are plotted against speedup (single-core baseline) up to the first ccNUMA domain boundary (half a socket on ClusterA, a quarter socket on ClusterB).Saturating, scalable, and erratic scaling behavior can be clearly discerned in this way.Even with only a single ccNUMA domain populated, 90%-95% of the total power is consumed by the CPU compared to only about 10% (ClusterA) and 5% (ClusterB) by the memory modules.As anticipated and shown in (b) and (d), going from one socket (up to 36 and 52 processes on ClusterA and ClusterB) to two sockets results in a two-fold increase in maximum power.In accordance with a naive CPU and DRAM power model, on-chip and DRAM power grows linearly with the number of active cores until a bottleneck is hit, after which additional inactive cores wait for memory, and therefore the slope of on-chip power still grows but more slowly, without additional performance.DRAM power becomes constant after the memory bandwidth has saturated.However, it also depends on actual memory access pattern, such as continuous vs. burst mode, manufacturing process, DIMM organization, and number of concurrent data streams.

Hot and cool benchmarks.
There are clearly "hot" and "cool" SPEChpc benchmarks with high and low per-CPU power dissipation.The hot benchmarks come close to the TDP of both systems.For instance, on ClusterA and ClusterB, sph-exa achieves 98% and 97% of the socket TDP (244 W and 333 W), while soma reaches only 89% and 85% (222 W and 298 W); see Figure 3.
The low-intensity memory-bound benchmarks {pot3d, tealeaf, cloverleaf} attain the highest DRAM power (16 W on one cc-NUMA domain on ClusterA and 10-13 W for distinct ccNUMA domain on ClusterB), which remains constant for saturated memory bandwidth and demonstrates how the DRAM's power consumption is strongly tied to the memory bandwidth utilization.Conversely, for the high-intensity non-memory-bound {sph-exa, lbm, minisweep, soma} benchmarks, most power is drawn in the computational units and the cache hierarchy, but their DRAM power is limited, hitting a low of 9.5 W and 5.5 W minimum DRAM power for soma on one ccNUMA domain of ClusterA and ClusterB.4.2.2Fluctuating performance impact on power.The different power behavior of the lbm and minisweep codes is worth noting.With fluctuating performance, lbm has lower power than minisweep.Also the performance drops with lbm are associated with slight power drops while in minisweep more cores also lead to more power.This is due to the different reasons for the drops in both codes (MPI waiting time in minisweep vs. slow execution in lbm).

4.2.3
Comparison of system power.On both systems, the general behavior with respect to power dissipation is similar; see Fig. 3.The memory-bound and non-memory-bound benchmarks on Clus-terB consume 40% and 25% more on-chip power than on ClusterA, respectively.On the two recent architectures considered here, the extrapolated zero-core baseline for chip power is now substantially higher than that of earlier architectures, being around 50% of the 350 W TDP on Sapphire Rapids (176-181 W) and 40% of the 250 W TDP on Ice Lake (95-101 W).To put this into perspective, on the Sandy Bridge server architecture from 2012, baseline power only accounted for less than 20% of the 120 W TDP [2,19].
The DDR5 memory on ClusterB is more power efficient and has significantly less impact on total power than DDR4 on ClusterA despite its larger size.It employs a lower voltage and half-rate analog and digital clocking, i.e., DDR5 can achieve the same data transfer rate with half clock frequency [27].Upshot: A part of the SPEChpc 2021 benchmarks attain power dissipation close to TDP.The latest generation of processors exhibit a substantial idle power and a rather cool DDR5 RAM with half-rate clocking.

CPU and DRAM Energy to solution
Energy to solution, along with the energy-delay product (EDP) is a primary metric for energy efficiency.In order to study the relevant parameter space and identify optimal operating points, the Z-plot [2] is most useful (see Fig. 4(a, b)).It relates energy to the speedup (or performance) of a code, with the amount of resources (number of cores here) as a parameter within a data set.In a Zplot, horizontal lines mark constant energy, vertical lines mark constant speedup (or performance), and lines through the origin mark constant EDP (the slope being proportional to the EDP).

Energy and EDP minimums.
In previous Intel architectures, reducing the energy to solution of memory-bound code involved concurrency throttling, i.e., reducing the number of active cores on a ccNUMA domain [19,30].On the latest designs, however, the baseline power is so dominating that using less than the full ccNUMA domain results in minor energy savings only.Moreover, the minimum energy and minimum EDP operating points are so close together as to be hardly discernible.This suggests that making code faster ("code race-to-idle") is now the primary means of energy reduction.If strong performance fluctuations are present, the raceto-idle rule calls for avoiding low-performance operating points.This is clearly visible in Fig. 4(c) for the lbm and minisweep codes.the bottleneck, both systems should exhibit comparable energy to solution because the 50% higher memory bandwidth on ClusterB makes up for the 40% higher power.However, if the core performance is the bottleneck, then the newer CPU would be less efficient because the 40% increase in power is not offset by a 20% gain in core performance of Sapphire Rapids over Ice Lake.It goes without saying that the rest of the system (board, network, disks) is ignored in this assessment.
Upshot: High baseline power on the new Intel CPUs prevents energy from significantly increasing for saturated bandwidth, bringing the  and EDP minimums closer and making runtime equivalent to energy.On a socket basis, Sapphire Rapids only pays off energy-wise for memory-bound workloads.

MULTI-NODE ANALYSIS
For large process counts to benefit more from increased workload, the "small" workload suite was used for multi-node analysis.
Communication routines.The runtime breakdown obtained using ITAC was consistent with [11].Under strong scaling, all benchmarks suffer from significant communication overhead.In decreasing order, {soma, tealeaf, pot3d, sph-exa, cloverleaf, hpgmgfv} heavily employ reductions via MPI_Allreduce.Point-topoint communication is the dominant contribution to communication overhead in {weather, minisweep, hpgmgfv, cloverleaf, sph-exa and pot3d} (again in decreasing order).MPI_Barrier takes significant time in lbm.However, it could be avoided because it is only used to synchronize processes at the end of each iteration.

Strong-scaling performance
The results from multi-node strong scaling experiments shown in Fig. 5 indicate that two antagonistic effects determine the scaling behavior: communication overhead and memory data volume (specifically that a major part of the working set fits into cache).The following table summarizes which scaling behavior can be attributed to which cause(s) for the different benchmarks: In Fig. 5(b, e) we show the per-node memory bandwidth.Perfect performance scaling with no cache effects would show as a horizontal line here.All benchmarks except soma exhibit a declining per-node bandwidth, which either indicates poor scaling due to communication overhead (or load imbalance) or a combination of cache effects which reduce the memory data volume.This is why we also show the overall memory transfer volume in Fig. 5(c, f).A rise in data traffic over the optimal horizontal line signals effects such as data replication.The similar tendency in code scaling for node-level "tiny" and cluster-level "small" workloads is caused by a mild change in the problem size per node as both are increased.

Four fundamental scaling patterns.
In the following we describe the different patterns observed in the multi-node scaling.
Case A: Cache effect prevails over communication overhead.With increasing node count, weather shows a significant decrease in memory data volume and bandwidth on both clusters (but stronger on ClusterB).This cache effect dominates and leads to superlinear scaling.The difference between the clusters is due to the fact that ClusterB has 1.45 times more L3 per core and 1.6 times more L2 cache per core than ClusterA, so the working set of weather can fit earlier into the cache of ClusterB's CPUs.
Case B: Communication overhead and cache effects balance out.The superlinear scaling due to cache effects (Case A) can be counteracted by increasing communication volume or synchronization overhead, to the point of causing linear scaling.The codes weather on ClusterA and tealeaf on both systems fall under this category.
Case C: Communication overhead dominates over cache effect.In this case, memory traffic drops with increasing node count, but the anticipated super-linear scaling is outweighed by high communication overhead.On the cluster level, hpgmgfv falls under category.The increasing communication cost is caused by point-to-point communication and reductions.
Case D: No cache effect; only communication overhead.In this case, significant MPI communication is the only factor contributing to the poor scaling, where the memory data volume remains the same and the memory bandwidth declines.The cloverleaf and soma codes on both systems fall under this category.Further, data   set size affects communication overhead, i.e., smaller data sets have a higher likelihood of having significant communication overhead.The poor scaling observed in {minisweep, soma, sph-exa} is caused by a confluence of large MPI communication {blocking pairwise, MPI_Allreduce, both} and a comparatively small data set; see Fig. 5(c, f).

5.1.2
Intriguing non-memory-bound case of soma.Out of the nine benchmarks, soma is the one that spends the majority of its total time in MPI reductions.Beyond one to three nodes, soma does not scale effectively.However, with increasing node count it draws significant and increasing memory bandwidth, up to about half the maximum on ClusterA and up to 33% of the maximum on ClusterB.Hence, the memory bandwidth per node increases while scaling is poor, which constitutes an unusual pattern.The memory data volume sheds more light on this issue: A perfect linear rise in aggregated data traffic vs. number of nodes on both systems indicates that soma appears to have a lot of replicated data; as long as the code remains non-memory-bound, this might be of minor significance.However, given a linear rise in memory traffic and a logarithmic rise in reduction overhead, the question arises whether at some point (i.e., number of nodes) the code might become memory bound.
Our measurements demonstrate that this can not happen with soma at least within the "small" working set and number of processes for which it is officially designed: The per-node memory bandwidth initially increases to 150 GB/s (far below the limit of 350 GB/s), and then remains essentially constant at 150 GB/s, at which point the code ceases scale at all.

Comparison of cluster performance.
The two clusters' interconnects are identical so differences in communication performance are not expected.The fundamental distinction of ClusterB's node is that it features more cores, more cache per core, higher memory bandwidth, and higher machine balance (memory bandwidth to peak ratio).Our findings reveal that the tendency in code scaling, whether it is poor or good, is qualitatively consistent across clusters.However, the superlinear multi-node scaling of weather is stronger on ClusterB due to its larger cache.The scaling of cloverleaf is slightly worse on ClusterB due to the higher single-node baseline (250 Gflop/s vs. 160 Gflop/s, owing to memory bandwidth).Similarly, sph-exa's significantly inferior scaling efficiency on ClusterB is caused by the 47% higher performance on a single node compared to ClusterA (6.2 Gflop/s vs. 4.2 Gflop/s).
Upshot: Based on two antagonistic effects, namely communication overhead and cache effects (measured using fundamental resource metrics), all SPEChpc 2021 benchmarks fall under four fundamentally distinct categories.Especially interesting cases are soma with its excess memory traffic with rising process count and weather with its strongly superlinear scaling.

Scaling impact on power and energy
Figure 6(a, c) shows total power dissipation scaling on multiple nodes.The codes of the suite attain 74-85% (ClusterA) and 63-76% (ClusterB) of the CPU TDP limit on the full set of nodes (5.9-6.8 kW out of 8 kW on ClusterA and 7.1-8.5 kW out of 11.2 kW on ClusterB).The baseline power of the coolest code dominates the dynamic power, with a share of 82% (5.8 KW vs. 1.3 KW) on ClusterB and 53% on ClusterA (3.1 KW vs. 2.8 KW).The poorly scalable benchmarks {minisweep, soma, sph-exa} require more resources (node-hours) with increasing node count, which leads to rising energy consumption (see Fig. 6(b, d)).Scalable codes such as tealeaf exhibit constant energy as anticipated.For soma, the overall energy rises linearly up to three nodes but with a steeper slope beyond because of declining scalability.Upshot: Multi-node energy consumption scaling mainly depends on the scaling properties of a code; poorly scalable benchmarks always burn more energy when scaling out, with soma marking an especially interesting case.

SUMMARY AND FUTURE WORK
We provided an in-depth node-level and multi-node analysis of the MPI-only versions of the SPEChpc 2021 benchmarks with respect to power/energy and performance on clusters based on Intel Ice Lake and Sapphire Rapids CPUs.Using speedup, performance, data traffic, and bandwidth metrics we could categorize the codes with respect to their memory and communication boundedness, and we could uncover unusual scaling patterns that are rooted in excess replicated data, performance bugs in MPI communication, data alignment issues, and, in one case, counteracting superlinear scaling and communication overhead effects.One the node level, the overall performance advantage of Sapphire Rapids vs. Ice Lake is in line with expectations based on maximum memory bandwidth, peak performance, and cache size ratios.We showed that there is a 25% variation in power dissipation on the package level across benchmarks, but that the "hot" codes are able to come very close to the TDP limit on both CPUs.We also observed that idle power, i.e., the hypothetical power dissipation of the CPU with zero active cores, takes a much higher fraction of the overall power than in older CPUs.This has the important consequence that minimum energy to solution and energy-delay product operating points are practically identical, idling cores save negligible energy, and race-to-idle via code optimization becomes the pivotal energy reduction strategy.On Sapphire Rapids, DRAM power is measurably lower than on Ice Lake due to the new DDR5 technology; however, this reduction is all but insignificant considering the chip and full-system power dissipation.
In future work we will more thoroughly investigate optimization opportunities and further performance patterns (such as the multi-faceted fluctuations in lbm) as well as further parallelization approaches beyond pure MPI.Furthermore, we expect insight from studying desynchronization [3,5,[7][8][9] and idle wave phenomena [4,6] in those benchmarks that show mixture of memory-, compute-, and communication-bound behavior.

5 5 Figure 1 :
Figure 1: SPEChpc 2021 tiny suite performance on a node of ClusterA (top) and ClusterB (bottom).The shaded background marks the ccNUMA domains of each cluster.Two codes (lbm and minisweep) exhibit intriguing patterns that hold up across multiple runs on each system.(a, d) The speedup (min, max, average) on the first ccNUMA domain is shown in an inset.(b-c, ef) A well-vectorized code has a small difference between DP (actual performance) and DP-AVX (vectorized part only).

Figure 2 :
Figure 2: Node-level bandwidth and data volume behavior of (a-b, e-f) memory, (c, g) L3 cache, and (d, h) L2 cache for the SPEChpc 2021 tiny suite on both clusters.The background shading layers denote the ccNUMA domains.On ClusterA, timeline inset displays in (g) the minisweep time spent in MPI_Recv (red), computation (blue), and MPI_Send (yellow) for 59 processes, while in (h) the lbm time spent in MPI_Wait (red), computation (blue), and MPI_Barrier (yellow) for 71 processes.

4. 1 . 5
Consistent fluctuations in minisweep.The minisweep code shows consistent performance fluctuations with changing core count.Specifically, prime numbers and some other special numbers of processes such as {9, 26, 34, 51, 69} are detrimental for performance; see Figure 2.For example, on ClusterA, performance drops by 75% from 58 to 59 processes, where 75% of the time is spent in MPI_Recv, 5.5% in MPI_Sendecv, and 19.5% in computation.The root cause is a communication serialization performance bug as shown by the ITAC timeline of 59 processes in the inset of Figure 2(g).The minisweep code uses open boundary conditions and synchronous rendezvous mode (due to large messages).Traces show that every process sends to its top neighbor first; with open boundary conditions, only the top process in the chain does not have a top neighbor and can thus call MPI_Recv right away.Subsequently, the communication "ripples" through the processes, leading to massive MPI waiting times.4.1.6Consistent fluctuations in lbm.

Figure 3 :
Figure 3: SPEChpc 2021 tiny suite power dissipation of CPUs and DRAM (via RAPL); (a, c) on one ccNUMA domain, showing power vs. speedup, and (b, d) full intra-node node for both clusters.In (a, c), the baseline power was determined by extrapolating o zero cores (black dotted lines).In (b, d) the ccNUMA domains are indicated by the background shading layers.
‡Starting the Chebyshev method requires providing approximations of the minimum and maximum eigenvalues.⨿ This includes the Line of Codes (LOC) from HDF5 library as well.

Table 2 :
Numeric and domain data of SPEChpc 2021 suite.

Table 3 :
Key hardware and software attributes of systems.The expected clock frequency was verified with the likwid-perfctr utility, which was also used for reading hardware performance events.On ClusterB, we had to employ a beta version of the LIKWID suite that supports Sapphire Rapids CPUs.This work used the first official release (version 1.0.3) of the SPEChpc 2021 suite.Time-resolved Roofline plots 2 of the benchmarks were obtained using the ClusterCockpit monitoring framework [? ].The Intel , minisweep sph-exa, minisweep § Soma's scaling is constrained by the communication overhead before the excess memory issue becomes problematic.