CommBench: Micro-Benchmarking Hierarchical Networks with Multi-GPU, Multi-NIC Nodes

Modern high-performance computing systems have multiple GPUs and network interface cards (NICs) per node. The resulting network architectures have multilevel hierarchies of subnetworks with different interconnect and software technologies. These systems offer multiple vendor-provided communication capabilities and library implementations (IPC, MPI, NCCL, RCCL, OneCCL) with APIs providing varying levels of performance across the different levels. Understanding this performance is currently difficult because of the wide range of architectures and programming models (CUDA, HIP, OneAPI). We present CommBench, a library with cross-system portability and a high-level API that enables developers to easily build microbenchmarks relevant to their use cases and gain insight into the performance (bandwidth & latency) of multiple implementation libraries on different networks. We demonstrate CommBench with three sets of microbenchmarks that profile the performance of six systems. Our experimental results reveal the effect of multiple NICs on optimizing the bandwidth across nodes and also present the performance characteristics of four available communication libraries within and across nodes of NVIDIA, AMD, and Intel GPU networks.


INTRODUCTION
Communication networks on high-performance computing (HPC) systems are complex and diverse. The challenge for application and library developers targeting such machines is to understand a network's characteristics and the resulting performance implications in order to design or select tailored communication strategies for their programs [3,16].
In particular, HPC network architectures have become more hierarchical in an effort to sustain high bandwidth, low latency, and energy-efficient communication as aggregate compute has grown. HPC systems today emphasize fat-node designs with large numbers of accelerators per node, connected by a fast internal network in addition to the traditional network across nodes [5,8,9,37]. Applications that exploit these hierarchical networks can achieve significantly higher overall performance [23,33].
Traditional HPC network benchmark suites perform point-to-point (P2P) and collective (e.g., scatter and all-to-all) communications. These tests assume no hierarchy: all points are peers. In hierarchical systems, the performance of such tests depends on the locality of the endpoints in the communication hierarchy. For example, point-to-point tests give different results for endpoints within the same node vs. across different nodes.
Our key insight is that to characterize the performance of a hierarchical network, we should use groups of processors corresponding to the levels of that hierarchy. In particular, in addition to using either individual processors or all processors, we should also cover intermediate-sized groups of processors, such as all processors in a node. We can then develop cross-group communication patterns to evaluate performance across groups of communicating processors in a way that accurately represents the underlying network hierarchy. To this end, we introduce CommBench, an extensible framework for constructing benchmarks of nontrivial communication patterns, stress-testing communication layers, and identifying optimal communication configurations on hierarchical machines.
A significant challenge is that each system has a different number of GPUs and network interface cards (NICs) per node, and thus each system exhibits a different physical topology. Figure 1 shows example systems such as Frontier and Aurora, which consist of dual-die GPUs where each die is connected to the intra-node network in a heterogeneous way, yielding non-obvious performance behavior that can be explored with CommBench. To accommodate such a level of diversity in the systems of interest, we provide a unified parameterization of the group-to-group patterns and make them portable and scalable across systems with various architectures.
Moreover, communication libraries often exhibit different logical connectivity between processors than the underlying physical hardware connectivity. As a result, we observe different performance behaviors across implementations of the message-passing interface (MPI) [11,13,20,29] and other collective communication libraries [27], such as varying group-to-group performance even with the same physical topology. CommBench proposes several group-to-group patterns for straightforward comparison and summary of the performance differences between libraries. For additional patterns that fall outside our proposed patterns, CommBench provides an API so that developers can build custom microbenchmarks for their own communication patterns.
Our key finding is that the performance of libraries on hierarchical networks varies significantly across systems. Therefore, portable micro-benchmarking across systems is crucial for porting libraries and applications in a performant way. To address the problem, this paper makes the following main contributions:
• We propose group communication benchmarks for isolating performance characteristics at specific levels of the networking hierarchy. Our evaluation reveals performance characteristics by gradually increasing the load on GPU-to-NIC communications in a parameterized way.
• To compare the empirical results of our group-to-group communication patterns with theoretical limits, we derive analytical models for hierarchical topologies.
• The proposed abstractions allowed us to port i) multi-step, ii) group-to-group, and iii) application-specific microbenchmarks to significantly different hierarchical network designs.

*Each AMD MI250x [30] and Intel PVC [26] involves two processor dies, referred to as "graphics compute dies" or "tiles", which we count as separate GPUs in the rest of the paper.

OVERVIEW OF HIERARCHICAL NETWORKS
This section dissects the hierarchical network architecture of six current HPC systems that are summarized in Table 1.

Intra-Node Network Architecture
Modern systems involve heterogeneous node architectures with intra-node networks composed of i) a high-bandwidth interconnect for communication across GPUs (see Figure 1) and ii) a GPU-to-NIC interconnect enabling communication across nodes (see Figure 2). We first dissect the former and then the latter to understand contemporary HPC networks.

2.1.1 High-Bandwidth Links. The systems of interest (Table 1) involve various numbers of GPUs per node. These GPUs are connected by a high-bandwidth (sub)network with proprietary links, as depicted in Figure 1. The communication bandwidth and latency across GPUs depend on the link topology, which may differ significantly across systems. Some systems have uniform topologies, such as all-to-all (Figure 1 (a) Delta and Perlmutter) or star (Figure 1 (e) DGX-A100), which are easier to understand: in uniform networks, there is no hierarchy among GPUs, i.e., they are connected to each other with the same number and type of links and hence with the same bandwidth and latency. The performance of heterogeneous interconnects is less obvious. For example, the GPUs in Figure 1 (b) Summit, (c) Frontier, and (d) Aurora nodes are non-uniformly connected with different numbers and types of links. To reason about these networks, we consider conceptual affinity groups, discussed next.
In hierarchical networks, communication among GPUs with closer affinity has a lower cost than communication among "distant" GPUs. The levels of affinity correspond to the levels of the hierarchy. For example, Figure 1 (b) Summit, (c) Frontier, and (d) Aurora form two-level hierarchies with different affinity groups. In (b), the groups correspond to the half nodes, i.e., GPUs (0, 1, 2) and (3, 4, 5), where the bandwidth is higher within groups than across groups. In (c) and (d), the groups correspond to GPU pairs co-located in a single box, i.e., (0, 1), (2, 3), (4, 5), and so on. However, the boxes in (c) are connected in non-obvious ways: the number of communication links is embedded in the vendor's low-level software and is not part of the public interface.
2.1.2 GPU-to-NIC Associations. When GPUs communicate across nodes, they do so through the NICs. Therefore, understanding GPU-to-NIC associations is crucial for understanding inter-node communication. Figure 2 shows the physical and logical topologies between GPUs and NICs in each node of our systems. The physical topology refers to the hardware connections between devices, often via PCIe links and switches, while the logical topology refers to the software bindings between GPUs and NICs for moving data into or out of a node. For example, we have found that communication libraries associate a single specific NIC with each GPU, even when the GPU has the same physical connection to multiple NICs.
The logical GPU-to-NIC bindings vary across systems depending on the number of GPUs and NICs and their affinity in the subnetwork. These bindings are determined by the communication library implementation (e.g., MPI or NCCL) and, in our testing, were found to be static, except for a special case explained in Section 5. In our experiments, we use the default associations: (a), (c), and (e) are packed, (d) is round-robin, and (b) and (f) are bijective, i.e., one-to-one.

Network Hierarchy Across Nodes
Communication across nodes takes place on an external network interconnect, e.g., InfiniBand or Slingshot, where each node is connected to the network through multiple NICs [7,38]. The external network switches deliver data from the NIC associated with the sender GPU to the NIC associated with the receiver GPU. The topology of the network varies across systems, depending on their scale. For example, Summit (4,608 nodes) has a fat-tree topology, where the bandwidth across nodes is uniform. To reduce cabling cost, newer systems such as Frontier (9,472 nodes) and Aurora (10,872 nodes) have three-hop dragonfly networks that introduce additional hierarchy [22]. In this work, we focus on the immediate effects of GPU-to-NIC associations in multi-node communication, and therefore we carried out our experiments on up to a small number of nodes placed in close vicinity by the scheduler.

SOFTWARE

Overview
CommBench offers a system-agnostic API for building custom microbenchmarks. For ease of portability, several standard communication libraries are supported by CommBench.

Integrated Communication Libraries
CommBench is intended to measure performance as seen from an end-user application. Therefore, we have integrated the most popular communication libraries (MPI, NCCL/RCCL (XCCL), and IPC) used by many applications.
For building custom microbenchmarks, CommBench relies on P2P communication functions, e.g., MPI_Isend/MPI_Irecv for MPI and ncclSend/ncclRecv for NCCL. These functions have different GPU-aware and non-blocking protocols and are implemented with different lower-level APIs for the networks within and across nodes. The communication software stack is deep and diverse, as depicted in Figure 3, and the implementation on a specific fabric is handled by lower-level interfaces closer to the hardware, which CommBench tests indirectly.
Within nodes, libraries often use vendor-provided IPC mechanisms for message passing over the high-bandwidth links (Section 2.1.1). Nevertheless, we observe in our evaluation that higher-level library implementations are sometimes inefficient or carry significant software overhead. For accurate measurements of intra-node networks, we also expose the vendor-provided IPC mechanisms directly in CommBench.

Portability Across GPU Vendors
CommBench is portable across CPUs and GPUs from multiple vendors: there are OneAPI, CUDA, and HIP ports for programming the GPUs of Intel, NVIDIA, and AMD, respectively. The choice of port, and whether CommBench uses CPU or GPU communication, are both made at compile time. According to the selected library and port, CommBench implements each P2P communication with one of the nine capabilities listed in Table 2.

MICROBENCHMARK IMPLEMENTATION
CommBench is designed around an API for composing communication patterns succinctly. A microbenchmark is composed of P2P communications. There may be a single step with concurrent communications, or multiple steps where each step depends on the previous one. Listing 1 outlines the CommBench API for constructing a single step.

CommBench API
Each benchmarking step requires three things: 1) a persistent communicator that memoizes and executes the desired communication pattern, 2) creation of the communication pattern using individual P2P communications, and 3) validation through isolated measurements.
The persistent communicator is realized with the Comm object defined in Listing 1, Line 5. The communication registry is built by the add function on Line 7. The intended use case for CommBench is for the user to supply the desired communication pattern and then call start (Line 9), which kicks off all registered communications at once, maximizing usage of the machine's bandwidth. The start call is nonblocking, and all buffers supplied to CommBench may be in use until the corresponding wait (Line 11) call completes. After the wait call, the buffers can be safely reused. For measuring the time of a step, we provide an integrated measure API (Line 13), which executes the communications multiple times and reports the minimum, maximum, average, and median times over a specified number of iterations. We use the CommBench API to implement the following microbenchmarks.

³OneCCL currently does not support non-blocking P2P functions and is therefore not applicable to the results in this paper.

Striping Data Across Nodes
The point-to-point functions of the libraries utilize only a single NIC, although there are multiple NICs per node in current systems. Therefore, conventional point-to-point benchmarks do not measure the full potential bandwidth across nodes. We propose a striping microbenchmark for measuring the point-to-point bandwidth across multi-NIC nodes.
This microbenchmark consists of three consecutive steps, depicted in Figures 4 (a)-(c). To accommodate any node type, the code is parameterized by the message size and the number of GPUs per node (g = 3 in this example). The communication steps are programmed separately in Listing 2. In Lines 2-4, the implementation library for each step is selected (IPC within nodes and NCCL across nodes). Lines 8-17 register the three communication patterns. Line 19 creates a vector of communicators that represents the communication sequence, where each step depends on the previous one. The measure_async function in Line 21 executes the communication steps asynchronously while preserving the data dependencies across steps, which we explain next.

In multi-step microbenchmarks, we assume each step depends on its predecessor. A naive way of preserving such dependencies is lock-step global execution with a barrier (global synchronization) between steps. However, the barriers would not only cause idle time, and hence inefficiency, but also yield inaccurate latency measurements, especially on large numbers of nodes. CommBench does not use global synchronization in its execution. Instead, it uses the finer-grained synchronization functions (start and wait) explained in Section 4.1. Figure 5 (a) depicts such a communication sequence for the striping example. To preserve data dependencies, a GPU waits for completion of the current step only if it has an outstanding dependency on another GPU in that step. If there is no dependency, the GPU moves on to execute the subsequent step.

Group Communication Patterns
We propose group communication patterns to measure the performance across groups of processors at a specific level of the communication hierarchy in isolation. Group communication patterns are useful for stressing the network across a set of processors at a specific level of the hierarchy, such as a node. We define three families of built-in patterns for varying the communication workload gradually across two nodes: rail, asymmetric, and symmetric (Figure 6). We then design scaling patterns for multiple nodes with various directions of data movement. Group communication patterns reveal the effects of 1) static and dynamic (if any) associations between GPUs and NICs, 2) hardware limits in isolation (e.g., switch, link, NIC), and 3) the software overhead of libraries (e.g., MPI, NCCL).

Pattern Parameterization.
We propose the parameterized group-to-group patterns shown in Figure 6 (a)-(c). These are bipartite patterns between GPUs in different groups. The configuration parameters are g and n, where g is the node size in terms of the number of GPUs and n is the number of nodes.
Figures 6 (a)-(c) show the communications across groups with parameters g = 3 and n = 2 for the (a) rail, (b) asymmetric, and (c) symmetric group-to-group patterns. To provide more diversity of patterns, we define an additional parameter p ≤ g that represents a subgroup within a group. Figure 6 shows the family of patterns for varying p.
The rail pattern generalizes the P2P pattern between GPUs in corresponding positions of two or more nodes. Selecting p = 1 for the rail pattern recovers exactly an inter-node P2P pattern. By choosing p = 1, 2, 3 in this example, the rail pattern tests the capacity of one-to-one communication between two nodes with different numbers of simultaneously participating pairs of GPUs.
The asymmetric pattern maps p GPUs in the first group to all g GPUs in the other group. For example, when p = 1 the pattern is equivalent to a one-to-g pattern, where the sender and receiver GPUs are in different nodes. By increasing p, we activate GPUs incrementally to increase the workload. When p = g, the asymmetric pattern converges to the symmetric pattern shown in Figure 6 (c).
In the asymmetric pattern, the parameter p is used only to limit the number of GPUs in the first group participating in the communication. For the rail and symmetric families, however, p limits the number of participating GPUs in both nodes.
The bottom of Figure 6 shows the communication matrix corresponding to the group communication pattern above, where each entry corresponds to a P2P communication from the sending process to the receiving process. The off-diagonal blocks show the inter-node communication pattern. These patterns are registered into a single communicator and executed as depicted in Figure 5 (b).

Direction of Data Movement
A further refinement is to consider the direction of data movement. We consider (a) unidirectional, (b) bidirectional, and (c) omnidirectional communication patterns across multiple groups, as seen in Figure 7 with our rail and symmetric patterns. The unidirectional patterns assume that there is a primary group that sends data to all of the remaining groups. The bidirectional pattern is the same as the unidirectional pattern except that the communications flow in both directions. Omnidirectional communication captures patterns where all groups communicate with all other groups, rather than having one group that communicates either unidirectionally or bidirectionally with the others. Since the asymmetric pattern is defined only on two groups, there is no sensible definition of an omnidirectional asymmetric pattern. For the rest of the paper, a group communication pattern is described by the three parameters (n, g, p) and a direction.

Application Case Study
Many applications involve irregular communications, where each GPU communicates with a sparse subset of GPUs and the message lengths vary. Our final custom microbenchmark replicates the irregular communication patterns of an application, MemXCT [17], as a complement to the regular communication patterns discussed above.
This microbenchmark is composed of concurrent P2P communications across GPUs arising from a distributed sparse matrix multiplication. The communication pattern across GPUs depends on the sparsity pattern of the matrix and its partitioning. For microbenchmarking, we chose the ADS4 dataset given in the application repository. We extracted communication patterns for four nodes of each system, which corresponds to 16, 24, 32, and 48 GPUs. Internally, CommBench stores a distributed sparse matrix that tracks the communication pattern as it is being created. Each invocation of the add function corresponds to adding an entry to the sparse communication matrix. This matrix stores no communication data, only the metadata needed to track the data dependencies across GPUs.
For separate measurements within and across nodes, we register the P2P communications into separate communicators. Nevertheless, these steps are independent of each other and therefore can be run concurrently to hide one behind the other. Concurrent execution is expressed using the synchronization functions, as depicted in Figure 5 (c). The concurrent execution waits for completion of whichever step takes the most time (inter-node communication in this case), as seen in Figure 8 (c).

EVALUATION
To cover a wide variety of contemporary communication architectures, we perform experiments on the six systems discussed in Section 2. Our experiments use the default software versions installed on each system; the specific versions are listed in the artifact description appendix. For MPI, Summit uses Spectrum, Delta and DGX-A100 use OpenMPI, and the rest of the systems use vendor-modified MPICH implementations. To place an MPI rank i on a GPU, we place it on node ⌊i/g⌋ and on the GPU with index (i mod g), as shown in Figure 1, where solid black circles are GPUs and the numbers are (i mod g). We worked with facility staff and administrators of these systems to ensure we used the best available configurations for our benchmarks.

Figure 9: Time for moving one GB from GPU to GPU across two nodes of six systems. Striping utilizes multiple NICs with three steps: split (intra-node), translate (across nodes), and assemble (intra-node). We obtain the optimal performance with a mix of libraries across steps, e.g., I+N+I means that we use IPC within nodes and NCCL across nodes. As a baseline, we also report the direct (unstriped) P2P functions of MPI, NCCL, and RCCL across nodes.

Striping Data Movement Across Nodes
We first run the striping microbenchmark (Section 4.2) on all systems to test the available libraries within and across nodes. Our results are summarized in Figure 9. We observe four common behaviors across systems. (1) Mixed-library implementations improve over uniform implementations in all cases of striped data movement. The uniform implementations with MPI (M+M+M), NCCL (N+N+N), and RCCL (R+R+R) are inefficient on at least one level of the network hierarchy. The optimal mixture of libraries uses the most efficient protocol for each level and achieves a 2.17× (geometric) average speedup. (2) Asynchronous communication improves the performance of striping. The stacked bars represent the minimum time for each communication step in isolation, and the hollow black frames represent the end-to-end time of the multiple steps. CommBench executes the steps back-to-back asynchronously, i.e., without any barrier, while respecting point-to-point data dependencies across steps. As a result, asynchronous execution is up to 20% faster and does not yield a slowdown in any case. (3) The P2P implementations of NCCL are anomalously slow within nodes when multiple nodes are involved in a communication. This problem does not arise with MPI, with IPC, or in a single-node NCCL execution. On average, MPI and IPC implementations are 4.78× faster than NCCL within nodes. We also found that the RCCL implementation is not performant. We have confirmed that RCCL falls back to a TCP protocol with Slingshot-11 NICs, which underutilizes the high-bandwidth fabrics that connect GPUs and nodes (see Section 5.2.2).
(4) MPICH uses a get protocol in one-sided IPC communications within nodes. Subscribing them to a single stream or copy engine on the receiving GPU causes serialization in the assemble step. CommBench's IPC implementation uses a put protocol by default, initiating the assemble step from multiple GPUs, and obtains substantial speedup (e.g., 11× on Aurora). We also incorporated a get protocol to overcome serialization in the split step on Aurora.

The striping microbenchmark exposes other inefficiencies, or at least surprising asymmetries, in communication software. For example, the split and assemble steps have symmetric, opposite patterns, as seen in Figure 4: split moves data from one GPU to other GPUs, while assemble moves data from other GPUs to one GPU within a node. Despite using the same communication links, they obtain different bandwidths, most significantly on Perlmutter, Frontier, and Aurora, with up to an 11× difference.
We have verified all findings independently of CommBench to ensure that our tool does not introduce significant artifacts. We cannot speculate as to causes or solutions because we do not have full access to the library implementations. The main point of our microbenchmarks is to expose the performance characteristics so that system administrators and application developers are aware of them.

Group Communications Across Nodes
We characterize the multi-NIC performance with our group-to-group patterns, specifically the rail and asymmetric families shown in Figure 6 (a)-(b), across two physical nodes. We characterize the multi-NIC utilization by varying the subgroup size p (see Section 4.3.1) and model the bandwidth in terms of the number of NICs involved in the communication across nodes.
Our evaluation, shown in Figures 10-11, reports the bandwidth and latency across nodes when we set the following values for (n, g, p): (2, 4, p) for Delta and Perlmutter, (2, 6, p) for Summit, (2, 8, p) for Frontier and DGX-A100, and (2, 12, p) for Aurora. Since we do not have direct control over the NICs with the high-level communication libraries that we test, we vary the number p of GPUs to empirically determine their logical bindings. We first present the bandwidth results with large messages (larger than 16 MB) in Figure 10 and then the latency results with small messages (4 bytes) in Figure 11.

Modeling Bandwidth Across Nodes. We use our group-to-group patterns to characterize the GPU-to-NIC behavior. On our systems, the measured bandwidth follows the analytical models given in Equations (1)-(2). Group-to-group patterns gradually change the workload across nodes to expose hardware differences across systems, testing libraries' performance portability and helping developers make choices for moving their applications across systems.
We confirm the proposed models against the empirical results in Figure 10 (a)-(e). On the other hand, (f) Aurora employs a round-robin scheme that can be modeled by counting the number of active NICs in each node. We activate all (eight) NICs per node, resulting in the bandwidth profile in Figure 10 (f). The models given in (1) and (2) assume the static GPU-to-NIC associations in Figure 2 (a)-(f). The static models break down if the associations are determined dynamically, as on (e) DGX-A100 when we enable dynamic behavior with NCCL's hardware-specific plugin, where the logical topology changes depending on the workload. The speedup is provided by Mellanox proprietary software (SHARP [12]); we were able to expose this behavior using the unidirectional asymmetric group-to-group patterns.

Underperforming Cases. The measurements in Figure 10 quantify the performance of the libraries relative to our models of peak performance. The figure also exposes some severely underperforming cases that we suggest developers avoid.
The first case (Figure 10 (d), purple bars that are barely visible) is the RCCL library, a port of NCCL for the AMD GPUs on Frontier. As mentioned earlier, we confirmed that RCCL does not have a native implementation for Slingshot-11 NICs and therefore falls back to the TCP protocol. Hence, RCCL gives correct results but with a peak bandwidth of no more than a few GB/s and high latency.
The other underperforming case is GPU-aware MPI on DGX-A100 (Figure 10 (e), orange bars). We confirmed with the system administrator of this particular machine that it was configured for machine-learning frameworks such as PyTorch, which rely on NCCL, and hence the GPU-aware MPI is not tuned for multiple NICs. CommBench makes it possible to reliably detect and isolate such configuration issues.

NCCL's latency, shown in Figure 11, is significantly higher than that of MPI, which is approximately 6 microseconds on Frontier and 8 microseconds on Aurora. RCCL's TCP implementation has about 12 ms latency on Frontier (not plotted in the figure). These observations match the existing literature [35].

5.2.5 Group-to-Group vs. MPI Collectives. The group-to-group patterns in Figure 11 measure a lower bound for the MPI collectives, because the group benchmarks measure only the portion of time spent across nodes, whereas MPI benchmarks measure the end-to-end collective functions, i.e., both within and across nodes.
To validate the lower-bound property of our benchmark, we compare our group-to-group benchmarks with traditional MPI collective functions. We compare MPI_Scatter and MPI_Alltoall (represented with diamond marks) with our asymmetric-family patterns in the unidirectional (p = 1) and bidirectional (p = g) configurations, respectively. The group-to-group benchmarks characterize the network performance more accurately than MPI collective functions, because the former measure only the communication across the targeted interconnect (across nodes), whereas the latter perform additional (intra-node) communications.

Self Communication
The P2P bandwidth for self communication (i.e., from one buffer to another within the same GPU's memory) should be comparable to the processor's memory bandwidth (theoretical GPU memory bandwidths are: Summit, 900 GB/s; Delta, Perlmutter, and DGX-A100, 1.55 TB/s; Frontier and Aurora, 1.64 TB/s). However, we occasionally observe significantly lower bandwidth than expected. Table 3 shows the utilization of the rated GPU memory bandwidth with different libraries.

Scaling of Group Communications
We run CommBench on Frontier and Aurora using the rail and symmetric patterns on multiple nodes. We use configurations (n, 8, 8) for (a) Frontier and (n, 12, 12) for (b) Aurora and stress the external network by increasing n up to eight nodes, employing 64 and 96 GPUs, respectively, as shown in Figure 12. We employ MPI to measure the bandwidth on both CPUs and GPUs.
On both systems, the bidirectional rail pattern achieves the highest bandwidth, approximately 175 GB/s/node on Frontier's GPUs (orange) and 275 GB/s/node on Aurora's CPUs (blue) in Figure 12. Both systems thus utilize about 90% of the theoretical bandwidth, Frontier when communicating from GPUs and Aurora when communicating from CPUs, because those processors are directly connected to the NICs. The symmetric pattern obtains greater bandwidth when more nodes participate in the communication, due to better saturation of the overall network. The omnidirectional bandwidth shows the opposite trend, due to contention on the switches that connect the nodes.

Application-Specific Microbenchmark
We introduced a case study in Section 4.4. In this application-specific microbenchmark, the communication pattern is irregular, and the intra- and inter-node portions are first measured in isolation and then run concurrently. The results are shown in Figure 13. In concurrent execution, the intra-node cost is hidden behind the communication across nodes in all cases except Aurora: communications within and across nodes slow each other down, even though they occur on separate networks.
We use an MPI collective function (MPI_Alltoallv) as a sanity check for this microbenchmark. The collective function expresses nonuniform all-to-all communications such as this application-specific communication graph. NCCL has no equivalent collective function, which means users must implement such nonuniform patterns themselves if they wish to use NCCL. CommBench makes it much easier to write such application-specific patterns, particularly if one wants to exploit combinations of different libraries for the best performance.

Figure 13: Time spent for the application-specific microbenchmark on four nodes of each system (lower is better). CommBench executes different libraries within and across nodes concurrently, detects underperforming cases, and exposes performance portability across systems.
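For reference, MPI_Alltoallv describes such a nonuniform pattern through per-destination element counts plus displacements into a packed buffer; the displacements are simply an exclusive prefix sum over the counts. A minimal sketch of that bookkeeping (the helper name is ours, not part of any library):

```cpp
#include <numeric>
#include <vector>

// Given per-destination element counts (as passed to MPI_Alltoallv),
// compute the displacement of each destination's block in the packed
// send buffer: an exclusive prefix sum over the counts.
std::vector<int> displacements(const std::vector<int>& counts) {
    std::vector<int> displs(counts.size(), 0);
    std::exclusive_scan(counts.begin(), counts.end(), displs.begin(), 0);
    return displs;
}
```

Zero counts simply yield repeated displacements, which is how sparse, irregular graphs such as the MemXCT pattern are encoded in the dense MPI_Alltoallv interface.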

RELATED WORK
There are multiple previous efforts on benchmarking HPC networks [2,6,21,25,34]. To the best of our knowledge, none consider group communication patterns that characterize multi-GPU, multi-NIC behavior with multiple libraries.
Prior research has explored bandwidth saturation with respect to the number of processors in CPU-based systems [14,15,18]. We follow a similar approach, focusing on hierarchical systems with multiple GPUs and NICs to saturate the bandwidth within and across nodes. We also explore the logical topology between GPUs and NICs.
Previous work has investigated understanding and modeling inter-GPU communication in large-scale HPC systems, examining data movement variations between multiple GPUs [4] and irregular P2P communications [24]. However, these studies primarily focused on MPI as the sole communication layer and relied heavily on CPU involvement. Furthermore, characterizing interconnect heterogeneity [30] has mostly targeted single systems, and microbenchmarks [31] exploring transfer behavior across data placements have been limited to CUDA primitives. CommBench introduces group-to-group patterns and empirically tests them on a variety of HPC systems while offering the flexibility to employ different communication libraries.
coNCePTual [28] is a DSL for designing benchmarks that stress communication layers, focusing on fine-grain control over application properties such as buffer lifetimes. coNCePTual does not focus on support for collective, hierarchical communications, nor does it attempt to elucidate the performance characteristics of hierarchical networks.
The current practice of system administrators and users is to run standard benchmarks provided by the MPI and NCCL distributions, such as the MVAPICH benchmarks (OSU Benchmarks [1]) and the NCCL tests. These tests report the performance of P2P and collective communications that are offered by optimized collectives within communication layers [19,32,36], but lack an API for user-defined, application-specific communication patterns. We address this limitation by providing the CommBench API, enabling users to easily express and benchmark desired communication patterns across various communication layers.

CONCLUSION
Contemporary HPC systems comprise fat nodes with multiple GPUs and NICs that form complex network hierarchies that traditional collective benchmarks do not adequately characterize. To understand the performance of multilevel networks, we propose extended group-to-group benchmarking patterns to target specific levels of the network hierarchy. We implement these patterns with CommBench, a framework for composing and benchmarking user-defined communication patterns with multiple GPU communication libraries. We evaluate CommBench on six state-of-the-art systems. Our benchmarks reveal the performance characteristics of these systems; for example, we identified three multi-NIC scaling behaviors in packed, round-robin, and dynamic schemes, exposing the logical binding of GPUs to NICs that is not normally visible to the user. Depending on the system, library choice, and underlying group-to-group pattern, we saturate between 50% and 90% of the theoretical bandwidth available in each configuration. Since we can stress specific communication channels in our approach, we consistently measure higher bandwidth (up to 30%) and lower latency (up to 3x) with group-to-group patterns compared to traditional collective patterns. CommBench's portability and flexibility make benchmarking of modern communication networks more comprehensive, more detailed, and easier.
This research used resources of the National Energy Research Scientific Computing Center (NERSC), a Department of Energy Office of Science User Facility, using NERSC award ASCR-ERCAP0029675.

Figure 2: Interconnect between GPUs and NICs within nodes. All devices are physically connected, but each GPU uses a single NIC for P2P communication across nodes. In our experiments, we use the default bindings as shown in (a)-(f). Each machine's peak bandwidth is based on the number and bandwidth of NICs per node (12.5-200 GB/s).

Figure 3: Overview of the communication software stack.

Listing 1: API for registering each microbenchmark step.

// Data type is templatised as T
template <typename T>
class Comm {
  // Create a benchmark step with a library of choice.
  Comm(Library);
  // Register a P2P communication into the step.
  ...
};

Figure 4: Striping of P2P data across GPUs for maximizing the bandwidth across nodes. It takes three steps to (a) split the original data into three stripes, (b) translate the stripes across nodes using all GPUs, and (c) assemble the original data at the receiving GPU.
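The split step above amounts to partitioning a contiguous buffer into near-equal (offset, length) pieces. The sketch below is an illustrative version of that partitioning, not CommBench's actual implementation; it handles sizes that do not divide evenly by giving the leading stripes one extra byte.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Split a message of `total` bytes into `stripes` near-equal pieces,
// returning (offset, length) pairs into the original buffer. The first
// total % stripes pieces carry one extra byte each.
std::vector<std::pair<std::size_t, std::size_t>> stripe(std::size_t total,
                                                        std::size_t stripes) {
    std::vector<std::pair<std::size_t, std::size_t>> parts;
    std::size_t base = total / stripes, extra = total % stripes, offset = 0;
    for (std::size_t i = 0; i < stripes; ++i) {
        std::size_t len = base + (i < extra ? 1 : 0);
        parts.push_back({offset, len});
        offset += len;
    }
    return parts;
}
```

Each stripe can then be sent through a different GPU-NIC pair and reassembled at the receiver by copying to the recorded offsets.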

Figure 5: Synchronization schemes for scheduling steps of the proposed microbenchmarks from a process' perspective. The horizontal axis represents time, and each box corresponds to a communication step. The vertical dashed lines show the earliest moment of return to the indicated functions on each GPU. For accurate measurements, the start and wait functions must be called from all processes in the shown order. The end-to-end time is the maximum of the total time taken on all processes.
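The end-to-end measurement rule from the caption above can be stated as a one-line reduction; a minimal sketch (the function name is ours):

```cpp
#include <algorithm>
#include <vector>

// End-to-end time of a benchmark step: since every process must reach the
// final wait before the step completes, the measurement is the maximum of
// the per-process elapsed times.
double end_to_end(const std::vector<double>& per_process_seconds) {
    return *std::max_element(per_process_seconds.begin(),
                             per_process_seconds.end());
}
```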

Figure 8: Application-specific (MemXCT) communication patterns for (a) 16 GPUs and (b) 48 GPUs. The individual message sizes shrink whereas (c) the total data movement grows with the number of GPUs. We replicated this pattern for microbenchmarking on four nodes of each system.
The resulting patterns across 16 and 48 GPUs are shown in Figure 8 (a) and (b), respectively.

Figure 10: Bisection bandwidth profiles across two nodes, measured with blue (CPU-only MPI), orange (GPU-aware MPI), yellow (NCCL), and purple (RCCL) bars. Hollow bars show the proposed model in Equations (1)-(2). Group-to-group patterns gradually change the workload across nodes to expose hardware differences across systems, testing libraries' performance portability and helping developers make choices for moving their applications across systems.

Figure 11: Group-to-group latency across two nodes with the rail and asymmetric patterns. The horizontal index represents the varied parameter. The comparisons with corresponding MPI collective functions are marked with diamonds.

Figure 12: Bandwidth per node with unidirectional (hollow circles), bidirectional (solid circles), and omnidirectional (solid squares) scaling patterns (shown in Figure 7) across multiple nodes of (a) Frontier and (b) Aurora. Blue marks represent CPU-only MPI and orange marks represent GPU-aware MPI measurements.

Table 1: Number of CPUs, GPUs, and NICs per node on test systems.

Table 3: Utilization of rated GPU memory bandwidth for self-communication of 1 GB.

MPI vs. NCCL Performance. When available, NCCL usually obtains higher bandwidth than MPI on GPUs; nevertheless, we observe higher latency with NCCL compared to MPI. The latency of NCCL is more than 40 microseconds across systems, as shown with the yellow marks in Figure