Bruck Algorithm Performance Analysis for Multi-GPU All-to-All Communication

In high-performance computing, collective communication is critical for facilitating comprehensive data exchange involving all processes within an MPI communicator. Due to their inherently global nature, many collective operations present scalability challenges, particularly the all-to-all data shuffle with its quadratic communication pattern. Using a logarithmic communication pattern, the Bruck algorithm was designed to provide communication efficiency for all-to-all data shuffles involving short-sized messages. The Bruck algorithm has been extensively used to facilitate global data shuffles in a multi-CPU environment and is also part of the MPICH and Open MPI implementations. This work presents the first investigation of using the Bruck algorithm for all-to-all communication in multi-GPU systems using the NVIDIA Collective Communications Library (NCCL). Our experimental study demonstrates that while the Bruck algorithm exhibits superior performance for small-sized messages in a multi-CPU environment, the same advantages are not evident for multi-GPU environments. Furthermore, we describe an optimized Bruck algorithm implementation in NCCL and compare it to NCCL's default all-to-all and MPI-based implementations. Finally, we discuss the challenges and opportunities of implementing new multi-GPU collectives using NCCL's public-facing API.


INTRODUCTION
Collective functions facilitate data exchange involving all processes within an MPI communicator. Historically, collective functions have been used extensively by irregular applications [21,29,35] to manage their non-uniform and often sparse workloads. Collectives are generally known to be challenging to scale due to their global nature. Among all collectives, the all-to-all data shuffle is notorious for being the most difficult to scale [14,27,33], primarily because of its quadratic communication pattern.
Global data shuffle can be classified into two categories: uniform, where processes exchange the same amount of data among each other, and non-uniform, where exchanged message sizes can vary. Both can be performed using MPI's built-in collectives MPI_Alltoall and MPI_Alltoallv. These are used by a variety of applications, including parallel training of large-scale neural networks [26], transpose computations in parallel FFT computation [9], and parallel sorting [30].
In this work, we focus on uniform all-to-all, where data exchanges are generally performed using one of two algorithms: spread-out [13] or Bruck [6,33]. Spread-out internally performs a linear (with respect to the process count) number of communication steps. It can be visualized as a matrix-like communication pattern, where each process sends data to all other processes in a collective manner. For $P$ processes, each process communicates with the other $P - 1$ processes, resulting in a total of $P \times (P - 1)$ data exchanges. The Bruck algorithm, on the other hand, is based on circular shift and bit-wise exchange operations and performs $\log_2 P$ communication steps. Reducing communication steps (relative to spread-out) comes at the cost of sending more total data. Therefore, spread-out is used for the exchange of large-sized messages, where the available bandwidth can be saturated, and Bruck is used for short-sized messages, where the performance improvement from fewer communication rounds compensates for the cost of sending more data. Popular implementations of MPI, MPICH [1] and Open MPI [2,11], both rely on a decision tree, which helps choose between the two algorithms based on scale and workload.
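The trade-off between the two algorithms can be made concrete with a small counting sketch. It assumes the common formulation in which, during Bruck step k, a process forwards every block whose index has bit k set. The function names are illustrative and are not taken from MPICH, Open MPI, or NCCL.

```python
def spread_out_cost(num_procs):
    """Return (communication steps, blocks sent per process) for spread-out."""
    return num_procs - 1, num_procs - 1


def bruck_cost(num_procs):
    """Return (communication steps, blocks sent per process) for Bruck.

    Assumes that in step k a process forwards every block whose index
    (0 .. num_procs - 1) has bit k set.
    """
    steps = (num_procs - 1).bit_length()  # equals ceil(log2(num_procs))
    blocks = sum(
        1
        for k in range(steps)
        for index in range(num_procs)
        if (index >> k) & 1
    )
    return steps, blocks


if __name__ == "__main__":
    for p in (4, 64, 512):
        print(p, spread_out_cost(p), bruck_cost(p))
```

Under these assumptions, 4 processes need 3 steps and 3 blocks per process with spread-out versus 2 steps and 4 blocks with Bruck. At 512 processes the counts are 511 steps and 511 blocks versus 9 steps and 2304 blocks, which is why Bruck is attractive only when blocks are small.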
In the last decade, we have seen a transition towards more heterogeneous HPC environments, where CPUs are coupled with high-performance coprocessors such as GPUs. For example, modern HPC systems such as Aurora [31], Perlmutter [19], and Frontier [4] all rely heavily on GPUs to attain their peak FLOP performance. While the Bruck algorithm is known to yield better performance for small-sized messages in a multi-CPU environment [10,20,33], no study has been performed to understand its impact in a multi-GPU environment. In this paper, we investigate using the Bruck algorithm for multi-GPU all-to-all communication to understand if the algorithm benefits RDMA multi-GPU collective communication.
We report an experimental study that compares different MPI and NCCL implementations of all-to-all communication primitives. Our analysis ultimately finds that the Bruck algorithm implemented using the NCCL API does not offer the same performance improvements in multi-GPU settings that it shows in multi-CPU settings with MPI. Finally, we delve into details to explain why the Bruck algorithm is not suited for multi-GPU environments. This study is important for the HPC community because it sheds light on the efficacy of the Bruck algorithm in a multi-GPU environment, a domain where its performance had not been comprehensively evaluated before. The negative result, indicating that the Bruck algorithm does not offer the anticipated performance improvements for short-sized messages in multi-GPU scenarios, is valuable for the community and provides a deeper understanding of the complexities associated with collective communication in multi-GPU scenarios, guiding future research toward more efficient solutions.

BACKGROUND
In this section, we summarize important applications that require and use all-to-all communication and describe the basic and optimized versions of the Bruck algorithm. While the algorithm has been adopted by state-of-the-art MPI implementations for multi-CPU communication, there is no previous study that assesses its performance in multi-GPU settings.

All-to-all Communication
In parallel computing, there exist several fundamental collective communication patterns. An all-to-all operation refers to every process sending data to every other process and receiving data from every other process. There are many use cases for all-to-all communication, with some simple examples including parallel FFT [32], computing matrix transposes, and accelerating parallel relational algebra at scale [15,16].
In uniform all-to-all, the amounts of data being sent and received by each process are fixed, whereas in non-uniform all-to-all, the amounts of data sent and received may be variable. Some possible approaches for achieving this pattern are the point-to-point, spread-out, and Bruck algorithms. In point-to-point communication, each process sends an entire message directly to, and receives one from, every other process in $P - 1$ communication steps, where $P$ is the number of processes. Destination processes are chosen in a round-robin fashion to avoid a bottleneck from multiple processes attempting to send data to the same destination at once. While simple to implement, this approach can lead to network contention on the receiver side as the number of processes increases. The spread-out algorithm is the standard linear-time implementation of all-to-all data shuffle adopted by production MPI libraries and is used for both uniform and non-uniform data exchanges. Unlike point-to-point, processes only send one data block directly to a target destination process per communication round. This algorithm also takes $P - 1$ communication steps. A diagram of the spread-out algorithm can be seen in Figure 1.
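The round structure of spread-out can be illustrated with a small single-process simulation. This is a sketch over Python lists, not an MPI implementation; it assumes the round-robin schedule in which, in round r, rank i sends one block to rank (i + r) mod P, and the function name is hypothetical.

```python
def spread_out_alltoall(send):
    """Simulate a uniform spread-out all-to-all.

    send[i][j] is the block that rank i wants to deliver to rank j.
    Returns recv, where recv[i][j] is the block rank i received from rank j.
    """
    p = len(send)
    recv = [[None] * p for _ in range(p)]
    for rank in range(p):
        # A rank's own block is a local copy and needs no communication.
        recv[rank][rank] = send[rank][rank]
    for r in range(1, p):  # P - 1 communication rounds
        for rank in range(p):
            dst = (rank + r) % p  # round-robin choice of destination
            recv[dst][rank] = send[rank][dst]  # exactly one block per round
    return recv


if __name__ == "__main__":
    p = 4
    send = [[(src, dst) for dst in range(p)] for src in range(p)]
    for row in spread_out_alltoall(send):
        print(row)
```

Because every rank targets a different destination within a round, each rank also receives exactly one block per round, and the final receive buffers are the transpose of the send buffers.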
Libraries like Unified Collective Communication (UCC) [8] support distributed heterogeneous communication by selecting the best implementation available (e.g., using NCCL or MPI) for a specific use case based on various runtime heuristics. Our experimental study will shed some light on how some of those heuristics related to all-to-all collectives could be defined based on message size and scale.

Bruck Algorithm
The Bruck algorithm for all-to-all communication within message-passing systems was first published in 1997 [7]. Bruck differentiates itself from alternative all-to-all communication algorithms by minimizing the total number of internal communication steps involved in the all-to-all transaction. It reduces them from $O(P)$ to $O(\log P)$ communication steps, where $P$ represents the number of processes or compute units. This is possible by transmitting a larger aggregate data size while distributing it over a reduced number of iterations. This strategy offers significant advantages when dealing with data messages of relatively small sizes (16b to 2K) [5]. Because small messages leave bandwidth to spare, the extra data volume is cheap compared with the communication rounds saved. This enables the Bruck algorithm to process small data messages efficiently, improving overall performance and reducing execution time in communication-bound scenarios. Figure 2 demonstrates how Bruck performs $\log_2 4 = 2$ communication steps for 4 processes, as opposed to the 3 spread-out communication steps shown in Figure 1.
As a testament to its effectiveness, the Bruck algorithm has been widely adopted in state-of-the-art MPI implementations, including MPICH [33] and Open MPI [12], specifically to implement the uniform all-to-all collective communication operation (MPI_Alltoall). The basic Bruck algorithm (see Figure 2) has three steps: an initial local rotation, $\log_2 P$ global communication steps, and a final local inverse rotation [7]. The modified inverse Bruck algorithm enhances the basic Bruck algorithm by removing the final local inverse rotation step [34]. This removal is possible through subtle adjustments in data copying within earlier phases, which remove the need for the final local inverse rotation. Figure 3 illustrates the difference between the modified inverse Bruck and the basic variant. The zero-copy variant further improves the algorithm by eliminating the need for explicit data copying in uniform all-to-all communication, enabling in-place data access during communication operations. Specifically, when performing a uniform all-to-all operation in MPI_Alltoall, the zero-copy Bruck algorithm can achieve a significant performance improvement [34].
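The three steps of the basic Bruck algorithm can be traced with a single-process simulation over Python lists. This is a sketch of one common formulation, assuming that in step k each rank sends to rank (i + 2^k) mod P every block whose slot index has bit k set. The send direction and the function name are assumptions and may differ from the implementations cited above.

```python
def bruck_alltoall(send):
    """Simulate the basic (three-phase) Bruck all-to-all.

    send[i][j] is the block that rank i wants to deliver to rank j.
    Returns recv, where recv[i][j] is the block rank i received from rank j.
    """
    p = len(send)

    # Phase 1: local rotation. Slot j of rank i now holds the block
    # destined for rank (i + j) % p, so it must travel a distance of j.
    buf = [[send[i][(i + j) % p] for j in range(p)] for i in range(p)]

    # Phase 2: ceil(log2(p)) store-and-forward communication steps.
    dist = 1
    while dist < p:
        slots = [j for j in range(p) if j & dist]  # slots with this bit set
        # Snapshot outgoing data so all ranks exchange "simultaneously".
        outgoing = [[buf[i][j] for j in slots] for i in range(p)]
        for i in range(p):
            src = (i - dist) % p  # rank i receives from rank i - dist
            for pos, j in enumerate(slots):
                buf[i][j] = outgoing[src][pos]
        dist *= 2

    # Phase 3: inverse rotation. Slot j of rank i holds the block that
    # originated at rank (i - j) % p; reorder so index equals source rank.
    return [[buf[i][(i - s) % p] for s in range(p)] for i in range(p)]


if __name__ == "__main__":
    p = 4
    send = [[(src, dst) for dst in range(p)] for src in range(p)]
    for row in bruck_alltoall(send):
        print(row)
```

The simulation also handles process counts that are not powers of two, since every slot index below P is covered by the bits examined in phase 2.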

ALL-TO-ALL COLLECTIVE IMPLEMENTATIONS
Recent work based on the modified Bruck [10] has been shown to outperform the linear-step spread-out implementation in a multi-CPU setting. For this reason, we hypothesized that the Bruck algorithm could also be promising for achieving faster and more efficient all-to-all communication in multi-GPU scenarios. We were particularly interested to see how it performs at scale, given that modern HPC clusters are now being built with thousands of total GPUs. We begin this section by presenting the existing implementation of all-to-all data exchange within NCCL and then describe our implementation of the Bruck algorithm within NCCL.

Default NCCL All-to-all Implementation
As of the NCCL 2.18.1 documentation [23], there does not exist an explicitly named implementation of all-to-all data shuffle within NCCL. Rather, all-to-all communication is achieved by defining a for-loop of NCCL send and receive operations wrapped within an ncclGroup. This is conceptually equivalent to the spread-out algorithm, as it takes a linear number of communication rounds. For the sake of clarity, even though NCCL does not provide a named implementation for all-to-all data shuffle, we will refer to this as the default NCCL all-to-all for the remainder of this paper, expressing it in Algorithm 1. As can be seen in the algorithm, there is a linear-step loop (with respect to the number of processes) in line number 5, where for each iteration, a process sends data to and receives data from some other process. A key point to note is the usage of ncclGroupStart and ncclGroupEnd, which wrap the loop of communication rounds.
Algorithm 1: Default NCCL all-to-all implementation
The implementation of this collective in NCCL is interesting because every send and receive operation is wrapped into one ncclGroup, a concept for which MPI has no direct equivalent. ncclGroups are defined by their start and end functions, which queue any intermediate NCCL operations to be executed after the group ends. This approach enables the NCCL runtime to capture the full communication scenario and apply optimizations. The NCCL documentation states that groups are used for 'managing multiple GPUs from one thread (to avoid deadlocks), aggregating communication operations to improve performance, or merging multiple send/receive point-to-point operations' [22].
An ncclGroup execution is treated as a single communication, avoiding the GPU kernel launch overhead that would be associated with executing each communication operation individually. Although the default NCCL all-to-all appears to use the spread-out algorithm, internal runtime optimizations may merge the calls to improve performance.
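The structure described for the default NCCL all-to-all, one group wrapped around a loop of send and receive calls, can be mirrored with a toy recorder. The `ToyComm` class below is hypothetical and is not the NCCL API: its `group()` context manager merely stands in for the ncclGroupStart and ncclGroupEnd pair, and it only counts queued operations rather than moving data.

```python
from contextlib import contextmanager


class ToyComm:
    """Toy recorder (not NCCL): tracks groups and the operations queued in them."""

    def __init__(self):
        self.groups = []   # one list of operations per completed group
        self._open = None  # operations queued in the currently open group

    @contextmanager
    def group(self):
        # Entering models the group start; leaving models the group end,
        # at which point the queued operations are handed over as one unit.
        self._open = []
        yield
        self.groups.append(self._open)
        self._open = None

    def send(self, peer):
        self._open.append(("send", peer))

    def recv(self, peer):
        self._open.append(("recv", peer))


def default_alltoall(comm, num_ranks):
    """One group around a linear loop of send/recv pairs."""
    with comm.group():
        for peer in range(num_ranks):
            comm.send(peer)
            comm.recv(peer)


comm = ToyComm()
default_alltoall(comm, 4)
print(len(comm.groups), len(comm.groups[0]))
```

In this toy model, the whole exchange among 4 ranks is captured as a single group holding 8 queued operations, which is the property that lets a runtime see the full communication scenario before executing any of it.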

NCCL Bruck Implementation
In Algorithm 2, we present the Bruck algorithm as implemented using NCCL. There are two key points to note in this algorithm:
(1) The total number of iterations performed here is $\log_2 P$.
(2) The ncclGroupStart and ncclGroupEnd calls wrap each send and receive pair individually (see line numbers 17 and 19), as opposed to encompassing the entire for loop as in Algorithm 1.
The usage of ncclGroups can be explained by further examining the Bruck algorithm. As opposed to the spread-out (or point-to-point) implementation of Algorithm 1, Bruck is a store-and-forward algorithm that takes $\log_2 P$ communication steps. This means that both the send and receive data buffers are used for sending, receiving, and storing data during intermediate communication rounds. Unlike in spread-out, both buffers are involved in every communication step, as some received data blocks will have to be present for a later communication step. This store-and-forward nature of the algorithm imposes constraints on the ordering of the communication rounds. Unlike the linear-step implementations, Bruck must maintain an explicit communication ordering, where each iteration must occur after the previous one in physical time. The algorithm can also be seen in Figure 3, which shows that the different communication phases must be executed in sequential order. As discussed earlier, once an ncclGroup enqueues a set of send and receive operations, the NCCL runtime is responsible for scheduling those operations, and strict ordering cannot be enforced. Therefore, wrapping all of the send and receive operations produced by the Bruck algorithm into a single ncclGroup will lead to incorrect results. For this reason, we could only create an ncclGroup for each pair of send and recv operations (see Algorithm 2, line numbers 17 and 20). In the evaluation section, we discuss how this requirement affects the performance of the Bruck algorithm for multi-GPU all-to-all collective communication using NCCL.
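The ordering constraint can be demonstrated with a toy simulation. It models a single group under a deliberately simple assumption: because no ordering between steps is enforced, every queued send reads the buffer as it was when the group started. Real NCCL scheduling is not specified at this level, so this is only an illustration of why stale intermediate data breaks a store-and-forward algorithm. The function and parameter names are hypothetical.

```python
def bruck_with_grouping(send, single_group):
    """Toy Bruck all-to-all over Python lists.

    send[i][j] is the block rank i wants to deliver to rank j.
    single_group=True models queuing every step in one group, where
    (by assumption) each send reads the buffer state from group start.
    single_group=False models one group per step, so each step sees
    the data stored by the previous step.
    """
    p = len(send)
    # Initial rotation: slot j of rank i holds the block for rank (i + j) % p.
    buf = [[send[i][(i + j) % p] for j in range(p)] for i in range(p)]
    at_group_start = [row[:] for row in buf]

    dist = 1
    while dist < p:
        # Buffer contents that this step's sends actually read from.
        source = at_group_start if single_group else [row[:] for row in buf]
        for i in range(p):
            for j in range(p):
                if j & dist:
                    buf[i][j] = source[(i - dist) % p][j]
        dist *= 2

    # Inverse rotation: index the result by source rank.
    return [[buf[i][(i - s) % p] for s in range(p)] for i in range(p)]


p = 4
send = [[(src, dst) for dst in range(p)] for src in range(p)]
expected = [[(src, dst) for src in range(p)] for dst in range(p)]
print(bruck_with_grouping(send, single_group=False) == expected)
print(bruck_with_grouping(send, single_group=True) == expected)
```

With per-step grouping the result matches the expected transpose. In the single-group model, a block that needs two hops (for example, slot 3 with 4 ranks) is forwarded from its stale starting location in the second step, so the final buffers are wrong.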

EVALUATION
In this section, we report experimental studies to assess the performance of uniform all-to-all collectives for small-sized messages using the Bruck algorithm in both multi-CPU and multi-GPU settings. Furthermore, we compare implementations of the Bruck algorithm using both NCCL (multi-GPU) and MPI (multi-CPU) to understand if and when this algorithm would be effective for multi-GPU collectives. We performed our experimentation on the Polaris supercomputer [17] operated by the Argonne Leadership Computing Facility at Argonne National Laboratory. Polaris consists of 560 nodes, each containing a single 2.8 GHz AMD EPYC Milan 7543P processor.
Algorithm 2: NCCL Bruck algorithm
Figure 4: Weak scaling study comparing MPI all-to-all methods, our basis for investigating Bruck performance in NCCL. The Bruck implementation performs significantly better than spread-out for small-sized messages and at larger scales. This advantage, however, becomes much smaller for large-sized messages.

NCCL-based Multi-GPU All-to-all
For our multi-GPU implementation of the Bruck algorithm, we used NVIDIA's NCCL library. NVIDIA provides an open source tool for benchmarking NCCL collectives called nccl-tests [25]. It provides many useful features, such as a configurable number of warm-up and benchmark iterations, varying message sizes, and result verification. For these reasons, we used nccl-tests to perform our experiments. The nccl-tests benchmark suite first prepares the GPU buffers and then passes them to a test function. The test function is a generic interface that links to a range of different implementations. This ensures that all of the tests receive the same data to start with. The process of adding new algorithms to the benchmark consists of creating a separate test function for each new algorithm. Since everything is implemented as a test function, the timer can start and stop in the same place across all algorithms. All existing tests conclude once GPU buffers contain the final result. We performed the following four sets of experiments:
(1) Default NCCL All-to-All: this is the default implementation of all-to-all currently provided by NCCL. This linear-step implementation is directly based on Algorithm 1.
(2) NCCL Modified Bruck: this is our implementation of modified Bruck using the public-facing NCCL APIs. It performs a logarithmic number of communication rounds and is directly based on Algorithm 2.
(3) MPI Spread-out: an implementation of spread-out that relies upon the multi-CPU data exchange protocol.
(4) MPI Modified Bruck: an implementation of modified Bruck that relies upon the multi-CPU data exchange protocol.
We note that for the latter two implementations, we performed a data offload (i.e., memcpy) between GPU and CPU before performing the MPI collective.This approach is useful to understand the trade-offs of using direct multi-GPU collective operations vs. offloading the data to CPUs to execute multi-CPU (MPI) collectives instead.
The experimental study reported in Figure 5 compared the MPI all-to-all implementations and the default NCCL all-to-all performance to our modified Bruck implementation with message sizes ranging from 16 to 2. The experiments were run on 16, 32, 64, and 128 nodes, resulting in data collected with 64, 128, 256, and 512 GPUs. All experiments were performed using one MPI process per GPU, measuring average execution times across MPI ranks, using send and receive buffers of type ncclChar, non-blocking communication, and setting the NCCL_PLUGIN_P2P environment variable to UCX [28] (performance without the UCX plugin reported the same overall trends). As a note, although it is possible to use one MPI process per node to manage four GPUs each, a best practice is to let each MPI process be responsible for managing exactly one GPU. Default configuration values were left unchanged unless a different value resulted in a performance improvement, as in the case of setting a non-default NCCL_PLUGIN_P2P value. For all experiments, the number of warm-up iterations was set at 100, and the number of benchmark iterations was set at 500. Increasing the number of samples measured helped reduce the influence of outliers.
In Figure 5, we report the experimental results of our weak scaling study comparing both multi-CPU and multi-GPU performance for uniform all-to-all collectives. For multi-CPU settings, we also include the time to load and unload data between CPU and GPU. This is done to understand when it would be convenient to rely on direct multi-GPU collectives vs. multi-CPU MPI collectives. We make two key observations: (i) Direct GPU-to-GPU communication is slower at scale when compared to offloading the same data to the CPU and performing the same MPI all-to-all collective across CPUs. This trend is shown by the red trendline in all figures, which corresponds to multi-CPU-based modified Bruck; it consistently outperforms all other approaches at a larger scale. (ii) Surprisingly, the default NCCL-based all-to-all method demonstrates better performance than our Bruck implementation in NCCL. As highlighted in the next section, such performance loss can be attributed to the overhead associated with GPU kernel launches that take place during the execution of each separate ncclGroup.

DISCUSSION
Figure 5: Weak scaling study comparing multi-GPU and multi-CPU all-to-all methods. In multi-GPU settings, the default NCCL all-to-all implementation always outperforms the Bruck implementation. Furthermore, at a larger scale, offloading data to the CPU and using an MPI multi-CPU implementation yields better performance for those message sizes.
As stated previously, the Bruck algorithm consists of various phases: an initial data rotation, a communication phase, and a final data rotation. Each of these phases is dependent on the result of the previous phase, and this is true for each of the communication steps as well. Each Bruck communication step is a sendrecv operation that requires some amount of data to be copied from the receive buffer into a temporary buffer beforehand. This makes Bruck an inherently serial algorithm and disqualifies our NCCL implementation from using a single ncclGroup to aggregate and optimize all of the communication operations at once. In this scenario, we lose the opportunity for the NCCL runtime to perform aggregate communication optimizations, and we also incur the overhead of repeatedly creating and executing separate ncclGroups, all while the operation progresses synchronously. In contrast, the default NCCL all-to-all is able to execute its entire communication scenario asynchronously with exactly one ncclGroup, incurring only a constant amount of kernel launch overhead.
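The difference in launch overhead can be expressed as a back-of-the-envelope model. The sketch below assumes that each ncclGroup costs one kernel launch, that the default all-to-all uses exactly one group, and that the Bruck implementation uses one group per communication step. The function names and the `per_launch_cost` parameter are hypothetical, and no measured values are implied.

```python
def group_launches(num_ranks, algorithm):
    """Number of groups (one kernel launch each, by assumption) per all-to-all."""
    if algorithm == "default":
        return 1
    if algorithm == "bruck":
        return (num_ranks - 1).bit_length()  # ceil(log2(num_ranks)) steps
    raise ValueError(f"unknown algorithm: {algorithm}")


def launch_overhead(num_ranks, algorithm, per_launch_cost):
    """Total launch overhead in the same unit as per_launch_cost."""
    return group_launches(num_ranks, algorithm) * per_launch_cost


if __name__ == "__main__":
    # GPU counts used in the experiments reported above.
    for gpus in (64, 128, 256, 512):
        print(gpus, group_launches(gpus, "default"), group_launches(gpus, "bruck"))
```

For the 64, 128, 256, and 512 GPU runs, this model gives 6, 7, 8, and 9 launches for Bruck against a constant single launch for the default implementation. For small messages, where the transfer itself is short, that multiplier applies to the dominant cost.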
Furthermore, for very small message sizes, our experimental results suggest that it is faster to copy device (GPU) memory into host (CPU) memory before using MPI to perform data exchanges. Message sizes are an example of a heuristic that can help communication libraries determine at runtime which API and collective implementation will perform best for the given scenario.
We contacted an NVIDIA employee who works on distributed multi-GPU applications to discuss our findings. They explained that NCCL may optimize collectives internally and that the process is not transparent to end-users. The conversation reiterated the importance of using ncclGroups as well as the fact that for such small message sizes, using MPI tends to be faster than using NCCL directly. The reason for this is that GPU kernel launches take a relatively long time compared to data transfer and MPI communication. Our primary takeaways from the conversation were that our results appear reasonable and that a new all-to-all implementation would generally have to be implemented within NCCL itself to be competitive with the public-facing API.

CONCLUSION
This work presents the first experimental study to assess the performance of the Bruck algorithm for uniform all-to-all communication in multi-GPU settings using NVIDIA's NCCL library. We described how the implementation of the Bruck algorithm in NCCL leverages ncclGroups, a mechanism that allows multiple communication primitives (i.e., sends and receives) to be aggregated, optimized, and executed asynchronously by the NCCL runtime. We presented an experimental study that also includes multi-CPU (MPI) all-to-all collective operations to understand when it is ideal to rely on multi-GPU RDMA collective communication vs. offloading data to the CPU and performing MPI collectives. Our experiments show that the Bruck algorithm for all-to-all communication does not outperform the default NCCL all-to-all implementation. This is in clear contrast to Bruck's multi-CPU performance, which outperforms its point-to-point and spread-out alternatives for the same message sizes. This discrepancy is explained by the fact that the Bruck algorithm requires multiple phases of data exchanges that need to be executed in a strict sequential order. To enforce strict communication ordering using NCCL, we had to use a separate ncclGroup for each communication phase, which introduces significant overhead because each ncclGroup launches a separate GPU kernel.
The insights from this study are important for understanding how collective operations perform in multi-GPU settings and will help the community set proper heuristics in future implementations to determine the best API and algorithm to use for a given communication workload and scale.
While MPI can meet the communication needs of GPU-based nodes using features like CUDA-aware MPI, specialized Remote Direct Memory Access (RDMA) communication libraries like the open source NVIDIA Collective Communications Library (NCCL) [24] have been on the rise. NCCL facilitates optimized data communication and synchronization among multiple remote NVIDIA GPUs, making it an attractive choice for researchers, data scientists, and engineers seeking to accelerate their applications by leveraging the immense parallel processing capabilities of GPUs. Libraries like NCCL enable high-performance inter-GPU communication by reducing the overhead incurred by unnecessary CPU/GPU data transfers. This is especially true as message sizes increase, which is the use case where GPU-to-GPU RDMA communication performs best. Because NCCL is open source, AMD and Microsoft have each implemented their own NCCL-based multi-GPU communication libraries, called RCCL [3] and MSCCL [18], respectively.

The contributions of this work are the following:
(1) Development of an open-source implementation of the Bruck algorithm using the NCCL framework, with reproducible performance tests using the nccl-tests benchmark suite.
(2) A comparative study of the two main algorithms used in multi-CPU collectives, spread-out and Bruck, for small message sizes.
(3) A description of an NCCL-based Bruck implementation, with scaling studies that compare it against the default NCCL implementation and MPI multi-CPU implementations.
(4) A discussion of the challenges and benefits of using the public-facing NCCL APIs to develop optimized communication algorithms.

Figure 1: A demonstration of the spread-out algorithm. This algorithm performs a linear number of communication steps wherein processes send one data block directly to a target destination process per communication round, following a round-robin sequence. The send-block for each communication round is indicated by a thick cell border.

Figure 2: A demonstration of the basic Bruck algorithm. P0 through P3 are processes, each with its own send and receive buffers. The arrows represent information being passed (via send and receive operations), and thick cell borders indicate the send-blocks for that communication round.

Figure 3: A demonstration of how the modified Bruck algorithm omits the inverse rotation step but achieves the same final result. Avoiding the final rotation is possible due to slight tweaks in how data is copied during previous steps.