HongTu: Scalable Full-Graph GNN Training on Multiple GPUs

Full-graph training on graph neural networks (GNNs) has emerged as a promising training method for its effectiveness. Full-graph training requires extensive memory and computation resources. To accelerate this training process, researchers have proposed employing multi-GPU processing. However, the scalability of existing frameworks is limited as they necessitate maintaining the training data for every layer in GPU memory. To efficiently train on large graphs, we present HongTu, a scalable full-graph GNN training system running on GPU-accelerated platforms. HongTu stores vertex data in CPU memory and offloads training to GPUs. HongTu employs a memory-efficient full-graph training framework that reduces runtime memory consumption by using partition-based training and recomputation-caching-hybrid intermediate data management. To address the issue of increased host-GPU communication caused by duplicated neighbor access among partitions, HongTu employs a deduplicated communication framework that converts the redundant host-GPU communication to efficient inter/intra-GPU data access. Further, HongTu uses a cost model-guided graph reorganization method to minimize communication overhead. Experimental results on a 4×A100 GPU server show that HongTu effectively supports billion-scale full-graph GNN training while reducing host-GPU data communication by 25%-71%. Compared to the full-graph GNN system DistGNN running on 16 CPU nodes, HongTu achieves speedups ranging from 7.8× to 20.2×. For small graphs where the training data fits into the GPUs, HongTu achieves performance comparable to existing GPU-based GNN systems.

Recently, full-graph GNN training, which trains on the entire graph, has emerged as a promising GNN training method for its effectiveness brought by the full-neighbor aggregation semantics [...] because the graph propagation computation creates cross-partition data dependencies. To compute the gradients of a partitioned subgraph, gradients of all its dependent partitions from downstream layers must also be computed [56], which in turn requires storing intermediate data for these partitions. As a result, although partition-based processing enables loading the vertex data of a small partitioned slice at a time, a significant amount of GPU memory still needs to be reserved for maintaining the intermediate data.
Secondly, partition-based processing leads to increased host-GPU communication. In GNN training, graph propagation involves aggregating features from neighboring vertices. This requires loading the data of the entire neighbor set onto GPUs when processing each partitioned subgraph. However, when a large graph is split into multiple partitions, vertices with multiple outgoing edges may be replicated to multiple partitions as neighbors. As a result, it is necessary to transfer them multiple times during training, which increases host-GPU communication. Moreover, since high-dimensional vertex attributes can consume a substantial amount of memory (as shown in Figure 1), it is not feasible to store the frequently accessed vertex data entirely in GPU memory, as is done in GPU-accelerated graph analytical frameworks [34,47,63].
We present HongTu, a GPU-accelerated full-graph GNN training system that addresses the challenges of traditional partition-based processing through two critical functions. Firstly, HongTu employs a memory-efficient GNN training framework that reduces runtime memory consumption of both vertex and intermediate data. This framework integrates a GNN-friendly partition method and a cost-effective recomputation-caching-hybrid intermediate data management method. The latter is inspired by the recomputation-based DNN training method, which avoids storing intermediate data by releasing it in the forward pass and recomputing it in the backward pass [5]. Based on the original method, our recomputation-caching-hybrid method further combines GPU-based recomputation and CPU-based data caching to reduce the recomputation overhead in GNNs. Secondly, HongTu employs a deduplicated communication framework that reduces host-GPU communication for duplicated neighbor access among partitions. We observe that duplicated neighbor accesses between sequentially and concurrently scheduled subgraphs can be efficiently handled through a single host-GPU communication and multiple inter/intra-GPU data accesses, rather than communicating them individually between CPU and GPUs. We leverage this observation to develop a communication deduplication method, and we also propose a subgraph reorganization method that enhances the effect of communication deduplication to improve performance.
In summary, we make the following contributions.
• We propose a memory-efficient GNN training framework that reduces runtime memory consumption by integrating a partition-based GNN training method and a recomputation-caching-hybrid intermediate data management method.
• We propose a deduplicated communication framework that reduces host-GPU data communication by optimizing the duplicated neighbor accesses between sequentially and concurrently scheduled subgraphs.
• We develop HongTu, a GPU-accelerated system for full-graph GNN training that overcomes the memory limitation of GPUs and integrates an efficient communication implementation to achieve high performance.
Experimental results on four NVIDIA A100 GPUs show that HongTu reduces host-GPU communication by 38%-78% and achieves 1.3×-3.4× performance improvement over the vanilla approach that transfers the entire neighbor set for each partition. When compared to DistGNN [32] running on 16 CPU nodes, HongTu achieves speedups ranging from 7.8× to 20.2×. Furthermore, for small graphs that can fit into GPUs, HongTu achieves performance comparable to existing multi-GPU systems.
The rest of the work is organized as follows. §2 describes the background and motivations. §3 gives an overview of HongTu. §4 describes the memory-efficient GNN training framework. §5 describes the deduplicated communication framework. §6 describes system implementation. §7 presents results. §8 concludes.

Multi-GPU Architecture
Modern GPUs are equipped with high bandwidth memory and massive streaming multiprocessors (SMs), making them suitable for memory- and computing-intensive applications. However, the limited device memory capacity, typically ranging from several to tens of gigabytes [38], poses a constraint on the size of data that can be efficiently processed. To address this issue, hardware researchers have developed fast interconnects to connect multiple GPUs, such as AMD Infinity Fabric [2] and NVIDIA NVLink [40]. Figure 1 provides an example of a 4×A100 GPU server, where the four GPUs are interconnected through 4×NVLink-3.0 with 200GB/s inter-GPU communication bandwidth, enabling low latency and high throughput inter-GPU data access. Every two GPUs are connected to a single CPU via a PCIe 4.0 interconnect. Although CPUs are generally equipped with hundreds to thousands of gigabytes of host memory, the slow CPU-GPU communication bandwidth (up to 32GB/s in PCIe 4.0) often creates a performance bottleneck for GPU access to CPU memory. Moreover, the two CPUs are linked through a QPI bus, forming a two-socket Non-Uniform Memory Access (NUMA) architecture, where GPUs accessing remote CPU memory via QPI experience slower speeds than those accessing local CPU memory. Therefore, building high-performance multi-GPU applications requires careful optimization of heterogeneous communication, especially in reducing CPU-GPU data transfer.
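To make the bandwidth gap concrete, the following back-of-the-envelope sketch estimates how long one dense feature matrix takes to move over each link. It uses the peak bandwidths quoted above; the workload size (100 million vertices, 128-dimensional float32 features) and the function name `transfer_seconds` are illustrative assumptions, and sustained throughput on real hardware is lower than peak.

```python
# Peak bandwidths quoted in the text; sustained throughput is lower in practice.
PCIE4_GBPS = 32.0     # host-GPU link, PCIe 4.0
NVLINK_GBPS = 200.0   # inter-GPU link, NVLink 3.0

def transfer_seconds(num_vertices, feature_dim, bandwidth_gbps, bytes_per_value=4):
    """Time to move a dense num_vertices x feature_dim matrix at the given bandwidth."""
    total_bytes = num_vertices * feature_dim * bytes_per_value
    return total_bytes / (bandwidth_gbps * 1e9)

# Hypothetical workload: 100 million vertices with 128-dimensional float32 features.
n, d = 100_000_000, 128
host_time = transfer_seconds(n, d, PCIE4_GBPS)
peer_time = transfer_seconds(n, d, NVLINK_GBPS)
print(f"PCIe 4.0: {host_time:.2f} s, NVLink: {peer_time:.3f} s, "
      f"ratio: {host_time / peer_time:.2f}x")
```

Under these assumptions, every byte that can be served from a peer GPU instead of host memory is roughly six times cheaper to move, which is the gap the rest of the design targets.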

GNN Basis
A GNN takes a graph and the vertex-associated property (feature) of all vertices as input and learns a representation vector for each vertex by stacking multiple GNN layers. In each layer, GNN models generally follow an aggregate-update computation pattern:

$$a_v^{l} = \mathrm{AGGREGATE}\left(\{h_u^{l-1} \mid u \in N(v)\}\right), \qquad h_v^{l} = \mathrm{UPDATE}\left(a_v^{l}, h_v^{l-1}\right) \tag{1}$$

$h_v^{l}$ is the representation of $v$ in the $l$-th layer and $h_v^{0}$ is the input vertex feature. The AGGREGATE function collects the $(l-1)$-th layer representations of $v$'s in-neighbors, i.e., $\{h_u^{l-1} \mid u \in N(v)\}$, to compute the neighbor representation $a_v^{l}$ of $v$. The UPDATE function utilizes the neighbor aggregation result and $v$'s representation in the $(l-1)$-th layer to calculate $v$'s vertex representation in the $l$-th layer. Both the aggregate and update functions can be neural networks, which are updated during training.

To illustrate, we present two examples: the graph convolutional network (GCN) [21] and the graph attention network (GAT) [52]. GCN is a simple yet effective model that has neural network computation on vertices.

$$a_v^{l} = \sum_{u \in N(v)} d_{uv}\, h_u^{l-1}, \qquad h_v^{l} = \sigma\left(a_v^{l} \times W^{l}\right) \tag{2}$$

The aggregate function is a simple weighted neighbor convolution, where $d_{uv}$ is the normalized edge weight of edge $\langle u, v \rangle$. The update function involves a linear transformation and a non-linear activation function (e.g., ReLU).

Table 2: Existing full-graph GNN systems and their major contributions.
Platform | System | Major contribution
[...] | [...] | Hardware-assisted node property aggregation
GPU | DGL [8] | Single-GPU system
GPU | PyG [10] | Single-GPU system
GPU | CAGNET [51] | 1.5D/2D/3D graph partitioning
GPU | DGCL [4] | Cost-based communication routine
GPU | PipeGCN [53] | Staleness-communication & pipelining
GPU | Sancus [42] | Staleness-communication
GPU | NeuGraph [30] | SAGA abstraction & partition-based training
GPU | NeutronStar [30] | Hybrid dependency management & partition-based training
GPU | ROC [18] | Learned graph partitioning & cost-based intermediate data management

GAT introduces a self-attention mechanism, which assigns different attention parameters on edges to distinguish which neighbors are more important.
The aggregate function first concatenates the parameterized representations of source $u$ and destination $v$, and applies a learnable attention vector to compute the edge-wise attention coefficient. Then, it feeds the attention coefficients to a LeakyReLU activation (i.e., $\sigma$) and uses a neighbor-oriented softmax function (i.e.,
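The aggregate-update pattern for GCN described in this section can be sketched in a few lines of plain Python. This is a minimal illustration, not HongTu's implementation: the toy graph, the mean-style edge normalization, and the function names `aggregate` and `update` are all assumptions made for the example.

```python
def aggregate(in_neighbors, weights, h_prev):
    """Weighted neighbor sum: a_v = sum over in-neighbors u of weight(u, v) * h_u."""
    dim = len(h_prev[0])
    a = []
    for v, nbrs in enumerate(in_neighbors):
        acc = [0.0] * dim
        for u in nbrs:
            for k in range(dim):
                acc[k] += weights[(u, v)] * h_prev[u][k]
        a.append(acc)
    return a

def update(a, W):
    """Linear transformation followed by ReLU: h_v = ReLU(a_v x W)."""
    out_dim = len(W[0])
    return [[max(0.0, sum(row[k] * W[k][j] for k in range(len(row))))
             for j in range(out_dim)] for row in a]

# Toy graph with 3 vertices; in_neighbors[v] lists the sources of v's in-edges.
in_neighbors = [[1, 2], [0], [0, 1]]
# Assumed normalization: each in-edge of v gets weight 1 / in-degree(v).
weights = {(u, v): 1.0 / len(nbrs)
           for v, nbrs in enumerate(in_neighbors) for u in nbrs}
h0 = [[1.0, 0.0], [0.0, 2.0], [4.0, -2.0]]   # input vertex features
W = [[1.0, -1.0], [0.5, 1.0]]                # trainable layer parameters

a1 = aggregate(in_neighbors, weights, h0)    # neighbor representations
h1 = update(a1, W)                           # layer-1 vertex representations
print(a1)
print(h1)
```

Note that `aggregate` touches the features of every in-neighbor, while `update` works on each vertex independently; this asymmetry is what makes the aggregation step the source of irregular data access.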

Full-Graph GNN Training
Full-graph GNN training uses the full-neighbor aggregation semantic and the global gradient descent algorithm. It runs epochs repeatedly on the entire graph until reaching the target accuracy or epoch. Each training epoch consists of a forward and a backward pass, followed by a parameter update, which uses the gradients computed in the backward pass to update the trainable parameters in every layer. In the forward pass, vertex representations are computed layer-by-layer using the AGGREGATE and UPDATE operations presented in Section 2.2. At every layer, each vertex aggregates the representations of the incoming neighbors and calculates the vertex representation by applying the learnable model parameters. The final layer's vertex representations are then sent to the downstream task where the loss value is calculated based on the ground truth labels. In the backward pass, GNN's computation starts from the last layer and proceeds back to the first layer, calculating the gradient of the loss with respect to the model across all layers. In each layer, the gradients of vertex representations are computed using the chain rule, facilitating both intra-layer model gradient calculation and cross-layer gradient transmission [50,56].

GNN models are distinct from traditional DNN models because the link relationship between vertices creates complex and non-uniform data dependencies. In a CNN, the convolution kernels are fixed and treat all pixels in the same way. However, in a GCN, the AGGREGATE and its backward operation handle data dependencies by gathering data along edges. This not only entails random data accesses but also introduces complexities in workload partition due to its irregular nature. Generally, achieving efficient vertex data access necessitates accommodating all vertex data within GPU memory.

In GNN training, intermediate data generated in the forward pass needs to be reserved for gradient computation in the backward pass. For example, the update function in GCN [21] involves linear+ReLU computations in the forward pass, i.e., $h = \sigma(a \times W)$. Its backward pass computes the gradients of parameter $W$ using the formula $\nabla W = (a)^{T} \times \left(\sigma^{-1}(a \times W) * \nabla h\right)$, where $*$ is the element-wise product. Here, $\sigma^{-1}$ is the derivative function of $\sigma$, which returns 1 for positive inputs and 0 otherwise, and $(\cdot)^{T}$ represents the transpose operation. $a \times W$ is the intermediate data that needs to be reserved for gradient computation. While the effectiveness and high accuracy of full-graph GNN training have been widely demonstrated by academic studies [18,21,30,50], its practical application in industry is limited due to the significant memory requirements for maintaining large-scale vertex and intermediate data.
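The gradient formula for the GCN update function can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: the loss is taken to be the sum of all output entries (so the incoming gradient is all ones), the product between the ReLU derivative and the incoming gradient is taken element-wise, and all helper names are hypothetical.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def relu(Z):
    return [[max(0.0, z) for z in row] for row in Z]

def relu_derivative(Z):
    # Returns 1 for positive inputs and 0 otherwise, as described in the text.
    return [[1.0 if z > 0 else 0.0 for z in row] for row in Z]

def grad_W(a, W, grad_h):
    """grad W = a^T x (relu'(a x W) * grad_h), with * taken element-wise."""
    Z = matmul(a, W)                       # intermediate data a x W
    D = relu_derivative(Z)
    masked = [[D[i][j] * grad_h[i][j] for j in range(len(D[0]))]
              for i in range(len(D))]
    return matmul(transpose(a), masked)

a = [[2.0, 0.0], [1.0, 0.0], [0.5, 1.0]]       # neighbor representations
W = [[1.0, -1.0], [0.5, 1.0]]                  # layer parameters
grad_h = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # d(loss)/dh for loss = sum(h)

analytic = grad_W(a, W, grad_h)

def loss(W_):
    return sum(sum(row) for row in relu(matmul(a, W_)))

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric = [[0.0, 0.0], [0.0, 0.0]]
for i in range(2):
    for j in range(2):
        Wp = [row[:] for row in W]
        Wm = [row[:] for row in W]
        Wp[i][j] += eps
        Wm[i][j] -= eps
        numeric[i][j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(analytic)
print(numeric)
```

The check also shows why the intermediate data matters: `grad_W` cannot be evaluated without `a x W`, so that product must either be kept from the forward pass or regenerated.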

Existing Systems and Limitations
Table 2 summarizes existing full-graph GNN systems and their major contributions. Early systems, such as DGL [55] and PyG [10], use full-graph training on a single GPU, and thus their efficiency and scalability are constrained by the limited GPU resource. To meet the high computation and memory requirements of full-graph GNN training, distributed and multi-GPU-based systems have been proposed. CAGNET [51], DGCL [4], PipeGCN [53], and Sancus [42] are four multi-GPU GNN systems that maintain both vertex and intermediate data in GPU memory. In these systems, inter-GPU communication emerges as a performance bottleneck [4,42]. CAGNET [51] proposes 1.5D, 2D, and 3D graph partitioning to optimize the data distribution among GPUs. DGCL [4] analyses the speeds of heterogeneous communication among devices and proposes an automatic routine algorithm to improve communication efficiency. PipeGCN [53] and Sancus [42] investigate staleness-communication in GNN training, which reduces communication times while sustaining a reasonable level of accuracy. Despite the high performance of these frameworks, they can hardly scale to large input graphs. As illustrated in Table 1, accommodating the data for training the ogbn-paper graph needs at least 77 NVIDIA A100 GPUs (80GB), which is expensive and requires sophisticated design for managing communication and fault-tolerance on a distributed GPU cluster. Moreover, the relatively slow inter-node communication can also become a critical performance bottleneck [4].
Recently, some research works [18,30,56] have tried to relax the memory constraint of GPUs by partially loading vertex or intermediate data during training. As illustrated in Table 2, NeuGraph [30] and NeutronStar [56] employ 2-D graph partitioning to split a large graph into multiple chunks, where each chunk contains a specific range of destination and source vertices (as shown in Figure 2). During training, the two frameworks store the vertex data in the CPU and sequentially load the vertex data of partitioned chunks to the GPU for training. On the other hand, ROC [18] utilizes CPU memory to manage the intermediate data. ROC includes a cost model to represent the host-GPU data transfer overhead, and utilizes dynamic programming to find the optimal communication plan. By doing so, ROC allows the GPU to store only part of the intermediate data.
In addition to GPU-based systems, researchers have also explored distributed CPU-based systems [13,32] to leverage the large memory capacity of CPU platforms. However, these systems generally exhibit inferior performance when compared to GPU-based solutions, and the monetary cost of using high-end CPU clusters is also high. Therefore, building a CPU-GPU heterogeneous system that fully utilizes the memory and computation resources of a single-node-multi-GPU architecture becomes a cost-effective option.
However, we observe that the existing out-of-GPU-memory processing systems still face two limitations that hinder their effectiveness and efficiency in handling large-scale GNN training.
Limitation 1: Existing systems still suffer from the high memory consumption of either vertex or intermediate data. While NeuGraph [30] and NeutronStar [56] decrease the memory consumption of vertex data, they still require intermediate data to be stored entirely in the GPU. Conversely, ROC reduces the memory consumption of intermediate data, but still necessitates complete storage of vertex data in the GPU. More importantly, several critical limitations hinder the direct combination of these memory reduction methods. Firstly, the 2D partitioning in NeuGraph and NeutronStar separates a vertex's neighbors into multiple slices, making it challenging to implement full-neighbor aggregation for complex GNNs like the GAT [52] model, which involves a softmax() computation on the entire neighbor set. In these workloads, loading all neighbor-containing partitions is still necessary. This renders existing systems ineffective at training large-scale GAT-like models. Secondly, ROC's caching-based method is inefficient on complex GNNs with large-scale intermediate data [25,26,52], as swapping large-scale intermediate data significantly increases host-GPU communication. Moreover, since the intermediate data are swapped at a whole-graph granularity, ROC's approach may fail if a single intermediate tensor is excessively large.
[...] and NeutronStar [30,56] use host-side filters to remove unnecessary data before communicating a partition, but still need to transfer the neighbor data entirely for each of them.

THE HongTu FRAMEWORK
We present HongTu, a GPU-accelerated full-graph GNN system that addresses the limitations outlined in Section 2.4 through two critical system components. First, HongTu provides a memory-efficient training framework that reduces the memory consumption of both vertex data and intermediate data. Second, HongTu provides a deduplicated communication framework that effectively reduces host-GPU communication for duplicated neighbor access among subgraphs. Figure 3 provides an architectural overview of HongTu.
Memory-efficient GNN training framework. HongTu adopts a graph partitioning method that groups edges incident on the same destination into a single chunk. This design facilitates full neighbor aggregation on each chunk individually, enabling HongTu to support complex GNNs (such as GAT [52] and GGCN [25]) efficiently while reducing memory usage. Moreover, HongTu extends the recomputation-based DNN training method [5] to GNN training, which avoids storing intermediate data by recomputing it in the backward pass. Taking advantage of the fact that some graph operations involve only simple edge computation and do not generate intermediate data, we combine GPU-based recomputation with CPU-based data caching to reduce the additional processing overhead.

Deduplicated communication framework. We observe that the duplicated data access between concurrently scheduled subgraphs on multiple GPUs and the duplicated data access between sequentially scheduled subgraphs on the same GPU can benefit from inter-GPU and intra-GPU data communication, both of which have higher speeds compared to PCIe-based host-GPU communication. We propose a deduplicated communication method that transfers the data of each duplicated neighbor only once between CPU and GPU, and converts redundant host-GPU communication into more efficient inter-GPU communication or intra-GPU data reuse. Moreover, considering the impact of vertex distribution on communication deduplication effectiveness, HongTu incorporates a cost-model guided subgraph reorganization method to minimize communication overhead.

Edge-Cut 2-Level Graph Partitioning
HongTu employs an edge-cut partitioning to split the graph into small execution units suitable for processing by a single GPU, as shown in Figure 5 (a). Initially, the input graph is split into $m$ (the number of GPUs) partitions through Metis partitioning [20] to improve load balancing and group closely linked vertices into one partition. Each partition is subsequently divided into computation-balanced chunks through range-based partitioning [65], with each chunk containing a unique set of destination vertices and their associated edges. This partitioning method enables the full-neighbor aggregation to be implemented on each chunk individually. It is worth mentioning that only the in-edges of destinations need to be grouped, as the complex aggregations are executed only in the forward pass. In backward propagation, source vertices accumulate the gradient along the out-edges through summation. Leveraging the associativity of the sum operation, multiple source replicas in different chunks can independently calculate gradients and subsequently aggregate them.
During GNN training, partitioned subgraphs are scheduled in a fixed order as shown in Figure 5 (b). Chunks belonging to the same partition are sequentially scheduled on one GPU, and chunks with the same local position from different partitions are scheduled concurrently on different GPUs. For brevity, we use the term batch to refer to a group of concurrently scheduled chunks from different partitions.
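The two-level scheme and the batch schedule can be sketched as follows. This is a simplified illustration rather than HongTu's code: the first-level partitions are supplied by hand as a stand-in for Metis output, chunks are balanced by in-edge count as one plausible reading of "computation-balanced", and the names `range_chunks` and `two_level_partition` are hypothetical.

```python
def range_chunks(vertices, in_edges, num_chunks):
    """Split an ordered vertex list into contiguous chunks with similar in-edge counts."""
    total = sum(len(in_edges[v]) for v in vertices)
    target = total / num_chunks
    chunks, current, load = [], [], 0
    for v in vertices:
        current.append(v)
        load += len(in_edges[v])
        # Close the chunk once the running load reaches the next balance point.
        if load >= target * (len(chunks) + 1) and len(chunks) < num_chunks - 1:
            chunks.append(current)
            current = []
    chunks.append(current)
    while len(chunks) < num_chunks:      # pad with empty chunks if needed
        chunks.append([])
    return chunks

def two_level_partition(partitions, in_edges, num_chunks):
    """Each chunk keeps its destinations together with ALL of their in-edges."""
    subgraphs = {}
    for i, part in enumerate(partitions):
        for j, chunk in enumerate(range_chunks(part, in_edges, num_chunks)):
            edges = [(u, v) for v in chunk for u in in_edges[v]]
            subgraphs[(i, j)] = {"dst": chunk, "edges": edges,
                                 "neighbors": {u for u, _ in edges}}
    return subgraphs

# Toy graph: in_edges[v] lists the sources of v's incoming edges.
in_edges = {0: [1, 4], 1: [0, 2], 2: [1, 3, 5], 3: [2],
            4: [0, 5], 5: [4, 6], 6: [5, 7, 2], 7: [6]}
partitions = [[0, 1, 2, 3], [4, 5, 6, 7]]   # assumed first-level (Metis-like) result
num_chunks = 2

subgraphs = two_level_partition(partitions, in_edges, num_chunks)
# Batch j groups chunk j of every partition; partition i runs on GPU i.
batches = [[(i, j) for i in range(len(partitions))] for j in range(num_chunks)]
for j, batch in enumerate(batches):
    for key in batch:
        print(j, key, subgraphs[key]["dst"], sorted(subgraphs[key]["neighbors"]))
```

In this toy run, several source vertices (for example vertex 2) appear in the neighbor sets of more than one chunk, which is exactly the replication that the deduplicated communication framework later targets.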

Recomputation-Caching-Hybrid Intermediate Data Management
To reduce the memory consumption of DNN training, researchers have proposed a recomputation-based strategy that eliminates the need to store intermediate data for every layer by recalculating an additional forward pass in the backward computation [5]. However, this method is designed for DNN training and assumes the training data of all layers can be entirely stored in GPU memory as the checkpoint. This makes it unsuitable for full-graph GNN training, where the training data of the entire graph can occupy a significant amount of memory. We generalize the recomputation-based approach to the CPU-GPU heterogeneous platform. Figure 4 (b) shows a graphical illustration of the recomputation-based method on a single layer of GCN training, where the AGGREGATE operation is the neighbor convolution and the UPDATE operation is the Linear+ReLU calculation. In the forward computation of each GNN layer, HongTu copies the output representations to CPU memory as a checkpoint and releases the intermediate data to make room for training the next batch.
In the backward pass, HongTu loads the checkpoint from CPU, recomputes the forward pass, and computes the gradients based on the regenerated intermediate data. This method allows HongTu to store the training data of only one layer, thereby reducing the overall GPU memory consumption. Importantly, the recomputation-based approach maintains the accuracy of the original training method [5] as shown in Figure 4 (a), because the regenerated intermediate data are identical to that produced in the forward computation.
Recomputation-based training reduces memory consumption but entails an additional forward pass. However, not all recomputation is necessary. In the case of GNNs with simple arithmetic edge computation, where the AGGREGATE operation does not yield intermediate results required for gradient computation, caching the output of the AGGREGATE operation in CPU can eliminate the need for recomputation. For instance, the GCN model [21] in Equation 2 employs a weighted neighbor summation as the AGGREGATE operation. Recomputing the AGGREGATE requires loading representations of all neighbors from the CPU and redoing neighbor convolution on GPUs, resulting in CPU-GPU communication proportional to the total size of the (replicated) neighbor sets and $O(|E|)$ GPU computation. Alternatively, caching the output neighbor representations of AGGREGATE in CPU memory and transferring them back when needed achieves the same functionality with only $O(|V|)$ host-GPU communication. Based on this observation, we propose a recomputation-caching-hybrid method shown in Figure 4 (c). In the forward pass, HongTu caches the neighbor representation ($a^{l}$) in the CPU as the recomputation checkpoint. In the backward pass, HongTu skips the AGGREGATE step, loads the cached neighbor representations from the CPU, and recomputes only the UPDATE stage. This hybrid design can benefit a broad range of popularly used GNNs, such as GCN [21], GraphSAGE [14], GIN [59], and CommNet [49]. However, for GNNs with neural network computation on edges (e.g., GAT [52] and GGCN [25]), the overhead of caching the $O(|E|)$ intermediate data can be higher than that of recomputation. In such cases, HongTu falls back to the recomputation-based method as depicted in Figure 4 (b).
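The hybrid strategy for a GCN-style layer can be illustrated with the following sketch. It is a simplified model under stated assumptions: a Python dictionary (`cpu_cache`) stands in for host memory, a chunk is identified by a hypothetical `(partition, chunk)` key, and the UPDATE step is Linear+ReLU. It is not HongTu's implementation.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

cpu_cache = {}   # stands in for CPU (host) memory

def forward_chunk(chunk_id, a, W):
    """Forward UPDATE for one chunk; only the AGGREGATE output is checkpointed."""
    cpu_cache[chunk_id] = [row[:] for row in a]   # cache neighbor representation
    Z = matmul(a, W)                              # intermediate data, not kept
    return [[max(0.0, z) for z in row] for row in Z]

def backward_chunk(chunk_id, W, grad_h):
    """Backward: reload cached AGGREGATE output, recompute UPDATE, form gradients."""
    a = cpu_cache[chunk_id]                       # AGGREGATE is skipped entirely
    Z = matmul(a, W)                              # regenerated intermediate data
    masked = [[(1.0 if Z[i][j] > 0 else 0.0) * grad_h[i][j]
               for j in range(len(Z[0]))] for i in range(len(Z))]
    grad_W = matmul(transpose(a), masked)         # gradient of the layer parameters
    grad_a = matmul(masked, transpose(W))         # gradient flowing back to AGGREGATE
    return grad_W, grad_a

a = [[2.0, 0.0], [1.0, 0.0], [0.5, 1.0]]          # AGGREGATE output for one chunk
W = [[1.0, -1.0], [0.5, 1.0]]
h = forward_chunk((0, 0), a, W)
grad_W, grad_a = backward_chunk((0, 0), W, [[1.0, 1.0]] * 3)
print(h, grad_W, grad_a)
```

The only data that crosses the host-GPU boundary per chunk here is one row per destination vertex, which is where the vertex-proportional communication cost of the caching path comes from.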

Overall Execution Flow in HongTu
Algorithm 1 outlines the overall execution flow. To begin with, HongTu partitions the graph with 2-level partitioning (line 1). Each subgraph, represented by $G_{ij}$, consists of a set of disjointly split vertices $V_{ij}$ and their incoming edges $E_{ij}$, where $i$ is the partition id and $j$ is the chunk id. After graph partitioning, the communication deduplication module reorganizes the partitioned subgraphs, deduplicates the neighbor accesses, and generates the new partitions $\{\mathcal{G}_{ij} \mid 0 \le i < m, 0 \le j < n\}$ for parallel training (line 2). Since the initial partitioned graph $G_{ij}$ is no longer used in the subsequent computation, we use $V_{ij}$ and $E_{ij}$ to represent the vertices and edges of the new subgraph $\mathcal{G}_{ij}$ and denote its in-neighbor set by $N_{ij}$. After preprocessing, HongTu initializes the vertex representation buffer $h^{l}$ and gradient buffer $\nabla h^{l}$ in the CPU memory (line 3).
In the training process, batches are scheduled sequentially, and subgraphs in each batch are processed in parallel. In the forward pass of each batch, neighbor representations of all subgraphs, i.e., $\{h^{l}_{N_{ij}} \mid 0 \le i < m\}$, are first loaded from CPU to GPUs through the deduplicated communication framework (line 6), which will be discussed in Section 5. Following this, each GPU performs the forward computation (lines 7-8), copies the newly computed vertex representation $h^{l+1}_{V_{ij}}$ to CPU (line 9), and releases intermediate data to make room for training the next batch. After completing the forward pass, the downstream task takes the final layer output $h^{L}$ as input, and computes the loss and the gradient of the loss with respect to the final layer representation, i.e., $\nabla h^{L}$ (lines 10-11). In the backward pass, computation is scheduled from the last layer to the first layer (line 12). In each batch, HongTu reloads the checkpoint to GPUs (line 14), loads the gradients of destinations from CPU memory (line 16), recomputes the forward pass of the current layer (line 17), and computes the gradients (line [...])

Algorithm 1: Workflow of HongTu for a single epoch.

[...] 3, every doubling of the number of partitions results in a 47%, 35%, and 32% reduction in the memory consumption of vertex data for the three graphs, respectively. The memory consumption of intermediate data varies, depending on the GNN model in use. Some models are dominated by the number of vertices [14,21,49,59], while others are dominated by the number of edges [25,26,52], and both decrease linearly as $m \times n$ increases. In practical training, memory consumption can be adjusted by tuning the number of partitions to adapt to different GPUs.

DEDUPLICATED COMMUNICATION FRAMEWORK
In this section, we present the design and implementation of the deduplicated communication framework.

Basic Design
Inter-GPU duplicated neighbor access. Duplicated neighbors between concurrently-scheduled subgraphs cause the same vertex to be transferred to multiple GPUs. As indicated by the red dashed box in Figure 6 (a), vertices 0, 1, 3, and 4 are transferred to multiple GPUs in batch 0. The data of these vertices are redundantly communicated between CPU and GPUs. Instead, we can transfer the duplicated vertex to one GPU and handle the access requests from other GPUs through inter-GPU communication. Benefiting from the high communication bandwidth between GPUs (as described in Section 2), converting host-GPU communication to inter-GPU communication can significantly improve performance.
Intra-GPU duplicated neighbor access. Duplicated neighbors between sequentially-scheduled subgraphs cause the same vertex to be transferred multiple times to the same GPU. As indicated by the purple dotted boxes in Figure 6 (a), vertices 2 and 5 in GPU 1 and vertex 3 in GPU 3 are loaded in both batch 0 and batch 1. For those adjacently-scheduled subgraphs, neighbor access to duplicated neighbors from the successor subgraph can directly reuse the already transferred data in GPU, converting host-GPU communication to intra-GPU data access.
Communication Deduplication. We stack these two techniques to cooperatively reduce the communication for duplicated neighbor accesses, as illustrated in Figure 6 (b). Our method involves two steps. In the first step, it computes the union of neighbors in each batch (i.e., a group of concurrently scheduled subgraphs). This union is then deduplicated and stored in a transition vertex set denoted by $N^{\cup}_{j} = \cup_{i=0}^{m-1} N_{ij}$, where $0 \le j < n$ indicates the batch id. During computation, each vertex in the transition vertex set is transferred to exactly one GPU and shared among GPUs through inter-GPU communication.
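The effect of the two deduplication techniques on host-GPU traffic can be estimated with simple set operations. The sketch below counts transferred vertices only. It assumes that intra-GPU reuse is accounted at the granularity of whole batches (a vertex loaded for the previous batch is not loaded again), and that each transition vertex is owned by the lowest-numbered GPU that needs it. Both are illustrative simplifications, and all function names are hypothetical.

```python
def host_transfers_vanilla(batches):
    """Baseline: every subgraph loads its full neighbor set from host memory."""
    return sum(len(nbrs) for batch in batches for nbrs in batch)

def host_transfers_dedup(batches):
    """Load each transition vertex once per batch; reuse data from the previous batch."""
    total, previous = 0, set()
    for batch in batches:
        transition = set().union(*batch)       # deduplicated union of neighbor sets
        total += len(transition - previous)    # only vertices not already on the GPUs
        previous = transition
    return total

def assign_owners(batch):
    """Assumed policy: the lowest-numbered GPU that needs a vertex receives it."""
    owner = {}
    for gpu, nbrs in enumerate(batch):
        for v in sorted(nbrs):
            owner.setdefault(v, gpu)
    return owner

# batches[j][i] is the neighbor set of the chunk run on GPU i in batch j.
batches = [
    [{0, 1, 2, 4}, {0, 4, 5, 6}],   # batch 0
    [{1, 2, 3, 5}, {2, 5, 6, 7}],   # batch 1
]

vanilla = host_transfers_vanilla(batches)
dedup = host_transfers_dedup(batches)
owners = assign_owners(batches[0])
print(vanilla, dedup, owners)
```

In this toy case the vertices 0 and 4, needed by both GPUs in batch 0, are fetched from the host once and then shared between GPUs, while batch 1 only fetches the two vertices that batch 0 did not already bring in.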

Workflow of Deduplicated Communication
HongTu uses a transition data buffer on each GPU to manage the data of the transition vertices held by that GPU, based on which we can decouple host-GPU communication and inter-GPU communication.

Algorithm 2: dedup_comm_fwd

[...] On the CPU side, the gradients of the transition vertices are accumulated to $\nabla h^{l}$ with CPUs (line 7). Since the gradient accumulation only involves simple arithmetic addition, utilizing CPUs is faster than copying the data to the GPU for computation, in which the data movement involves bidirectional host-GPU communication.

Cost-Effective Subgraph Reorganization
The effectiveness of communication deduplication is affected by the distribution of duplicated neighbors. To enhance communication efficiency, we quantify the deduplicated communication overhead and propose a subgraph reorganization method to minimize it. [...] where $\cup_{i=0}^{m-1} N_{ij}$ is the transition vertex set of batch $j$. By further applying intra-GPU communication deduplication, the duplicated transition vertices of each pair of adjacent subgraphs, i.e., $\cup_{i=0}^{m-1} N_{ij} \cap \cup_{i=0}^{m-1} N_{i,j-1}$, are no longer required to be transferred. Consequently, the host-GPU communication is further reduced to [...] Finally, the total communication overhead can be formalized as [...] where the three throughput parameters (the first denoted $T_h$) represent the throughput of host-GPU communication, inter-GPU communication, and intra-GPU data reuse, respectively. These parameters are environment-specific and depend on the GPU platform used.
We can observe that the communication cost $C$ is affected by the number of duplicated neighbors among subgraphs. Obtaining the minimal $C$ requires careful adjustments of vertex distribution in the partitions. However, optimizing this goal in the partitioning stage is challenging because it involves a vast search space at the vertex granularity and couples with several constraints, such as load balancing and communication reduction. To simplify this problem, we propose a subgraph granularity optimization approach. Specifically, given an initialized load-balancing optimized graph partition $\{G_{ij} \mid 0 \le i < m, 0 \le j < n\}$, the objective is to find a reorganized partition $\{\mathcal{G}_{ij} \mid 0 \le i < m, 0 \le j < n\}$ that minimizes the cost $C$ in Equation 4, where each $\mathcal{G}_{ij}$ is a subgraph from the initial partition, e.g., $G_{ik}$. This combinatorial optimization problem is NP-hard as it can be reduced to a variant of the classical NP-hard traveling salesman problem [19], which aims to find a Hamiltonian circuit in a weighted undirected complete graph that minimizes the total weight of the circuit. Therefore, it is infeasible to obtain an optimal solution in polynomial time using an exact algorithm. Next, we propose a 2-phase heuristic to reorganize the partition.
Partition reorganization. We propose a 2-phase, greedy-based heuristic that optimizes communication overhead by maximizing the number of inter- and intra-GPU duplicated neighbors. The goal is to fully leverage the effect of communication deduplication. Algorithm 4 outlines the workflow of our approach. In the first phase, we reorganize subgraphs within each partition to group subgraphs with the maximum number of duplicate neighbors into the same batch. The objective is to maximize the number of inter-GPU duplicated neighbors while preserving the locality achieved by the Metis graph partitioning. The algorithm initializes the intermediate partition and the transition vertex set $N^{\cup}_{j}$ for every batch with subgraphs in partition 0 (lines 1-2), and then reorganizes other partitions in turn. Specifically, it iterates over the transition vertex set of all batches (line 5) and retrieves for each batch the subgraph that has the maximum number of duplicate neighbors in the currently-processed partition (line 6), where $K$ records the subgraphs that have not been processed. The algorithm then writes the found subgraph to the corresponding batch in the intermediate partition (line 7), and updates $N^{\cup}_{j}$ and $K$ accordingly. In the second phase, we reorganize the partition at the batch granularity to maximize the number of intra-GPU duplicated transition vertices. The algorithm initializes $\mathcal{G}$ with batch 0 and records other batches in $K$ (lines 8-10). During execution, it iteratively searches in $K$ to find the batch that has the maximum number of duplicated transition vertices with the current batch (line 12), and writes the found batch to $\mathcal{G}$. After processing all batches, we obtain the communication-efficient reorganized partition $\mathcal{G}$.

[Algorithm 4: 2-phase subgraph reorganization]
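The two phases of the heuristic can be sketched in plain Python as follows. This is an illustrative reading of the description above, not the paper's listing: neighbor sets are given directly as Python sets, ties are broken in favor of the smallest index, and the names `reorganize`, `placement`, and `order` are hypothetical.

```python
def reorganize(neighbors):
    """neighbors[i][k] is the neighbor set of chunk k in partition i.

    Returns (placement, transition, order), where placement[i][j] is the chunk of
    partition i assigned to batch j, transition[j] is the transition vertex set of
    batch j, and order is the batch execution order chosen in phase 2.
    """
    m, n = len(neighbors), len(neighbors[0])

    # Phase 1: per partition, greedily match chunks to batches by shared neighbors.
    placement = [[None] * n for _ in range(m)]
    placement[0] = list(range(n))                  # batches start from partition 0
    transition = [set(neighbors[0][j]) for j in range(n)]
    for i in range(1, m):
        remaining = set(range(n))                  # chunks not yet processed
        for j in range(n):
            k = max(sorted(remaining),
                    key=lambda c: len(neighbors[i][c] & transition[j]))
            placement[i][j] = k
            transition[j] |= neighbors[i][k]
            remaining.remove(k)

    # Phase 2: order batches so that adjacent batches share many transition vertices.
    order, pending = [0], set(range(1, n))
    while pending:
        cur = order[-1]
        nxt = max(sorted(pending),
                  key=lambda b: len(transition[cur] & transition[b]))
        order.append(nxt)
        pending.remove(nxt)
    return placement, transition, order

# Toy input: 2 partitions (GPUs), 3 chunks each.
neighbors = [
    [{0, 1, 2}, {3, 4, 5}, {6, 7, 8}],
    [{6, 7, 9}, {0, 1, 9}, {3, 4, 8}],
]
placement, transition, order = reorganize(neighbors)
print(placement, order)
```

Both phases run in time polynomial in the number of chunks, trading optimality for a search that stays at subgraph granularity, as the text motivates.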
Effectiveness with various interconnects. The proposed deduplicated communication framework benefits GPU servers equipped with various interconnects. As discussed in Section 5.3, the vertex data to be transferred are divided into three subsets, which are handled by CPU-GPU communication, inter-GPU communication, and intra-GPU data reuse, respectively. Intra-GPU reuse consistently delivers benefits because its throughput is tied to the GPU memory bandwidth, which is often much higher than the host-GPU throughput, itself tied to the GPU-CPU interconnect bandwidth (typically PCIe). The effectiveness of inter-GPU data sharing depends on the bandwidth of the inter-GPU interconnect. With fast interconnects such as NVIDIA NVLink [40] and AMD Infinity Fabric [2], inter-GPU communication provides substantial performance improvements. Conversely, if the GPUs are interconnected via slow PCIe, so that the inter-GPU throughput equals the host-GPU throughput, inter-GPU communication brings no improvement. Nevertheless, employing the intra-GPU reuse optimization alone still yields considerable reductions in data transfer. As shown in Table 8, intra-GPU duplication accounts for 36%-84% of the total duplication volume.

IMPLEMENTATION
The use of deduplicated communication can significantly reduce the volume of host-GPU communication, but achieving high performance requires careful implementation, particularly for irregular memory access during communication. First, transferring the data of the transition vertex sets $\mathcal{N}_{ij}$ and the neighbor sets among the CPU and GPUs involves non-contiguous memory access. Conventional communication methods, such as NCCL [39] and cudaMemcpy, are unsuitable for this task because they are designed to operate on contiguous memory, and designing additional data compaction modules can increase the CPU overhead [30,35]. Second, switching the transition data buffer from $h_{\mathcal{N}_{i,j-1}}$ to $h_{\mathcal{N}_{ij}}$ requires retaining the data of the duplicated vertices ($\mathcal{N}_{ij} \cap \mathcal{N}_{i,j-1}$) and loading the data of the remaining vertices ($\mathcal{N}_{ij} \setminus \mathcal{N}_{i,j-1}$) from the CPU into the buffer. This process causes random memory manipulation on the two data buffers. To address these issues, HongTu provides a high-performance communication implementation with two features: communicate-on-demand and update-in-place.
On-demand communication. HongTu employs zero-copy memory access [34] and GPUDirect P2P access [60], which allow GPUs to directly access the memory of the CPU and of other GPUs within a CUDA kernel by mapping them into the same address space. Moreover, we implement the coalesced-and-aligned memory access optimization [34,35], which optimizes PCIe bandwidth utilization by having each warp of threads access the contiguous dimension of the data. In this way, irregular and non-contiguous data communication among the CPU and GPUs can be performed efficiently.
In-place transition data management. HongTu uses a single data buffer to maintain the transition vertex data for all subgraphs in a partition, together with one position index per subgraph that records the write positions of its transition vertices $\mathcal{N}_{ij}$ in the buffer. When a new batch is scheduled, the data of the newly scheduled transition vertices ($\mathcal{N}_{ij}$) are written to the buffer according to the indices. During preprocessing, we process the transition indices of all subgraphs so that the duplicated vertices of each pair of adjacently scheduled subgraphs have the same write positions. This allows the data of the duplicated vertices to be reused in place. The data of the remaining vertices, which are loaded from the CPU, are inserted into the buffer based on their write positions in the indices. Figure 7 (a) shows an example of data loading in host-to-GPU communication. The duplicated vertices between $\mathcal{N}_{i,j-1}$ and $\mathcal{N}_{ij}$, i.e., {2, 6, 7}, have the same positions in the transition data buffer. When updating $h_{\mathcal{N}_{i,j-1}}$ to $h_{\mathcal{N}_{ij}}$, the data of these vertices are reused in place. In contrast, the data loaded from the CPU, {3, 5, 8}, are inserted into the positions of the discarded vertices {1, 4, 9}.
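The position-index preprocessing can be illustrated with a small sketch that reuses the vertex IDs of the Figure 7 (a) example. The function name, the dictionary-based index, and the concrete slot numbers are hypothetical; the sketch only shows the idea that duplicated vertices keep their slots while newly loaded vertices take over the slots of discarded ones.

```python
from typing import Dict, Iterable, List, Tuple


def plan_in_place_update(prev_slots: Dict[int, int],
                         next_vertices: Iterable[int]
                         ) -> Tuple[Dict[int, int], List[int]]:
    """Assign buffer slots to the next transition vertex set.

    prev_slots maps each vertex of the previous set to its buffer slot.
    Returns the slot map of the next set and the list of vertices that
    must be loaded from CPU memory."""
    next_set = set(next_vertices)
    # Duplicated vertices keep their slot, so their data is reused in place.
    next_slots = {v: p for v, p in prev_slots.items() if v in next_set}
    # Slots of discarded vertices become free.
    free = sorted(p for v, p in prev_slots.items() if v not in next_set)
    to_load = sorted(next_set.difference(prev_slots))
    # If the next set is larger, extend the buffer past the last used slot.
    next_new_slot = max(prev_slots.values(), default=-1) + 1
    for idx, v in enumerate(to_load):
        if idx < len(free):
            next_slots[v] = free[idx]
        else:
            next_slots[v] = next_new_slot
            next_new_slot += 1
    return next_slots, to_load


# Previous set {1, 2, 4, 6, 7, 9} stored in slots 0..5 (assumed layout).
prev = {1: 0, 2: 1, 4: 2, 6: 3, 7: 4, 9: 5}
slots, loaded = plan_in_place_update(prev, [2, 3, 5, 6, 7, 8])
```

Here vertices 2, 6, and 7 keep slots 1, 3, and 4, while the loaded vertices 3, 5, and 8 overwrite the slots previously held by 1, 4, and 9.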
In-place neighbor data management. HongTu uses a single data buffer to maintain the neighbor data for all subgraphs in a partition. It keeps a neighbor index for each neighbor set to track the read positions of its vertices in the local or remote transition data buffer and their write positions in the local neighbor data buffer. When switching the neighbor data between subgraphs, the data of the neighbor set are exchanged between the transition data buffer and the neighbor data buffer according to the indices. We notice that the duplicated neighbors of each pair of adjacently scheduled subgraphs (i.e., the intersection of their neighbor sets) are redundantly transferred. To address this issue, we extend the data reuse technique to inter-GPU communication. HongTu reorders the neighbor vertices of all subgraphs so that the duplicated vertices have the same positions in the neighbor data buffer, and reuses their data during communication.
Host-GPU communication implementation. HongTu allocates both the CPU vertex data buffer h and its gradient buffer ∇h in pinned memory with cudaMallocHost() to support zero-copy memory access in host-to-GPU communication. In GPU-to-host communication, HongTu accumulates the gradients of the transition vertices that are moved out into ∇h on the CPU and keeps the gradients of the reused transition vertices on the GPU. To leverage the GPU's high memory bandwidth, HongTu implements a GPU-based compaction module, as shown in Figure 7 (b). The gradients to be moved out are first collected in GPU memory and then transferred back to the CPU using cudaMemcpyAsync().
Inter-GPU communication implementation. HongTu enables GPUDirect P2P access through the cudaDeviceEnablePeerAccess() function, which facilitates direct memory access between GPUs. In the forward pass, HongTu uses pull-based communication, where each vertex in a neighbor set reads its representation from the corresponding GPU. In the backward pass, HongTu employs a push-based communication scheme to accumulate the neighbor gradients back to the transition data buffer in the corresponding GPUs, utilizing the atomicAdd_system() function [40]. To avoid resource contention caused by multiple GPUs accessing the data of the same GPU, we implement an interleaved communication optimization [65] that prevents different GPUs from accessing one GPU in the same time slot, as shown in Algorithm 2 (line 6) and Algorithm 3 (line 2).
Data buffer deduplication. Maintaining the data buffers of the transition vertices and the neighbor vertices separately doubles the storage overhead for the vertices that belong to both sets. To avoid this issue, HongTu merges the two sets and maintains the data of their union in a single data buffer. Additionally, HongTu regenerates the position indices and modifies the topology of each subgraph to ensure that the computation engine can read and write the merged data buffer directly.
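One simple way to realize such an interleaved schedule is a round-robin offset, sketched below. This is an assumption about the scheme, consistent with the requirement that no two GPUs access the same peer in one time slot; the function name is hypothetical and the sketch only produces the access order, not the data movement.

```python
from typing import List, Tuple


def interleaved_schedule(num_gpus: int) -> List[List[Tuple[int, int]]]:
    """Return, for each time slot, the (source GPU, target GPU) pairs.

    In slot s (1 <= s < num_gpus), GPU i accesses GPU (i + s) mod num_gpus,
    so the targets within a slot form a permutation of the GPUs."""
    return [[(i, (i + s) % num_gpus) for i in range(num_gpus)]
            for s in range(1, num_gpus)]


schedule = interleaved_schedule(4)
# Slot 0: [(0, 1), (1, 2), (2, 3), (3, 0)]
```

Over the three slots of a 4-GPU server, every GPU visits each of its peers exactly once, and no peer is targeted twice within a slot.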
Computation engine. HongTu's computation engine is based on cuSPARSE and PyTorch [41] and operates independently from the communication engine, as shown in Algorithm 1 (lines 8, 17, and 18). Following existing frameworks such as Sancus [42] and DGL [55], HongTu organizes the topology of each subgraph chunk in the compressed sparse row/column (CSR/CSC) formats. These subgraph chunks are abstracted as blocks in the computation engine, facilitating GNN computations at each layer. Graph operations are implemented using cuSPARSE, while PyTorch serves as the backend for neural network computation. HongTu provides a GNN layer definition class with __init__ and forward methods, enabling users to specify the model configuration and forward computation using built-in graph operations and PyTorch functions. Users also have the option to train their self-implemented GNN models in PyTorch or DGL by overriding these functions with their single-process code. If a different graph input format is used, users are required to convert the partitioned subgraphs into their preferred format in the preprocessing stage. The dataflow graph and autograd libraries of PyTorch are used for gradient computation, relieving users from explicitly managing the gradient calculations.
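To make the shape of such a layer definition concrete, the following plain-Python sketch shows a layer class with `__init__` and `forward` methods operating on a CSR block. It is not HongTu's API: the class names `CSRBlock` and `MeanAggregateLayer` are hypothetical, lists stand in for cuSPARSE and PyTorch tensors, and the layer (mean aggregation, linear transform, ReLU) is only one simple example of a graph operation followed by a neural network operation.

```python
from typing import List


class CSRBlock:
    """A subgraph chunk in CSR form: the in-neighbors of destination vertex v
    are indices[indptr[v]:indptr[v + 1]]."""

    def __init__(self, indptr: List[int], indices: List[int]) -> None:
        self.indptr = indptr
        self.indices = indices


class MeanAggregateLayer:
    """Illustrative layer: mean-aggregate neighbor features, apply a
    linear transform, then ReLU."""

    def __init__(self, weight: List[List[float]]) -> None:
        # weight has shape (in_dim, out_dim).
        self.weight = weight

    def forward(self, block: CSRBlock, h: List[List[float]]) -> List[List[float]]:
        in_dim, out_dim = len(self.weight), len(self.weight[0])
        out = []
        for v in range(len(block.indptr) - 1):
            nbrs = block.indices[block.indptr[v]:block.indptr[v + 1]]
            # Graph operation: average the neighbor representations.
            agg = [0.0] * in_dim
            for u in nbrs:
                for d in range(in_dim):
                    agg[d] += h[u][d]
            if nbrs:
                agg = [a / len(nbrs) for a in agg]
            # Neural network operation: linear transform followed by ReLU.
            out.append([max(0.0, sum(agg[d] * self.weight[d][o]
                                     for d in range(in_dim)))
                        for o in range(out_dim)])
        return out


block = CSRBlock(indptr=[0, 2, 3], indices=[0, 1, 1])
layer = MeanAggregateLayer(weight=[[1.0, 0.0], [0.0, 1.0]])
result = layer.forward(block, h=[[1.0, 2.0], [3.0, 4.0]])
```

In an actual PyTorch-based layer the two loops would be replaced by a sparse matrix product and a dense linear layer, with autograd handling the backward computation.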

Experimental Setup
Environments. The multi-GPU experiments are conducted on a GPU server equipped with 4 AMD EPYC 7543 CPUs, 512GB DRAM, and 4 NVIDIA A100 (80GB) GPUs. Each GPU is connected to a CPU via a PCIe 4.0 link, and each CPU has 128GB of local memory. The four GPUs are connected through NVLink-3.0, providing 200GB/s inter-GPU bandwidth. The server runs Ubuntu 18.04 with GCC-7.5, CUDA 11.2, and a PyTorch v1.9 backend [43]. The single-node CPU experiments are conducted on a server containing two Intel Xeon 6246R CPUs @ 3.40 GHz with a total of 32 cores.
Datasets and GNN algorithms. Table 4 presents the major parameters of the real-world graphs used in our experiments. For graphs without ground-truth properties (it-2004 and friendster), we use randomly generated features and labels, and a random division into training (25%), test (25%), and validation (50%) sets. We use two popular GNN models with different computation patterns: GCN [21] has heavy-weight vertex computation and light-weight edge computation, whereas GAT [52] has heavy-weight edge computation and light-weight vertex computation. The hidden layer dimensions for reddit and ogbn-products are set to 256, while for it-2004, ogbn-paper, and friendster, they are set to 128. In our evaluation, the number of partitions is set to 4. Since reddit and ogbn-products are small, their partitions are not split further. Each partition of it-2004, ogbn-paper, and friendster is divided into 8, 32, and 32 chunks in GCN training and into 16, 64, and 64 chunks in GAT training, respectively.
The systems for comparison. We compare HongTu with three full-graph GNN systems: single-GPU DGL v0.9 [55], multi-GPU-based Sancus [42], and CPU-based DistGNN [32], as well as a GPU-based mini-batch GNN system, DistDGL [62]. In DistDGL, the fan-out of neighbor sampling per layer is set to 10, and the batch size is set to 1024. We provide an in-memory version (HongTu-IM) that places all the training data in GPU memory to demonstrate the effectiveness of the GPU computation engine. We also provide a single-GPU HongTu with inter-GPU communication disabled for comparison with DGL and single-CPU-based DistGNN.
Comparison with CPU data offloading techniques in DNN training. Certain DNN frameworks also employ CPU data offloading to mitigate GPU memory overhead. DeepSpeed [44][45][46] is a representative system that stores model parameters in CPU memory and offloads computation to GPUs. However, these frameworks are designed for DNN training with large models and lack GNN-specific considerations, limiting their effectiveness in supporting GNN training (Section 8). To illustrate this, we compare HongTu with DeepSpeed [33] in Section 7.3. Since DeepSpeed does not support GNN training, we implement its data offloading method in HongTu as our baseline (denoted by Baseline in Figure 9), which transfers the neighbor data for each subgraph entirely. The host-GPU on-demand access optimization (Section 6) is enabled in the baseline to enhance CPU-GPU communication. Both DeepSpeed and HongTu employ recomputation-based training [5]. For a fair comparison, we enable recomputation-caching-hybrid intermediate data management (Section 4.2) in both frameworks, even though DeepSpeed does not have this optimization.
Accuracy and evaluation metric. Full-graph GNN training achieves its theoretical accuracy in HongTu because the training semantics are unchanged. Therefore, a shorter per-epoch time indicates better time-to-accuracy performance, and all results are averaged over 20 epochs to ensure consistency. In comparison with mini-batch training-based DGL, HongTu achieves higher test and validation accuracy on reddit, while mini-batch training performs better on ogbn-products. Both the mini-batch and full-graph training methods have distinct merits. However, assessing their effectiveness requires a comprehensive analysis of various factors, including batch size, sampling fan-out, and the characteristics of the input graphs, which is beyond the scope of this work.

Overall Comparison
First, we compare HongTu with single-CPU and single-GPU systems on small graphs to show the efficiency of its GPU computation engine. Then we compare HongTu with multi-GPU systems on all graphs to evaluate the scale it can process with limited GPU resources. Finally, we compare HongTu with a distributed CPU system on large graphs to evaluate its efficiency and monetary cost.
Comparison with single-GPU and single-CPU systems. We compare HongTu and HongTu-IM with DGL [55] and single-CPU DistGNN by running GCN and GAT on the two small graphs (reddit and ogbn-products). Table 5 shows the runtime results and the speedups normalized to DistGNN. We observe that all three GPU-based solutions achieve more than an order-of-magnitude speedup over the CPU-based solution. HongTu-IM achieves performance similar to, or slightly better than, DGL. HongTu is 1.3×-3.8× slower than DGL due to the additional overhead of host-GPU communication and CPU-based gradient accumulation. Although it is slower on these small graphs, only HongTu is capable of training complex GNN models with large-scale intermediate data (e.g., GAT).
Comparison with multi-GPU systems. We compare HongTu with Sancus [42] and DistDGL [62] by running GCN on all five graphs. For the two small graphs, we employ model configurations with 2, 4, and 8 layers, while for the three large graphs, we employ model configurations with 2, 3, and 4 layers. The results are reported in Table 6. HongTu-IM delivers performance comparable to Sancus and is 1.2×-1.9× faster than HongTu on the two small graphs. However, both Sancus and HongTu-IM run out of memory on the three large graphs, whereas HongTu can process them effectively. DistDGL runs out of memory on ogbn-products, it-2004, and friendster when configured with 8, 4, and 4 layers, respectively. Furthermore, in the cases where DistDGL runs successfully, its runtime grows exponentially as the number of layers increases. These challenges arise from the neighbor explosion problem [18], where the computation and memory requirements of mini-batch GNN training increase exponentially with the number of layers. On successful runs, HongTu outperforms DistDGL on reddit, ogbn-products, it-2004, and friendster. On ogbn-paper, DistDGL achieves better performance because it uses only 1.2M vertices (1.1%) for training, resulting in a significantly lower computation volume than HongTu. In summary, HongTu exhibits advantages when training deep GNNs or when the input graph includes a large proportion of training vertices. However, when the training set and the number of model layers are small, DistDGL still holds certain advantages.
Comparison with distributed-CPU system. We compare HongTu with DistGNN [32] by running GCN and GAT on the three large graphs. The results are reported in Table 7. DistGNN runs out of memory for 4-layer GCN on ogbn-paper and for all GAT workloads except the 2-layer GAT training on it-2004. We can observe that training large-scale GNNs remains challenging, even with the extensive host memory provided by multiple CPU nodes. Besides large-scale vertex and [...] USD per-node), which has a similar configuration to our private GPU server. Therefore, HongTu provides a cost-effective solution for processing large-scale GNNs.
Table 8. The proportion of the two types of duplication access on the three billion-scale graphs.

Communication Reduction Analysis
We enable inter-GPU and intra-GPU communication deduplication one by one to reveal how much HongTu benefits from each of them. Table 8 shows the communication reduction volume normalized to the number of vertices (|V|). The results show that these two optimizations reduce host-GPU communication by 25%-71% on the three graphs. Although it-2004 originally has less redundant communication (0.6 times |V|), our proposed method still removes 68% of the total redundant transfers (from 0.6|V| to 0.2|V|). Ogbn-paper benefits more from intra-GPU deduplication because its co-author graph structure exhibits good locality.
To demonstrate the practical improvement of communication deduplication, we conduct experiments on GCN and GAT models with 2-, 3-, and 4-layer configurations on the three large graphs. We start from the baseline approach (Baseline) that transfers the neighbor data for each subgraph entirely, then enable inter-GPU (denoted by +P2P) and intra-GPU communication deduplication (denoted by +RU) one by one. Figure 9 reports the results. Even with the on-demand access optimization (Section 6), the performance of the baseline approach remains inferior, because it suffers from duplicated neighbor data communication and cross-partition remote host-GPU data access. Inter-GPU data sharing reduces the communication time (including host-GPU and inter-GPU communication) by 23%-26%, 23%-27%, and 39%-42% on the three graphs, respectively. The reduction in transfer time is greater than the reduction in transfer volume because inter-GPU sharing eliminates the remote neighbor access across CPUs. Intra-GPU data reuse further reduces the transfer time by 9%-12%, 39%-42%, and 36%-37% on the three graphs, respectively. Overall, HongTu with deduplicated communication achieves speedups ranging from 1.3× to 3.4× over the baseline approach.
Overhead of communication deduplication. Because communication deduplication requires preprocessing the input graph after graph partitioning, we evaluate this overhead and report it in Table 9, where the preprocessing time is denoted as "Preprocessing". We compare it to the execution time of running GCN for 100 epochs with and without communication deduplication (CD). We observe that communication deduplication adds up to 1.5% overhead to HongTu while significantly improving performance over the baseline. The low overhead stems from two factors. First, the preprocessing uses a heuristic design and is executed in parallel. Second, as full-graph GNN training follows the same execution pattern in different layers, the preprocessing only needs to be performed once.

Performance Breakdown
We provide a performance breakdown to analyze the time consumption of different components, including host-GPU communication (H2D), inter-GPU communication (D2D), GPU-based computation (GPU), and CPU-based gradient accumulation (CPU). Figure 9 shows the experimental results. The GPU computation time varies among different GNNs due to their varying computation complexities. In GCN, with simple arithmetic edge computation, GPU computation accounts for 10%-14% of the overall runtime. In contrast, in GAT, with parameterized edge computation, the GPU computation time is 4.5 times longer than that of GCN and accounts for 54%, 28%, and 35% of the total runtime. The communication (H2D+D2D) time also varies among different GNNs. GCN benefits from recomputation-caching-hybrid training, which reduces its communication time by 21%-29% compared to GAT training. Overall, the communication time accounts for 58%-61% and 36%-50% of the overall runtime on GCN and GAT, respectively. As HongTu utilizes CPUs to accumulate the neighbor gradients, the CPU computation time is proportional to the volume of transferred neighbors and accounts for 8% to 30% of the overall runtime.

Sensitivity Study
Performance with varying layers. Since the computation pattern is exactly the same, communication deduplication is equally effective for all GNN layers. Therefore, increasing the number of layers does not weaken the optimization effect. As shown in Figure 9, HongTu achieves 1.4×-1.5×, 2.5×-2.7×, 3.2×-3.4×, 1.3×-1.3×, 2.3×-2.4×, and 2.6×-2.8× speedups over the baseline approach under different layer configurations. The optimization effect is stable.
Performance with varying chunks. The number of chunks per partition in HongTu is a configurable parameter that controls the memory consumption of training data. However, increasing the number of chunks also leads to more duplicated neighbors, which increases the volume of host-GPU communication. To evaluate this impact, we run GCN on the three large graphs, starting from the initial number of chunks and increasing it by a factor of 2, 3, and 4. The experimental results in Figure 10 show that as the number of chunks increases by 4×, the memory consumption decreases by 51%-65%, and the runtime increases by 1.5× to 2.2×. The increase in runtime is either linear or sublinear, and is proportional to the decrease in memory consumption. Additionally, the performance of HongTu can be improved by using GPUs with larger memory capacity, although our approach can be adapted to GPUs of different grades.
Fig. 10. Runtime and memory analysis of HongTu with different chunks. We run GCN on each graph from an initial number of chunks and increase it to 2×, 3×, and 4×.

Scalability of HongTu
We evaluate the scalability of HongTu by varying the number of GPUs used in training. Figure 11 shows the normalized speedups of GCN and GAT training on it-2004, ogbn-paper, and friendster. The execution time of HongTu decreases as more GPUs are used. Specifically, when the number of GPUs increases from 1 to 4, HongTu achieves 3.3×-3.7× speedups for GCN training and 3.4×-3.8× speedups for GAT training. The speedup from 1 to 2 sockets is lower than that from 2 to 4 sockets because we do not have enough CPU memory to enable NUMA-aware vertex data allocation. When using two or fewer GPUs, we must use the memory of all sockets, resulting in remote memory access overhead.

RELATED WORK
As the size of DNN models increases, the traditional data parallel (DP) training method, which replicates model parameters across all training processes, faces scalability issues due to the increasing memory consumption [1,36,54]. To address this, various parallel training methods have been proposed, including model parallelism [6,54], pipeline parallelism [17,36,37], partitioned data parallelism [44] (which partitions model states among data-parallel workers to eliminate memory redundancy), and 3D parallelism [33] (which combines model, pipeline, and data parallelism to leverage the aggregated GPU memory of a cluster). However, the scalability of these frameworks remains constrained by the available GPU resources. To address the limitation of GPU memory, DeepSpeed [33] incorporates CPU-based data offloading techniques [44][45][46]. It partitions the model state into smaller slices and stores them in CPU DRAM or NVM. During training, the required slices are loaded into the GPUs sequentially as they are accessed. While DeepSpeed and our design both offload memory-intensive data to CPU memory, they differ in their optimization objectives. DeepSpeed primarily focuses on DNNs with large model parameters. In DNNs, model parameters consist of dense matrices that can be partitioned into disjoint slices without interdependencies. These slices can be communicated efficiently between the CPU and GPUs due to their regular data access patterns. In contrast, HongTu is tailored for GNN training, where the memory overhead primarily arises from vertex data, and the model data are typically small. Due to the inherent complexity of graph structures, vertex data are randomly distributed and duplicated across partitions, resulting in irregular and increased host-GPU data communication. DeepSpeed's approach does not adequately address these challenges, whereas HongTu resolves them through its deduplicated communication framework.

CONCLUSION
We present HongTu, a scalable and efficient system for training full-graph GNNs on limited GPU memory. Our system relies on two key components to achieve its performance: a memory-efficient GNN training framework that combines partition-based GNN training with recomputation-caching-hybrid intermediate data management, and a deduplicated communication framework that converts the redundant host-GPU communication for duplicated neighbors into inter-GPU and intra-GPU data access. Our experiments demonstrate that HongTu can efficiently train on billion-scale graphs using just 4 GPUs by fully utilizing the CPU, GPUs, and interconnects.

Fig. 2. 2D graph partitioning on an 8-vertex toy graph. Each partitioned subgraph is represented by a colored box. Solid and dashed circles denote the master and mirror vertices, respectively.

Fig. 4. A graphical illustration of the original, recomputation-based, and recomputation-caching-hybrid training methods on the GCN model. We show a single layer as an example; the other layers follow the same calculation mode. A box represents an operation, a circle represents a tensor, and a tensor surrounded by a black frame needs to be cached in CPU memory (checkpoint). Solid arrows indicate dependencies between tensors and operations, and dashed arrows indicate host-GPU communication.

Fig. 5. An illustration of 2-level graph partitioning and the task scheduling on multiple GPUs.

Figure 6 (b) provides a graphical example. The deduplicated vertex sets $\mathcal{N}^{\cup}_0$, i.e., {0, 1, 2, 3, 4, 7}, and $\mathcal{N}^{\cup}_1$, i.e., {2, 3, 4, 5, 6}, are transferred only once, reducing the number of host-GPU transfers from 19 to 11. To share the communication workload among GPUs, we divide $\mathcal{N}^{\cup}_j$ into $m$ subsets $\{\mathcal{N}_{0j}, \ldots, \mathcal{N}_{m-1,j}\}$ and assign the communication of $\mathcal{N}_{ij}$ to GPU $i$, where $\mathcal{N}_{ij}$ is the subset of $\mathcal{N}^{\cup}_j$ belonging to partition $i$, as shown in Figure 6 (b). In the second step, we perform intra-GPU deduplication on the transition vertex sets of each pair of adjacently scheduled subgraphs, e.g., $\mathcal{N}_{i,j-1}$ and $\mathcal{N}_{ij}$. We divide the successor transition vertex set $\mathcal{N}_{ij}$ into two disjoint subsets: the reused subset, which contains the duplicated vertices $\mathcal{N}_{ij} \cap \mathcal{N}_{i,j-1}$, and the loaded subset, which contains the remaining vertices $\mathcal{N}_{ij} \setminus \mathcal{N}_{i,j-1}$. When loading the data of $\mathcal{N}_{ij}$, vertices in the reused subset are directly reused from the GPU, and vertices in the loaded subset are loaded from CPU memory.
Proc. ACM Manag. Data, Vol. 1, No. 4 (SIGMOD), Article 246. Publication date: December 2023.

Figure 6 (b) shows a graphical illustration with purple dashed arrows. When processing batch 1, the data of the reused subsets of $\mathcal{N}_{11}$ ({2, 3}) and $\mathcal{N}_{12}$ ({4}) are directly reused from batch 0, and the data of the loaded subsets of $\mathcal{N}_{12}$ ({5}) and $\mathcal{N}_{13}$ ({6}) are loaded from CPU memory. This step further reduces the number of host-GPU transfers from 11 to 8.

Fig. 9. Performance breakdown of HongTu on GCN and GAT with different hidden layers, where 'Baseline' denotes the baseline approach, 'P2P' the inter-GPU communication, and 'RU' the intra-GPU data reuse. 'GPU' represents the GPU computations, 'H2D' the host-GPU communication, 'D2D' the inter-GPU communication, and 'CPU' the CPU-based gradient accumulation.

Table 2. Summary of existing full-graph GNN systems. 'VD' represents the vertex data and 'ID' represents the intermediate data.

Table 3. Neighbor replication factor under different partitions.
In Algorithm 1, after the forward pass over all layers, the loss is computed with downstream_task() (line 10) and back-propagated with loss.backward() (line 11); the backward pass then visits the layers in reverse order and, within each layer, the batches in turn (lines 12-13).
Algorithm 2 outlines the workflow of deduplicated communication in the forward pass. It loads the neighbor representations from the CPU data buffer h to the GPUs separately, based on four distinct vertex sets per subgraph: the neighbor set, the transition vertex set $\mathcal{N}_{ij}$, and the reused and loaded subsets of $\mathcal{N}_{ij}$. In the first step (host-to-GPU), each GPU $i$ loads the data of its transition vertices into the transition data buffer $h_{\mathcal{N}_{ij}}$ by reusing the data of the duplicated vertices from the previous batch and loading the data of the remaining vertices from the CPU (lines 2-3). In the second step (GPU-to-GPU), the GPUs communicate with each other, each visiting its peers in an interleaved order starting from the next GPU (line 6), to fetch the data of the neighbors held in remote transition data buffers and assemble the neighbor data in local memory (lines 5-7), followed by a synchronization (line 8).
In the backward pass, the deduplicated communication moves the neighbor gradients on the GPUs back to the CPU and accumulates them to the gradient buffer ∇h. Algorithm 3 outlines this process. Its input consists of the same four vertex sets and the neighbor gradients on the GPUs, and its output is the accumulation of these gradients in the CPU gradient buffer ∇h. In the first step, the GPUs accumulate the neighbor gradients back to the gradient buffer of the transition vertices (lines 1-3). Subsequently, each GPU moves the gradients of the transition vertices that are not reused out to CPU memory (line 6) and keeps the gradients of the reused transition vertices on the GPU to accumulate the gradients of the next scheduled subgraph.
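The two communication passes can be simulated on the CPU with plain dictionaries, as in the sketch below. It is a simplified model under stated assumptions, not the CUDA implementation: the function names and data layout are hypothetical, each vertex is assumed to belong to the transition set of exactly one GPU, scalar values stand in for feature vectors, and the peer order `(i + s) mod m` is an assumed interleaving.

```python
from typing import Dict, List, Set, Tuple

Buffer = Dict[int, float]


def forward_communicate(cpu_h: Buffer, prev_buffers: List[Buffer],
                        transition: List[Set[int]], neighbors: List[Set[int]]
                        ) -> Tuple[List[Buffer], List[Buffer], int]:
    """Forward pass: fill transition buffers, then pull neighbor data."""
    m = len(transition)
    buffers, host_loads = [], 0
    # Step 1 (host-to-GPU): reuse data left by the previous batch,
    # load everything else from CPU memory.
    for i in range(m):
        buf = {}
        for v in transition[i]:
            if v in prev_buffers[i]:
                buf[v] = prev_buffers[i][v]
            else:
                buf[v] = cpu_h[v]
                host_loads += 1
        buffers.append(buf)
    # Step 2 (GPU-to-GPU): each GPU pulls its neighbors from the local and
    # remote transition buffers, visiting GPUs in interleaved order.
    neighbor_data = []
    for i in range(m):
        data = {}
        for s in range(m):
            k = (i + s) % m
            for v in neighbors[i] & transition[k]:
                data[v] = buffers[k][v]
        neighbor_data.append(data)
    return buffers, neighbor_data, host_loads


def backward_accumulate(cpu_grad: Buffer, grad_buffers: List[Buffer],
                        transition: List[Set[int]],
                        next_transition: List[Set[int]],
                        neighbors: List[Set[int]],
                        neighbor_grads: List[Buffer]) -> List[Buffer]:
    """Backward pass: push neighbor gradients, then move out or keep them."""
    m = len(transition)
    # Step 1: push-based accumulation into the owning GPU's gradient buffer.
    for i in range(m):
        for s in range(m):
            k = (i + s) % m
            for v in neighbors[i] & transition[k]:
                grad_buffers[k][v] = grad_buffers[k].get(v, 0.0) + neighbor_grads[i][v]
    # Step 2: gradients of vertices reused by the next batch stay on the GPU;
    # the rest are accumulated into the CPU gradient buffer.
    reserved = []
    for i in range(m):
        keep = {}
        for v, g in grad_buffers[i].items():
            if v in next_transition[i]:
                keep[v] = g
            else:
                cpu_grad[v] = cpu_grad.get(v, 0.0) + g
        reserved.append(keep)
    return reserved


cpu_h = {v: float(v) for v in range(6)}
transition = [{0, 1}, {3, 4}]
neighbors = [{0, 1, 3}, {1, 3, 4}]
bufs, nbr_data, loads = forward_communicate(cpu_h, [{1: 1.0}, {}], transition, neighbors)
cpu_grad: Buffer = {}
kept = backward_accumulate(cpu_grad, [{}, {}], transition, [{1}, set()], neighbors,
                           [{0: 1.0, 1: 1.0, 3: 1.0}, {1: 2.0, 3: 2.0, 4: 2.0}])
```

In this toy run, vertex 1 is reused on GPU 0 in the forward pass, so only three vertices are loaded from the host, and its gradient stays on GPU 0 in the backward pass because the next batch also needs it.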

Table 4. Dataset description. |V|, |E|, #F, and #L represent the number of vertices, edges, features, and labels, respectively.
The distributed CPU experiments are conducted on a 16-node Aliyun ECS cluster. Each node (ecs.r5.16xlarge instance) is equipped with 56 vCPUs and 512GB DRAM. The network bandwidth is 20 Gbps. All these machines run Ubuntu 20.04.

Table 5. Comparison with DGL and DistGNN on two small datasets.

Table 7. Comparison with DistGNN on a 16-node ECS cluster.

Table 9. Analysis of the cost of communication deduplication.