A Coordinated Strategy for GNNs Combining Computational Graph and Operator Optimizations

Graph Neural Networks (GNNs) have garnered significant interest across various domains due to their efficacy in learning from graph-structured data. In pursuit of higher performance, numerous GNN frameworks have emerged recently. However, recent work tends to study performance optimization at the computational graph level and the operator level separately, and the existing optimization techniques rely on pattern matching and manual intervention driven by human expertise. Consequently, their performance remains sub-optimal and sensitive to input graphs and GNN models. In this work, we develop an efficient coordinated strategy named AlphaGNN, which achieves an effective combination of computational graph optimization and operator optimization. To make this coordinated optimization effective, a rule-based computational graph optimization and a performance-driven operator optimization are proposed. The experimental results confirm that AlphaGNN achieves up to 12.39× (2.94× on average) performance improvement over state-of-the-art methods on diverse GNN models.


INTRODUCTION
A graph neural network (GNN) is designed to extract rich information from graph data such as social networks and knowledge graphs. GNNs hold state-of-the-art performance across a wide range of prediction tasks on graphs, such as graph classification [29] and link prediction [23]. In the last several years, GNNs have been used in many real-world applications across diverse domains, from biology and medicine [8] to social networks, personal recommendations [10], and so on. Thus, the performance of GNN training and inference is essential for the overall speed of these applications. As the need arises to accommodate deeper GNN models and larger graph datasets, improving GNN performance has become an urgent demand.
GNNs comprise a combination of graph operations and neural operations. However, merely integrating graph computing frameworks with deep neural network (DNN) frameworks is inadequate to support efficient GNN executions. This approach fails to effectively handle the complex interactions between graph operations and neural operations, which are crucial for optimizing GNN performance. For instance, it is hard to achieve high efficiency by performing neural operations based on the graph structure, due to the mismatch between the dense computational property of neural operations and the sparse patterns in graph operations.
To achieve high performance, various GNN frameworks have been developed [4, 11, 16, 24, 27, 28, 33], among which Deep Graph Library (DGL) [24], PyTorch Geometric (PyG) [4], Seastar [27], and Graphiler [28] represent the state-of-the-art technologies. These efficient frameworks establish a computational graph abstraction based on the GNN structure, develop message-passing primitives for graph operations, and implement the primitives as operators. In general, the current efforts to improve performance can be classified into two main categories: computational graph optimization and operator/kernel optimization. Computational graph optimization for GNNs involves rewriting the computational graph by identifying specific operator patterns and replacing them with more efficient alternatives, such as operator reordering [28] and sub-graph fusion [27, 28, 32]. It shares similarities with traditional loop or pipeline tuning in compilers [14], aiming to reduce redundancy in computation and optimize memory access. On the other hand, operator optimization centers around designing efficient kernels, for example optimizing Sparse Matrix-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM) through adaptive tiling of the input matrices into dense and sparse tiles.
Both computational graph and operator optimizations play crucial roles in achieving high performance for GNNs. However, recent work first rewrites the computational graph and subsequently designs the operator implementation, which segregates the two levels of optimization. Moreover, the present strategies for computational graph and operator optimizations are primarily manual, heavily reliant on human expertise, and confined to a limited search space. For instance, many sub-graph fusion strategies are based on recognizing and substituting predefined fusion patterns derived empirically. Due to these factors, the performance of recent GNN frameworks often falls short of optimal levels and exhibits sensitivity to diverse inputs, including both data graphs and GNN models.
To address the performance bottleneck of GNNs, we review both computational graph and operator optimizations in detail, and find that operator/kernel optimization frequently creates multiple efficient kernels, resulting in kernel splits that can in turn alter the computational graph's structure. As a result, there is a promising opportunity to push the performance envelope further by synergistically integrating computational graph and operator optimizations. Motivated by this basic idea, we propose a novel coordinated strategy called AlphaGNN, which alternates optimization between the computational graph level and the operator level, and directly generates high-performance code from input data graphs and GNN models. To accomplish the coordinated optimization, we propose rule-based computational graph optimization and performance-driven operator optimization strategies, which can automatically and adaptively adjust to specific inputs. These strategies contrast with the prevailing pattern-based computational graph optimization and manual kernel design methods, effectively liberating GNN optimization from the constraints of human expertise.
AlphaGNN is an efficient strategy that implements effective coordination of computational graph and operator optimizations. It differs from previous works in the following ways. (1) It develops rules for computational graph transformation and sub-graph fusion, as opposed to looking for patterns with a specific combination of operations. The rule-based optimization enables the discovery of a broader spectrum of opportunities to enhance performance by minimizing redundancy in computation and memory access. (2) It establishes an automatic code generation mechanism instead of the manual kernel implementation employed by recent works, which makes it possible to carry out the coordinated optimization seamlessly. (3) It adopts an innovative iteration that alternates between computational graph optimization and operator optimization to systematically search for and generate high-performance GNN code. Specifically, our contributions can be summarized as follows:
• Develop an efficient coordinated strategy, AlphaGNN, which achieves an effective combination of computational graph optimization and operator optimization (Section 3).
• Propose a novel rule-based computational graph optimization approach and design a performance-driven operator optimization approach, which make it possible to coordinate the optimizations at the computational graph level and the operator level (Section 4).
• Demonstrate that AlphaGNN improves GNN performance by up to 12.39× (2.94× on average) compared to four state-of-the-art frameworks (Section 5).

BACKGROUND AND MOTIVATION

GNN Operators
On a data graph G = (V, E) with the vertex set V and the edge set E, a GNN layer consists of operators with four basic patterns [32], Scatter, ApplyEdge, Gather, and ApplyVertex:
Scatter: f_{h_{i,j}} = Scatter(f_i, f_j),
ApplyEdge: f'_{h_{i,j}} = ApplyEdge(f_{h_{i,j}}),
Gather: f'_i = Gather({f'_{h_{j,i}} | e_{j,i} ∈ E}),
ApplyVertex: f''_i = ApplyVertex(f'_i).
In the equations above, v_i and v_j are the i-th and j-th vertices, and e_{i,j} is the edge connecting v_i and v_j. f_i represents the feature vector attached to v_i, and f_{h_{i,j}} refers to the feature vector of e_{i,j}, where h_{i,j} is the edge index of e_{i,j}.
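To make the four patterns concrete, the following is a minimal sketch of their semantics on a COO edge list, assuming `src` and `dst` are LongTensors of shape [|E|] and `x` is the vertex feature matrix of shape [|V|, f]; the function names and signatures are illustrative, not AlphaGNN's API.

```python
import torch

def scatter(x, src, dst, op=torch.mul):
    # Scatter: copy the endpoint vertex features onto each edge and apply a
    # binary operation, producing one message per edge (shape [|E|, f]).
    return op(x[src], x[dst])

def apply_edge(msg, fn):
    # ApplyEdge: an edge-wise transformation of the edge features.
    return fn(msg)

def gather(msg, dst, num_nodes):
    # Gather: reduce the messages of all incoming edges at each destination vertex.
    out = torch.zeros(num_nodes, msg.shape[1], dtype=msg.dtype)
    return out.index_add_(0, dst, msg)

def apply_vertex(h, fn):
    # ApplyVertex: a vertex-wise transformation, e.g. a linear layer or activation.
    return fn(h)
```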

Computational Graph of GNN
The major GNN frameworks usually utilize computational graphs to express the computation process, where operators or functions are abstracted as computational nodes and data dependencies are represented by edges. In the computational graph for GNNs [28], each operator is typically associated with a single kernel and carries additional GNN-related metadata, including the pattern type and state description of the operator, such as Scatter and fused. It is noteworthy that the data residencies of operands are marked on the incoming data flow to avoid ambiguities. As shown in the top part of Figure 1, the dataflow labeled with dst on edges signifies that the result of the sum operator is accessed by edges through the destination node index. Furthermore, operators with the Scatter pattern require not only copying features from the source or destination vertex but also executing a binary operation. Specifically, the div operator performs an element-wise vector division on the edge feature produced by the exp operator and the vertex feature produced by the sum operator, which is broadcast from the destination node to each edge.
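As an illustration, the exp-sum-div pattern described above corresponds to an edge softmax over incoming edges; the following is a hedged PyTorch sketch (assuming `score` has shape [|E|, 1] and `dst` holds the destination vertex of each edge), not the framework's actual kernel.

```python
import torch

def edge_softmax(score, dst, num_nodes):
    e = torch.exp(score)                                  # ApplyEdge: exp on every edge
    z = torch.zeros(num_nodes, e.shape[1], dtype=e.dtype)
    z = z.index_add_(0, dst, e)                           # Gather: per-destination sum
    return e / z[dst]                                     # ApplyEdge: div, sum broadcast back to edges via dst
```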

GNN Optimization
Operator Optimization. In the research on GNN optimization, operator optimization attracted attention early and is currently the most prolific field. One reason is that a few common combinations of operator patterns account for most of the execution time of the whole network. g-SpMM and g-SDDMM [24] are two important kernels because they represent the most common patterns in GNNs, namely Scatter-ApplyEdge-Gather and Scatter-ApplyEdge. Additionally, the parallelism of g-SpMM and g-SDDMM matches the vertex-centric and edge-centric parallelism found in graph algorithms, so conclusions on workload balance in graph algorithms also hold for these kernels. Sparse kernels are inherently performance-sensitive to their inputs, thus most research works adapt to the inputs based on the sparsity of the graph and the variability of feature dimensions [11, 12, 25]. Another effective optimization is preprocessing the graph data to further exploit the irregularity of the graph [6, 11, 22].
Computational Graph Optimization. Optimizing the computational graph of GNNs has emerged as a recent research focus. Works in this area analyze the trade-offs made by existing GNN frameworks to accommodate a wide range of GNN models, such as redundant computations and memory accesses, and propose graph transformation techniques to avoid such redundancies. These graph transformation rules can generally be categorized into reordering and fusion. First, operator reordering rearranges the order of sparse and dense operators based on mathematical rules to reduce redundant computations and intermediate results [28, 32]. Second, sub-graph fusion replaces combinations of multiple operators with a fused operator, usually implemented by a fused kernel, to reduce redundant memory accesses and memory consumption [27, 33].
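For reference, the semantics of the two kernels mentioned above can be sketched as follows; this is a simplified COO/sparse-tensor formulation for clarity, not the framework kernels themselves (real g-SpMM/g-SDDMM kernels operate on sparse formats such as CSR and support general message and reduce functions).

```python
import torch

def g_spmm(adj, x):
    # Scatter-ApplyEdge-Gather collapsed into one kernel: every vertex sums the
    # features of its neighbours. `adj` is a torch.sparse_coo_tensor of shape [|V|, |V|].
    return torch.sparse.mm(adj, x)

def g_sddmm(src, dst, xs, xd):
    # Scatter-ApplyEdge: compute one value per existing edge from its endpoint
    # features (here a dot product), touching only edges that exist in the graph.
    return (xs[src] * xd[dst]).sum(dim=-1)
```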

Motivation
To enhance the performance of GNNs, computational graph optimization and operator optimization have been developed separately and independently in recent years [12, 27]. At the operator level, the implementation of customized sparse kernel designs holds the potential to improve floating-point computation efficiency. As shown in Figure 1, when we apply sub-graph fusion to rewrite the computational graph and subsequently conduct operator optimization to derive the final kernels for implementation, we attain significant performance improvements compared to relying solely on computational graph optimization. While the combination of computational graph optimization with operator optimization may yield performance benefits, directly chaining these two optimizations usually falls short of achieving optimal performance for GNNs. This is attributed to two factors. First, as shown in Figure 1, the performance of different optimization methods is sensitive to inputs, such as the input data graphs. Second, selecting the superior strategy at the computational graph level does not always result in higher overall performance after operator optimization. As evident from Figure 1, CG1 outperforms CG2 for the input graph G0 without operator optimization. However, CG2 with operator optimization can unlock more performance potential than CG1 with operator optimization.
As a result, it is challenging to coordinate computational graph and operator optimizations to achieve extreme performance. This difficulty arises because computational graph optimization is unable to perceive or evaluate real performance when transforming the computational graph without the operator kernels. In this work, we endeavor to address this challenge by proposing a coordinated strategy. This strategy searches for improved designs alternately at the computational graph level and the operator level, even though empirical computational graph optimization may not directly perceive the final performance.

Basic Idea
This work is based on two attractive observations.
Observation 1. An operator is usually linked to a single kernel, representing a specific implementation or code block. Nevertheless, we observe that the efficient implementation of a sparse operator may require more than one kernel, as illustrated in Figure 2. This finding suggests that kernel optimization potentially involves dividing an operator kernel into multiple kernels. In this study, Figure 2(a) and (b) summarize two types of kernel splits for GNNs, namely the vertical kernel split (Vsplit) and the horizontal kernel split (Hsplit).
Definition 3.1. For a given GNN and its data graph G(V, E), assume V can be divided into n disjoint subsets V_1, ..., V_n.
(1) A kernel K represents a specific code implementation of one operator or multiple operators for the GNN.
(2) A Vsplit of K is to replace K with n kernels K_1, ..., K_n that can be executed concurrently, where K_i represents a customized kernel design equivalent to the original kernel K confined to V_i.
(3) An Hsplit of K is to replace K with n kernels K_1, ..., K_n that are executed sequentially, satisfying K = K_n ∘ ⋯ ∘ K_1, where the output of K_i is the input of K_{i+1}.
From Figure 2, it is evident that both Vsplit and Hsplit can yield performance improvements in certain cases. This observation holds significance in the endeavor to achieve optimal performance for GNNs.
Observation 2. To the best of our knowledge, recent efforts for GNNs have studied computational graph optimization and operator optimization independently for several years [12, 27]. To achieve high performance, it has been conventionally presumed that leveraging existing computational graph and operator optimization methods independently would suffice. The established practice involves rewriting the computational graph and then designing efficient kernels for the operators within the modified computational graph. This optimization idea has become commonplace in high-performance domains for artificial intelligence, not only for GNNs. However, this work makes a pioneering observation that Vsplit and Hsplit can lead to alterations in the computational graph (Figure 3). This revelation provides fresh opportunities for further advancements in computational graph optimization. Recent studies have not uncovered this aspect, likely due to the conventional separation between computational graph optimization and operator optimization in current research efforts.

Coordinated Optimization Strategy
Based on the analysis above, we propose AlphaGNN, a coordinated optimization strategy combining both computational graph transformation and operator design (Algorithm 1). The iterative workflow of AlphaGNN is shown in Figure 4.
Initial Process: Computational graph rewriting. Initially, computational graph rewriting is employed to eliminate both computation and memory-access redundancy. Specifically, AlphaGNN rewrites the computational graph with the given patterns in Table 1. Based on basic mathematical laws, pattern-based rewriting can significantly reduce the overhead of computation and memory access. In AlphaGNN, hundreds of rewriting patterns are summarized, which cover prevalent computational graph rewriting approaches, such as the operator reordering and combining found in current works [4, 24, 27, 28]. For example, if an ApplyEdge operator f and its successor Gather operator G satisfy both the commutative and distributive laws, a redundancy may exist in the ApplyEdge-Gather procedure. According to the laws in Table 1, G({f(m_0), f(m_1), ..., f(m_k)}) = f(G({m_0, m_1, ..., m_k})), while executing G({f(m_0), f(m_1), ..., f(m_k)}) involves more computational cost. The operators concat and mm match the fourth rewriting pattern in Table 1, as Figure 4 shows. This pattern substitution reduces the number of floating-point operations.
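The concat-mm rewriting can be checked numerically: a matmul applied to the concatenation of two feature blocks equals the sum of two smaller matmuls on the individual blocks, so a per-edge concat followed by one large matmul can be replaced by two per-vertex matmuls whose results are added. The shapes and names in the sketch below are illustrative only.

```python
import torch

E, f, out_dim = 8, 16, 4
s, d = torch.randn(E, f), torch.randn(E, f)      # per-edge source/destination features
W = torch.randn(2 * f, out_dim)

lhs = torch.cat([s, d], dim=1) @ W               # original: per-edge concat, then one mm
rhs = s @ W[:f] + d @ W[f:]                      # rewritten: two smaller mms, then an add
assert torch.allclose(lhs, rhs, atol=1e-4)
```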
Step 1: Computational graph optimization. Building upon the coordination concept of computational graph and operator optimizations, the computational graph optimization aims to exploit operator fusion opportunities by constructing fusion groups composed of appropriate candidate operators, which provides the basis for the efficient operator kernel generation in Step 2 (the subsequent step). In AlphaGNN, sequential operators are viewed as fusion candidates, and they are grouped based on a sub-graph fusion mechanism (Section 4.1.2). The sub-graph fusion mechanism comprises two fusion rules that respectively focus on the characteristics of each operator and the relationship of two adjacent operators. The fusion process involves a sequential scan of the operators in a computational graph. When either the current operator or two adjacent operators meet the conditions of any rule, the corresponding rule is applied to construct fusion groups. It is important to note that all operators within the same fusion group serve as fusion candidates, while the ultimate determination of how to fuse them depends on the operator kernel design in Step 2. As shown in Figure 4, after the computational graph rewriting, the computational graph optimization is executed. According to the rules of the sub-graph fusion mechanism, the dense matrix-matrix multiplication operator mm is not allowed to participate in the fusion process. In the case of the GAT model (Figure 4), the fusion process bypasses the operator mm and proceeds to scan the subsequent operators. As no further rule is triggered, all the remaining operators are selected into the same fusion group.
Step 2: Operator optimization. To achieve high performance, the operator optimization designs and tunes kernels for each fusion group with multiple candidate operators. Specifically, all the operators in a fusion group are first considered together to generate a single kernel. Next, the generated kernel is split in the vertical and horizontal directions to explore more efficient implementations. As Vsplit and Hsplit are two mutually orthogonal optimizations, each needs to be executed only once.
In AlphaGNN, we propose the kernel split mechanism and the automatic code generation mechanism for efficient operator optimization (Section 4.2). The kernel split mechanism describes the precise execution process of Vsplit and Hsplit. To effectively enhance performance, profiling data is utilized to decide whether an operator should be extracted from the fusion group. Simultaneously, the automatic code generation mechanism ensures that any kernel required in the operator optimization can be generated and guarantees its efficiency. As shown in Figure 4, following the computational graph optimization, the operator optimization sequentially applies vertical and horizontal kernel splits to the fusion group named norm-SpMM. The corresponding efficient kernels are automatically generated, and the real performance is evaluated and recorded for Step 3. After kernel generation and optimization, if we view each generated kernel as an individual operator, the computational graph changes, as shown in Figure 4. This transformation provides an opportunity to re-execute the computational graph optimization, potentially leading to further improvements in performance.
Step 3: Iteration. The workflow proceeds back to Step 1 (Figure 4).

Both Step 1 and Step 2 are iteratively executed until there is no performance improvement and the computational graph becomes stable.
Remark. The iterative design searches for high-performance GNN implementations alternately between the computational graph level and the operator level, which offers two key advantages. First, based on kernel splits in both the vertical and horizontal directions, the strategy thoroughly explores profitable opportunities for fusing sub-graphs. Second, as the computational graph optimization alone may not accurately gauge real performance without the final implementation of the operator kernel, the coordinated strategy utilizes operator optimization, rather than computational graph optimization, to determine the ultimate transformation of the computational graph at each iteration. This approach enables performance-based operator fusion.

IMPLEMENTATION
In AlphaGNN, we aim to maximize the impact of coordinated optimization by introducing two key approaches: a rule-based computational graph optimization approach (Algorithm 2), involving the computational graph rewriting and sub-graph fusion mechanisms, and a performance-driven operator optimization approach (Algorithm 3), which includes the kernel split and automatic code generation mechanisms.

Rule-based Computational Graph Optimization
4.1.1 Computational Graph Rewriting Mechanism. In the first iteration of the coordinated optimization strategy, we need to perform graph rewriting on the original computational graph. AlphaGNN applies the rewriting patterns that are less affected by the irregularity of the data graph, and follows the graph-rewriting process (Lines 1-8 in Algorithm 2).
Stage 1: Reorder computation-intensive dense and sparse operators. These reordering strategies are based on mathematical laws such as commutativity, associativity, and distributivity. By moving computation-intensive dense operators before Scatter or after Gather, the input size of dense operators can be reduced from O(|E|) to O(|V|), as shown in Table 1. These operators are not involved in sub-graph fusion processes [27, 32], so adjusting their order does not affect the subsequent sub-graph fusion.
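As a hedged illustration of Stage 1, moving a linear (dense) operator ahead of a Scatter lets it run over |V| rows instead of |E| rows; linearity guarantees that both orders produce identical results. The function names below are illustrative.

```python
import torch

def unoptimized(x, src, W):
    msg = x[src]                 # Scatter first: materializes an [|E|, f] intermediate
    return msg @ W               # dense mm over |E| rows

def reordered(x, src, W):
    h = x @ W                    # dense mm over only |V| rows
    return h[src]                # Scatter the already-transformed features
```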
Stage 2: Eliminate redundant data movement. In an unoptimized computational graph, there are sometimes 'empty' Scatter operators that merely copy vertex features to the edges, which leads to redundant intermediate results. A simple but efficient solution to this issue is to fuse the empty Scatter with the following ApplyEdge operator using a g-SDDMM kernel [24, 33], as shown in Table 1. Although this optimization strategy affects the implementation of operators, the fused implementation does not bring major side effects, as the g-SDDMM kernel adopts edge-centric parallelism.
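The benefit of this fusion is that the [|E|, f] copies produced by the empty Scatter are never materialized; the hedged sketch below uses a plain Python loop to stand in for the edge-parallel g-SDDMM kernel, and the names are illustrative only.

```python
import torch

def unfused(x, src, dst):
    es, ed = x[src], x[dst]                # empty Scatter: two [|E|, f] intermediates
    return (es * ed).sum(dim=-1)           # ApplyEdge on the materialized copies

def fused_g_sddmm(x, src, dst):
    out = torch.empty(src.numel(), dtype=x.dtype)
    for k in range(src.numel()):           # one edge per (GPU) thread in a real kernel
        out[k] = torch.dot(x[src[k]], x[dst[k]])   # read vertex rows directly, no copies
    return out
```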

4.1.2 Sub-graph Fusion Mechanism.
The sub-graph fusion aims to reduce the overhead of kernel launches as well as the memory access overhead. In AlphaGNN, the sub-graph fusion mechanism (Lines 9-27 in Algorithm 2) seeks to fuse as many operators as possible, yielding a minimum number of operators. Furthermore, the performance of operators is affected by the irregularity of input graphs and the GPU architecture, and it is necessary to exploit these properties to obtain an efficient GPU implementation of the fused operators.

Fusion rules. The sub-graph fusion mechanism includes two rules.
Rule 1: The compute-intensive operators (such as dense matrix-matrix multiplication operator) and fused operators are not allowed to participate in the fusion process.
Supporting fact (a): In the computational graph of a GNN, each operator is classified as either compute-intensive or memory-intensive, and their tuning strategies are quite different. The compute-intensive operators, such as mm, are often highly optimized in cuDNN [2] and widely adopted.
To maximize the fusion opportunities between memory-intensive operators, the compute-intensive operators are not allowed to participate in the fusion process.

Supporting fact (b):
In our coordinated strategy, once a fused operator has been generated and fixed by the operator optimization, fusing it with other operators will not yield a performance improvement.
Theorem 4.1. In the coordinated strategy, executing fusion again on operators that have already been fused will not lead to a performance improvement.
Proof. Assume, for contradiction, that fusing a fused operator F with an independent operator o again leads to improved performance. Let P_i denote the performance at iteration i. (1) In the first iteration, the fusion groups obtained are as large as possible, thus F and o cannot be fused again. (2) In iteration i (i ≥ 2), there are two possible scenarios. a) In iteration i-1, F and o were in the same fusion group. This implies that an Hsplit was triggered during the operator optimization step, resulting in the separation of o as an independent operator. In this case, if P_i > P_{i-1}, performing the fusion again in iteration i will not yield better performance. b) In iteration i-1, F and o were in different fusion groups. This implies that the fusion group consisting of F and other operators performs better than the group consisting of both F and o, where the attribution of o is vital to the performance. Even after the operator optimization step, the fusion of F and o can only achieve sub-optimal performance. Therefore, performing the fusion again in iteration i will not yield better performance. □
Rule 2: For adjacent operators, if a global synchronization is required before executing the successor, then these two operators need to be assigned to different fusion groups.
Supporting fact: Considering the execution mode of GPUs, synchronizing the outputs of preceding operators within a fusion group may incur significant overhead. Thus, effective sub-graph fusion needs to minimize global synchronization as much as possible.
Generation process of fusion groups. Figure 5 presents the generation process of fusion groups in AlphaGNN. The operators in the computational graph are scanned sequentially. First, we initialize a candidate set S as an empty set. Second, each operator is selected and added to S in order. For the i-th operator o_i, if Rule 1 is triggered, indicating a breakpoint, o_i is not put into S. All the operators present in S are then organized into a fusion group, and S is cleared. The operator scanning process continues after the breakpoint. Third, if two adjacent operators trigger Rule 2, the former operator is added into S, and all the operators in S form a fusion group. Once S is emptied, the successor operator is added to S, and the operator scanning process continues. This process repeats until the end of the computational graph is reached, at which point all operators remaining in S constitute the final fusion group.
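A minimal sketch of this scan is shown below; the operator attributes (`is_compute_intensive`, `is_fused`, `needs_global_sync_before`) are illustrative stand-ins for the metadata AlphaGNN attaches to operators, not its actual field names.

```python
def build_fusion_groups(operators):
    groups, candidates = [], []
    for op in operators:
        if op.is_compute_intensive or op.is_fused:       # Rule 1: breakpoint
            if candidates:
                groups.append(candidates)                # close the current group
            candidates = []                              # the breakpoint operator stays unfused
        elif candidates and op.needs_global_sync_before: # Rule 2: sync boundary
            groups.append(candidates)                    # former operators form a group
            candidates = [op]                            # successor starts a new group
        else:
            candidates.append(op)
    if candidates:
        groups.append(candidates)                        # remaining operators: final group
    return groups
```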
Remark. The two rules above only exclude two obviously non-profitable fusion scenarios, which involve compute-intensive operators and high-overhead global synchronizations. In computational graph optimization, a larger fusion group provides more room for the subsequent operator optimizations. Therefore, the fusion mechanism aims to make each fusion group as large as possible.

Performance-driven Operator Optimization
AlphaGNN coordinates the computational graph optimization and operator optimization iteratively, during which operators with varying semantics are generated. In this section, we propose the kernel split mechanism, which can also change the computational graph (Algorithm 3). Furthermore, a template-based code generation strategy is adopted to tune the kernels produced in the kernel split process.

Kernel Split Mechanism.
The kernel split mechanism aims to optimize the execution efficiency of fusion groups.
As stated in Section 3.1, it involves two types: vertical kernel split (Vsplit) and horizontal kernel split (Hsplit).
Vsplit. As the properties of the input data graph have a great impact on computation efficiency, the basic idea of Vsplit is to divide the data graph vertices into several parts according to the vertex degree and to customize kernels for each part individually. More specifically, we divide the data graph into three types of subgraphs, i.e., a dense subgraph, a middle subgraph, and a sparse subgraph. For each type, individual kernels are produced automatically to adapt to the irregularity of the subgraph. The original kernel executed on the whole data graph is replaced by several small kernels that execute in parallel on each of the subgraphs. To the best of our knowledge, AlphaGNN is the first to apply data graph partitioning to operator optimization for GNNs.
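A minimal sketch of the degree-based partitioning behind Vsplit is given below; the two degree thresholds are illustrative tuning knobs rather than AlphaGNN's actual values, and `dst` is assumed to hold the destination vertex index of each edge.

```python
import torch

def vsplit_partition(dst, num_nodes, sparse_thr=8, dense_thr=64):
    deg = torch.bincount(dst, minlength=num_nodes)                    # in-degree per vertex
    sparse = (deg <= sparse_thr).nonzero(as_tuple=True)[0]            # low-degree subgraph
    middle = ((deg > sparse_thr) & (deg <= dense_thr)).nonzero(as_tuple=True)[0]
    dense = (deg > dense_thr).nonzero(as_tuple=True)[0]               # high-degree subgraph
    # A customized kernel is then generated for each subset and the three
    # kernels are launched concurrently in place of the original one.
    return sparse, middle, dense
```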
Hsplit. The process of Hsplit (Lines 15-41 in Algorithm 3) tries to replace one kernel with several sequential kernels. In computational graph fusion optimization, a default assumption is that a bigger fusion group leads to better performance. However, as the Gather operators in a fusion group may introduce serious atomic operations as the size of the fusion group increases, a bigger fused kernel usually involves worse workload balance, more intermediate results, more memory accesses, and more atomic operations, which decelerate the performance of the fusion group. Hsplit trades these factors off to obtain higher performance. Therefore, we attempt to split the Apply and Gather type operators from the fusion group in pursuit of better performance.

Rules of Hsplit.
Hsplit traverses all the operators in the fusion group sequentially from either the head or the tail. When Hsplit encounters an Apply or Gather operator, it determines whether to split the fusion group into two parts based on the potential performance gain (Lines 20-34 in Algorithm 3). Hsplit continues until no more splits with performance gains can be identified.
Performance gain metrics. We evaluate the potential performance gain of a candidate horizontal kernel split based on the actual running time. In detail, we generate all the kernels before and after executing the candidate split, such as K, K_1, and K_2 in Algorithm 3. Next, the performance of each kernel is measured by executing it on the target GPU. By comparing the actual running times of these kernels, we can determine whether the candidate split achieves the potential performance gain. Furthermore, to reduce the assessment overhead, we preserve the performance profiling data for reuse in the subsequent coordinated optimization process.
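The decision can be sketched as follows, assuming hypothetical kernel callables and a simple wall-clock timer; in practice AlphaGNN profiles the generated CUDA kernels directly.

```python
import time
import torch

def time_kernel(kernel, inputs, warmup=3, iters=10):
    for _ in range(warmup):
        kernel(inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        out = kernel(inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters, out

def accept_hsplit(fused_kernel, k1, k2, inputs):
    t_fused, _ = time_kernel(fused_kernel, inputs)
    t_k1, mid = time_kernel(k1, inputs)          # K_1 produces the intermediate result
    t_k2, _ = time_kernel(k2, mid)               # which K_2 then consumes
    return (t_k1 + t_k2) < t_fused               # keep the split only if it is actually faster
```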

Automatic Code Generation Mechanism.
We utilize a template-based method to implement the code generation. As stated in the previous sections, the operator optimization process produces single kernels and fused kernels. For single kernels, we create an individual template for each of them. This is feasible due to the limited number of single operators in GNN models. For fused kernels, each of them usually covers a large design space of operator combinations. Fortunately, existing works [3, 11, 15, 27] have demonstrated the feasibility of generating efficient sparse kernels with rich semantics through code generation techniques. AlphaGNN's automatic code generation is inspired by these works. The code generation mechanism extracts the computation and memory access behaviors of the target kernel. By utilizing predefined code fragments containing tunable parameters, we construct a sparse kernel that aligns with the semantics of a fusion group.
During the construction of kernel templates, AlphaGNN directly extracts necessary information from the fusion group.It performs a topological traversal of the operators in the computational graph, and selects appropriate code fragments to build the kernel template for each operator.It is important to note that for inference optimization, write-back code for intermediate results is not generated.However, for training optimization, write-back code is generated for intermediate results based on the process of backward propagation and their usage in corresponding operators.
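A much-simplified sketch of the template idea follows: each operator pattern maps to a parametrized code fragment, and the fragments of a fusion group are joined in topological order into one kernel body, with write-back of intermediates emitted only for training. The fragment strings, operator attributes, and parameter names here are illustrative, not AlphaGNN's actual templates.

```python
# Hypothetical per-pattern CUDA code fragments with tunable parameters.
FRAGMENTS = {
    "Scatter":   "float msg = x_src[eid * {dim} + d] {binop} x_dst[eid * {dim} + d];",
    "ApplyEdge": "msg = {fn}(msg);",
    "Gather":    "atomicAdd(&out[dst[eid] * {dim} + d], msg);",
}

def generate_fused_kernel(fusion_group, dim, training=False):
    body = []
    for op in fusion_group:                              # topological order of the group
        frag = FRAGMENTS[op.pattern].format(
            dim=dim,
            binop=getattr(op, "binop", "*"),             # assumed operator metadata
            fn=getattr(op, "fn", "expf"),
        )
        body.append(frag)
        if training and getattr(op, "has_intermediate", False):
            # Write back intermediates needed by backward propagation.
            body.append(f"{op.name}_buf[eid * {dim} + d] = msg;")
    return "\n".join(body)                               # kernel body as a code string
```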

Implementation Details and Infrastructure.
We use PyTorch for the implementation and utilize the high-performance dense operator implementations in cuBLAS. Meanwhile, all the sparse operators are generated by AlphaGNN. AlphaGNN constructs a series of templates, and the code generation joins these templates together, which involves heuristically searching for the appropriate templates and selecting optimal configurations for each template.
For portability, it takes very little effort to integrate AlphaGNN into existing frameworks such as PyG and DGL. For example, we only need to package our computational graph rewriting rules and automatic code generation functions into PyG or DGL, which are then used to describe the coordinated optimization strategy. It is worth mentioning that this work develops a map function in C++ that facilitates data transformation between PyG and AlphaGNN, enhancing interoperability.

EVALUATION

Experimental Setup
We choose three widely used GNN models, GCN [19], GAT [23], and APPNP [7], for evaluation because of their diversity and evolving complexity. Meanwhile, we provide experiments with two additional models, RGCN [20] and SAGE [10], for their inference processes. For the baselines, we select DGL, PyG, Graphiler, and Seastar for their popularity as state-of-the-art GNN frameworks. As these frameworks received updates recently, we evaluate the following versions: DGL 2.0.0, PyG 2.4.0, Seastar 0.0.1, and Graphiler 0.0.1. For the datasets, we choose graphs from a variety of domains, as shown in Table 3. We also conduct experiments on some large graphs from the OGB (Open Graph Benchmark) dataset (Section 5.4). We implement AlphaGNN with a CUDA 11.1 backend and a PyTorch 1.8.2-based front end. The dense operators are implemented with cuBLAS 11.1. Most of the evaluation results are obtained on an NVIDIA Tesla V100 GPU. To evaluate the portability of AlphaGNN, we further evaluate different architectures, such as the RTX 2080 and the Ampere A100 GPUs. AlphaGNN utilizes the whole data graph for both GNN training and inference without sampling sub-graphs, so the batch size equals 1. Moreover, the configuration of the different GNN workloads is shown in Table 2.

Overall Performance
We first evaluate the overall performance of AlphaGNN on model inference and training. The results show that AlphaGNN achieves an average 2.94× speedup over state-of-the-art GNN frameworks on training and inference. The most significant speedup of 12.39× is achieved when compared with PyG on APPNP model inference.
On GNN Inference. Figure 6 shows the inference performance comparison between AlphaGNN and the baselines. In general, AlphaGNN demonstrates competitive performance on all the models and is faster on most of the graph datasets. Three observations from the results are presented as follows.
First, AlphaGNN achieves 1.96×, 4.57×, 3.40×, and 1.89× speedups on average on GCN, compared with DGL, PyG, Seastar, and Graphiler, respectively. On the APPNP model, AlphaGNN achieves 5.11×, 3.17×, 2.03×, and 1.90× speedups on average. These improvements are attributed to optimizations at both the graph and operator levels. Specifically, DGL uses cuSPARSE for graph operations in GCN, whereas AlphaGNN generates kernels automatically by searching a large configuration space for higher performance. PyG's operators for message passing expand the feature matrix, increasing memory redundancy; AlphaGNN, however, reduces this redundancy with sub-graph fusion. Both Graphiler and Seastar use pattern-based reordering and fusion, which ignore the trade-off between workload balance and the number of operators, whereas AlphaGNN effectively integrates sub-graph fusion with kernel splits for enhanced performance. Second, on the GAT model, our improvement is 2.97×, 2.78×, 2.28×, and 1.40× speedup on average over DGL, PyG, Seastar, and Graphiler, respectively. Additionally, AlphaGNN achieves a 1.8× to 15.4× speedup (Table 4) compared with the latest version of PyG (2.5.1). As GAT has a complicated computational graph, the coordinated optimization of AlphaGNN fully exploits optimization opportunities and outperforms the baselines in all the cases. Third, AlphaGNN performs better on graphs with higher average vertex degrees because of the adaptive selection of the vertex-centric and edge-centric parallel manners in the sub-graph fusion and kernel split processes.
On GNN Training. As Graphiler does not support GNN training, this evaluation focuses on comparing AlphaGNN with DGL, PyG, and Seastar. The optimization of AlphaGNN is general to both inference and training of GNN models, which brings significant performance improvements in several experiments. As shown in Figure 7, on the GAT, GCN, and APPNP models, AlphaGNN achieves 1.31∼2.70×, 1.49∼3.70×, and 1.11∼2.77× speedups over DGL, PyG, and Seastar, respectively, which shows the superiority of AlphaGNN on GNN training.

Detailed Analysis
On Graph Transformation. Figure 8 illustrates the benefits of the graph transformation in reducing computation. DGL outperforms PyG and Seastar because DGL applies operator reordering to reduce computational redundancy. As shown in Figure 8, AlphaGNN reduces the calculation amounts by up to 35% compared with DGL, because the graph transformation of AlphaGNN is rule-based and able to efficiently reduce computational redundancy, while the pattern-based operator reordering is manual and limited by human experience. As for the GCN and APPNP models, the simplicity of their model structures means the computational redundancy of AlphaGNN is close to that of the others, and the performance improvement mainly originates from the increase in hardware utilization.
On sub-graph fusion and kernel split. Memory consumption serves as a crucial metric for evaluating the effectiveness of sub-graph fusion and kernel split in improving GNN training performance; Figure 9 reports the corresponding results.
The overhead of AlphaGNN. AlphaGNN is an offline strategy, and its configuration space is large. Table 5 presents the overhead of AlphaGNN for optimizing the inference of the GAT model. The overhead of AlphaGNN is less than 7 minutes, which is a one-time expense. It is worth noting that each optimized result is consistently reused in applications.
Sensitivity to GPU Architecture. To show the portability of AlphaGNN across GPU architectures, we evaluate it on different platforms. Figure 12 shows that compared with DGL, PyG, Seastar, and Graphiler, AlphaGNN achieves at least 1.45×, 1.47×, 1.16×, and 1.05× speedups on the RTX 2080, and 2.23×, 1.14×, 1.03×, and 1.10× speedups on the A100, respectively.

RELATED WORKS
GNN Frameworks. PyG [4] and DGL [24] are vital GNN frameworks offering user-friendly interfaces and are widely utilized by GNN researchers. To address the challenges of executing GNNs on larger graphs, frameworks like NeuGraph [16] introduce methods such as pipeline processing and graph partitioning to scale computations across multi-GPU and multi-node systems, aiming to minimize communication time during distributed training. Recent works also include Legion [21], which enhances multi-GPU training; TC-GNN [26], which focuses on implementing GNNs on tensor cores; Betty [31], which introduces a new sampling method for training; and BLAD [5], which speeds up dynamic GNNs. These developments, however, are orthogonal to the approach of AlphaGNN. In the area of compiler-based optimization, frameworks like Seastar [27], Graphiler [28], and HGL [9] exploit sub-graph fusion techniques to optimize computational graphs and generally achieve high performance. AlphaGNN distinguishes itself by generating efficient GNN code directly from the data graphs and models.
Sparse operation optimization. Sparse operations exist in many contexts, such as graph processing, SpMV [17], SpMM [30], and SpGEMM [13]. spECK [18] optimizes sparse general matrix-matrix multiplication using lightweight analysis. The Tensor Algebra Compiler (TACO) [15] uses compiler techniques to achieve performance competitive with hand-optimized kernels for both sparse tensor algebra and sparse linear algebra. These optimizations all focus on a single primitive kernel with simple semantics. AlphaGNN develops an automatic code generation mechanism based on a uniform abstraction of GNN operations and action-based modeling.
Deep learning compilers. Recent works on deep learning compilers, such as TVM [1], have introduced various optimizations. TVM optimizes at the graph level and automates operator optimization, producing kernels comparable to highly optimized libraries like cuDNN [2]. Most existing compilers optimize at the graph level and the operator level separately, while AlphaGNN coordinates optimization across both levels, enhancing overall computational efficiency.

CONCLUSION
This paper proposes AlphaGNN, a coordinated strategy combining computational graph and operator optimizations to search for efficient GNN implementations iteratively. It liberates GNN optimization from the constraints of human expertise and effectively reduces redundancy in computation and memory access. The evaluations confirm that AlphaGNN achieves up to 12.39× (2.94× on average) performance improvement over the state-of-the-art methods on diverse GNNs.

Figure 1: Performance improvements of different optimization methods over the direct implementation calling cuDNN.

Figure 2: Vsplit and Hsplit. The execution time is normalized against the longest experimental case.

Figure 3: Computational graph change due to kernel splits.

Figure 4: Overview of AlphaGNN. In the GAT model, the coordinated optimization strategy alternately executes computational graph (CG) and operator optimization.

Figure 5: The generation process of fusion groups.

Figure 6: Performance improvement of AlphaGNN over DGL, PyG, Seastar, and Graphiler on GNN inference.

Figure 7: Performance improvement of AlphaGNN over DGL, PyG, and Seastar on GNN training. The missing bars indicate OOM.

Figure 10: GM transaction counts for model inference. Missing bars indicate OOM.

Figure 2 (kernel pseudocode). Original fused kernel:
  for edge in Graph.edges in parallel: edge.mid = ApplyEdge(edge.input)
  for node in Graph.nodes in parallel:
    for neighbor in node.neighborhood: node.out += neighbor.mid
  return Graph.nodes.out
After the split, Kernel 1:
  for edge in Graph.edges in parallel: edge.mid = ApplyEdge(edge.input)
  return Graph.edges.mid
Kernel 2:
  for node in Graph.nodes in parallel:
    for neighbor in node.neighborhood: node.out += neighbor.mid
  return Graph.nodes.out

Table 1: Representative computational graph rewriting patterns. In this table, the symbols +, || and ⊙ stand for vector addition, vector concatenation, and an arbitrary binary vector operation. s, d, and e stand for the features of the source node, destination node, and edge. W stands for a linear operator, and f stands for any operation that satisfies the commutative and distributive laws. S and G stand for operators with the Scatter and Gather patterns, and copy stands for copying data from the source or destination vertex to the edge. |E| and |V| stand for the number of edges and vertices.

Table 2: The configuration of different workloads.

Table 3: Representative graphs for benchmarking.

Table 4: The comparison between AlphaGNN and the latest PyG on the inference of the GAT model.

Table 5: The overhead (seconds) of AlphaGNN for optimizing the inference of the GAT model.