Bridging the Gap between Spatial and Spectral Domains: A Unified Framework for Graph Neural Networks

Deep learning’s performance has been extensively recognized recently. Graph neural networks (GNNs) are designed to deal with graph-structural data that classical deep learning does not easily manage. Since most GNNs were created using distinct theories, direct comparisons are impossible. Prior research has primarily concentrated on categorizing existing models, with little attention paid to their intrinsic connections. The purpose of this study is to establish a unified framework that integrates GNNs based on spectral graph and approximation theory. The framework incorporates a strong integration between spatial-and spectral-based GNNs while tightly associating approaches that exist within each respective domain.


INTRODUCTION
Deep learning's performance in various machine learning tasks [96,120,145,168,170,203] has been extensively recognized in recent decades, with amazing success on Euclidean data.In recent decades, a slew of new applications have emerged in which effective information analysis boils down to the non-Euclidean geometry of data represented by a graph, such as social networks [119], transportation networks [21], spread of epidemic disease [156], brain's neuronal networks [148], gene data on biological regulatory networks [58], telecommunication networks [66], and knowledge graph [137].Previous deep learning algorithms, such as convolutional and recurrent neural networks, couldn't handle such non-Euclidean problems on graph-structured data.Modeling data using a graph is difficult because graph data is irregular, i.e., each graph has a different number of nodes and each node in a graph has a varied number of neighbors, making some operations like convolutions inapplicable to the network structure.
There has recently been a surge growing interest in applying deep learning to graph data.Inspired by deep learning's success, principles from deep learning models are used to handle the graph's inherent complexity.This growing trend has piqued the interest of the machine learning community, and a huge number of graph neural networks (GNN) models have been proposed based on diverse theories [14,38,62,87,114,188] and grouped into coarse-grained groupings like the spectral [88,155,204,230,234] and spatial [14,87,188] domains.GNNs have seen rising popularity recently for learning graph representations and quickly spread out to many application domains, such as physics [7,113], chemistry [35,57], knowledge graph [13,211,220,229], recommender systems [49,70,193,202], computer vision [78,104,131], natural language processing [20,182,200], combinatorial optimization [41,80,174], traffic network [39,45,103,219], program representation [64,198,236], social network [24,163,202].However, current research in methodology has not translated into a clear understanding of the mechanisms involved, nor has it given us insight into GNNs' effectiveness or physical meaning.As a result, several consequences will occur: (1) There is no underlying principle that connects all GNNs, which also limits their growth.(2) In high-stakes applications such as drug development, GNN models may carry potentially hazardous unknowns since they are black boxes.Consequently, the necessity of dissecting GNNs is highlighted, thereby driving academics to hunt for a more universal framework.The main problem is that existing GNN models use a variety of techniques, including random walks [71,85,164], PageRank [30,54,115,116], attention models [112,118,188], low-pass filters [29,158], and message forwarding [81,82].Some preliminary research can only explain a few GNNs methods [82,209,222], leaving the majority of GNN unaccounted for.Previous GNNs surveys have dealt mostly with classifying several existing models into multiple categories and expanding on each category individually, with no regard to the interrelationships between them [88,155,204,230,234].
This research1 aims to provide a coherent framework for generalizing GNNs by bridging the divide between seemingly unrelated works in the spatial and spectral domains, as well as by linking methods within each domain.The study will build a unified theoretical framework that encompasses diverse GNNs.Our research is novel in that it connects disparate GNN models, allowing for direct rethinking and comparison of all GNN models.

Graph Neural Networks in Spatial and Spectral Domain
Over the past several years, GNNs have gained a lot of attention.However, the existence of numerous GNNs complicates model selection because they are not easily understood within the same framework.Specifically, one uses spectral theory to implement early GNNs [62,89], whereas spatial theory is used to propose others [46,87].The mismatch inherent to spectral and spatial approaches means that direct comparisons are difficult.Even in each area, there are numerous models, which makes it difficult to examine their strengths and weaknesses.
To untangle the mess, we present a unified framework that connects the spatial and spectral domains and reveals their intricate relationship.Furthermore, both domains' subcategories are proven to have a hierarchical link.The focus on a unified framework adds to the knowledge of how GNNs operate.The goal of this research is to use a combination of spectral graph theory and approximation theory to investigate the relationship between important categories, such as spatial and spectral-based approaches.We give a detailed analysis of GNNs' current research findings in this paper, as well as a discussion of trending topics such as over-smoothing.Many well-known GNNs will be used to demonstrate the universality of our architecture.This article's main motivation is twofold: (1) Connecting the spectral and spatial domains.The fundamental concepts, principles, and physical implications of spectral-and spatial-based GNNs are significantly different due to their distinct features.As a result, we present an overview of the fundamental principles and properties of spectral-and spatial-based GNNs in order to help people better grasp the problems, potential, and necessity of GNNs.Formally, a rigorous equivalence is established, indicating that spatial connection function is comparable to spectral filtering; (2) Dissecting spectral and spatial domains, respectively.In spectral techniques, filtering functions on eigenvalues are examined, and the filtering function choice can be matched with various tactics in approximation theory.While spatial methodologies are used to explore attribute aggregation, which may be understood from the size and direction within a predetermined region.
The structure of the article is summarized as follows: Basic concepts, distinctive principles, and properties of graph neural networks are covered in Section 2, as well as ways for encoding the graph, spectral-based GNNs, spatial-based GNNs, and essential fundamentals.The proposed framework is summarized in Section 3, which emphasizes the relevance of hierarchy.From Section 4 through Section 5, we explore exemplary GNN models in each domain using our proposed taxonomy structure.In Section 6, we go over the advantages and disadvantages of each domain in detail, as well as practical guidance on GNN model selection.Section 7 also includes a case study of our techniques, demonstrating our proposed framework with current and relevant GNNs concerns.

Related Surveys and Our Contributions
Existing works can be divided into three groups: Existing Works Group 1 (Comprehensive Collection): Recently, many extensive surveys on graph neural networks have been compiled [37,88,204,230,234,241]. Instead of studying their hierarchical and underlying mechanisms, most existing surveys focus on gathering newly published works and categorizing them into separate categories.A detailed survey, in particular, provides an overview of many examples of graph neural networks, classifying them as spatial or spectral-based techniques [37].A taxonomy of graph types, training methods, and propagation processes was recently published in [234].Another survey [230] categorized graph neural network advances as semi-supervised (graph convolution), unsupervised (graph auto-encoder), and latest advancements (graph recurrent neural network and graph reinforcement learning).Graph convolution, graph auto-encoder, graph recurrent neural network, and spatial-temporal graph neural networks are all included in [204].These existing surveys, on the other hand, fail to integrate their categories into a cohesive framework.Existing Works Group 2 (Particular Perspectives): The second thread of surveys for graph neural networks is from diverse theoretical perspectives.For example, in the field of graph neural networks with an attention mechanism, a comprehensive and concentrated survey was undertaken [121].Another example demonstrated how many graph neural networks with negative sampling might be merged into an analytical matrix factorization framework [167].One similar study offered a general view proving that network embedding techniques and matrix factorization are equal in terms of two objectives: one for similar nodes and the other for distant nodes [139].One specific survey created four benchmark datasets with diverse features and user-friendly interfaces for 10 common algorithms, providing a unified paradigm for systematic categorization and analysis on several existing heterogeneous network embeddings approaches [213].A recent work analyzed anonymous and degree-aware message-passing to study the distinguishing power of different classes [81].However, this research is limited to a subset of the GNNs family and lacks a global perspective.Existing Works Group 3 (Post-Hoc Explanation): Building post-hoc models and then identifying the underlying patterns from a statistical standpoint is another technique to analyze GNNs [18,122,136,224].Because neural networks are employed without any theories or domain expertise, this methodology is referred to as "black box".For this reason, post-hoc models have the potential to be biased, subject to adversarial attacks, and difficult to verify.Our research focuses on interpretable graph neural networks, which have a strong theoretical foundation.Previous surveys either categorize several disparate groups or only address a few GNNs using a certain methodology.Following the overall goals of our framework, we want to create one framework that unifies GNNs across the spatial and spectral domains as well as within each domain via theoretical support.It should be noted that the majority of the work presented is related to Graph Convolution Networks (GCN) [114], which is the most common type of GNNs, and that many other varieties of GNNs are still based on GCN.As a result, we do not differentiate between GNNs and GCN in this context and refer to GNNs in the following sections.

PROBLEM SETUP AND PRELIMINARY
In this section, we outline basic concepts, necessary preliminary, and problem setup of learning node-level representation which is the major task in the GNN literature.A simple graph is defined as G = (V, E), where V is a set of n nodes and E represents edges.An entry   ∈ V denotes a node, and  , = {  ,   } ∈ E indicates an edge between nodes  and .The adjacency matrix A ∈ R  × is defined by if there is a link between node  and , A , = 1, and else 0. Node features X ∈ R  × is a matrix with each entry   ∈ X representing the feature vector on node .Another popular graph matrix is the graph Laplacian which is defined as L = D − A ∈ R  × where D is the degree matrix.Due to its generalization ability [32] , the symmetric normalized Laplacian is often used, which is defined as Another option is random walk normalization: L = D −1 L. Note that normalization could also be applied to the adjacency matrix, and their relationship is L = I − Ã.Most GNNs focus on node-level embeddings, learning how a graph signal is modified by

V
The set of nodes in a graph.

E
The set of edges in a graph.

A, Ã
The adjacency matrix and its normalization.

L, L
The graph Laplacian matrix and its normalization.
Eigenvector matrix and its transpose.U  ∈ U, u ⊺  ∈ U ⊺ Single eigenvector and its transpose.

D
The degree matrix of A and The feature matrix of a graph.Z ∈ R  × New node feature matrix.
The node hidden feature matrix.
The hidden feature vector of node  . node number  dimension size of hidden feature ⊙ Element-wise product.

N (𝑣)
Directed neighbors of node  a graph topology, and outputting a filtered feature as: where we aim to find a mapping which can integrate graph structure and original node features, generating a update node representation Z.  represents the graph connectivity, and many options are available as listed in Table 3, and most popular are symmetric normalized graph matrices.
In this survey, we use the graph Laplacian, adjacency matrix, and their modifications described in Table 3 to represent a graph.So yet, no experimental or theoretical evidence has been shown that any filter has a consistent advantage [195].This survey is investigating two specific groups of GNNs, namely spectral-and spatial-based GNNs, which are defined as below: Definition 2.1 (Spatial Method).By integrating graph connectivity  and node features X, the updated node representations (Z) are defined as: Random walk column normalized adjacency where  is often implemented with A or L in existing works.Therefore, spatial methods focus on finding a node aggregation function  (•) that learns a proper aggregation with node features X to obtain a updated node embedding Z.
Before introducing another definition, the necessary preliminary background is offered: (1) graph Fourier transform: The graph Laplacian L can be diagonalized [180,244] as L = U Λ U ⊺ , where Λ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues (i.e., Λ  =   ), and U represents eigenvectors.Further, the graph Fourier transform of a signal X is defined as X = U ⊺ X ∈ R  × and its inverse as X = U X. (2) spectral convolution: According to the Convolution Theorem (i.e., Fourier transform of a convolution of two signals is the element-wise product of their Fourier transforms) [161], the convolution is defined in the Fourier domain such that where ⊙ is the element-wise product, and  1 / 2 are two signals defined on the time or spatial domain.
Definition 2.2 (Spectral Method).A node signal  2 = X is filtered by spectral function g = U ⊺  1 as: where g is known as frequency response function.If g is polynomial, rational or exponential function, then we can reduce the equation above to: In short, the objective of spectral methods is to learn a frequency response function g(•).
The goal of both methods is to learn how to approximate node aggregation or a frequency response function using the data.A great deal of approximation techniques are utilized, and thus  or  can be efficiently estimated.Approximation theory is a branch of mathematics dedicated to discovering and quantifying the errors caused when functions are approximated using smaller functions.Despite the fact that polynomials have a more convenient form than rational functions and are more popular due to its efficiency, rational functions are better at approximating functions at singularities and on unbounded domains.The basic characteristics of rational functions are outlined in complex analytic literature [4,6,34,56,149,162,165,166,169,186,246].As an important polynomial approximation, Chebyshev approximation is first introduced as spectral filtering for graph convolution: A real symmetric graph Laplacian L can be decomposed as Chebyshev approximation on spectral filter g is applied [62,89] so that is can be approximated with a polynomials with order k: A most popular graph convolutional network [114] further simplifies this approximation by reducing the order to 1: As a result,  0 is the only parameter to learn.The learnable parameters in many different GNNs can vary based on the model design.

FRAMEWORK OVERVIEW
The development of GNNs is briefly discussed below with representative studies before we introduce our proposed theoretical framework.Table 4 depicts many sample models that focus on node-level graph convolution.The spectral perspective was previously explored (SGWT [89]), and it serves as the technical foundation for all subsequent spectral methods, including spectral convolution and approximation.Researchers continue to add to this thread, demonstrating that spectral methods have the ability to handle graphs (SGNN [38], ISGNN [94], ChebNet [62]).Furthermore, GCN [114], and GraphSAGE [87] create effective training methodologies, gaining considerable attention from various communities.After that, spectral techniques development stagnated, with the exception of a few publications on rational filtering (RationalNet [51], AR [129], ARMA [25]).Meanwhile, focus shifts to the spatial domain, which has dominated GNNs to this point.Random walks (ParWalk [201], DeepWalk [164], LINE [184]) and CNN (DCNN [14]) were used in early spatial approaches.Following that, MPNN [82] solidified the message-passing mechanism in spatial techniques.Highorder polynomial approximation has been studied [117,173,199,209], but only within the context of ChebNet or DCNN.It's worth noting that while numerous publications described their suggested methods from both spatial and spectral perspectives, only a few GNNs are covered [117,155].Until recently, spectral research has demonstrated a resurgence.

Inside the Spatial and Spectral Domain
The hierarchical link between the spatial and spectral domains is depicted in this subsection.Spatial-based techniques can be divided into three types, with specialization and generalization relationships existing: where it is a generalization from left to right, and specialization from right to left.Specifically, Linear Aggregation (A-1) comprises all algorithms for learning a linear function among neighbors in the first-order.Higher-order neighbors are included in Polynomial Aggregation (A-2), and the order number is defined by the polynomials.Additionally, Rational Aggregation (A-3) utilizes self-aggregation.Since the inclusion of higher-order neighbors causes linear aggregation (A-1) to transform into polynomial aggregation (A-2), and polynomial aggregation (A-2) to transform into rational aggregation (A-3) if self-aggregation is added.The approaches falling under the general category of spectral analysis can be grouped into three distinct groups: Eigenvalues Connectivity  which includes left-to-right generalization and right-to-left specialization.Concretely, (B-1) outlines all models that aggregate frequency components using a linear function, while (B-2) uses polynomial approximation, and (B-3) applies rational approximation.Therefore, (B-1) can be generalized as (B-2) if replacing linear approximation with polynomial approximation, (B-2) is generalized as (B-3) if replacing polynomial approximation with rational approximation.

Between the Spatial and Spectral Domain
In this section, we go through the link between the spatial and spectral domains over the border.
There is also equivalency by modifying the analytical form of these subcategories, as shown below.Linear Aggregation (A-1) and Linear Approximation (B-1) are the initial equivalences: (A-1) Linear Aggregation ⇔ (B-1) Linear Approximation, which means that adjusting weights on neighbors in linear aggregation equates to adjusting weights on frequency components in linear approximation using a linear function.Linear Aggregation (A-1) and Linear Approximation (B-1) have the same linear function and can seamlessly convert to each other in closed form.The main difference between them is that Linear Aggregation (A-1) recovers the signal as a linear function of the frequency component, whereas Linear Approximation (B-1) models the target signal as a linear function of neighbor nodes.Both Linear Aggregation (A-1) and Linear Approximation (B-1) aggregate the representations of neighbors by tweaking the weights of each neighbor, or uses a linear filter on eigenvalues with a negative slope, i.e., g(Λ) = − Λ +.
Because the low-frequency components are given a higher weight by  than their original values, this is referred to as low-pass filtering (i.e., eigenvalues).This group's main advantages are (1) its low computational cost and (2) the large number of real-world scenarios that are subject to the homophily assumption (i.e., neighbors are similar).The fundamental disadvantage is that not every network guarantees homophily.Polynomial Aggregation (A-2) and Polynomial Approximation (B-2) are identical in terms of actual operation, i.e., This means that in polynomial aggregation, aggregating higher orders of neighbors can be expressed as the sum of different orders of frequency components in polynomial approximation.Both use higher-order neighbors in addition to first-order neighbors, increasing the capacity to simulate a more complex relationship among the neighbors.It is theoretically more powerful than (A-1)/(B-1) from a spectral standpoint, because (A-2)/(B-2) is a polynomial approximation as a spectral filter, whereas (A-1)/(B-1) is linear regression.As a result, one flaw is the cost of border neighborhood, which leads to higher computational complexity than (A-1)/ (B-1).Another flaw of them is that if the order is set too large, it will over-smooth (i.e., all nodes will look the same), and there is no golden rule for order size because it is based on data.Note that K-layer (A-1) or (B-1) is equivalent to K-order of (A-2)/(B-2), hence stacking K-layer (A-1) or (B-1) causes over-smoothing (B-1).
Similarly, the last equivalence relationship is in which rational aggregation defines a label aggregation with self-aggregation, while rational approximation adjusts filter function with rational approximation.Both alleviate the over-smoothing issue by introducing self-aggregation, which limits the intensity of uni-directional aggregation in the spatial domain.From a spectral perspective, this advantage can be interpreted as the superiority of rational approximation (A-3/B-3) over polynomial approximation (A-2/B-2).In particular, the rational approximation is more powerful and precise, particularly when estimating some abrupt signals like discontinuity.[4,34,56,149,165,166,186] As a result, we may summarize the advantages and disadvantages of each combination as illustrated in Figure 2. The category selection is based on the data complexity and the efficiency required, as there is a trade-off between computational efficiency and generalization capability.Table 5 shows the structure of sections 4 and 5, where we will discuss details of these two threads respectively and exemplify using several representative graph neural networks.The second column denotes spatial perspective (section 4, A-0), which can treat popular GNNs as learning a function of adjacency matrix, or node aggregation function.Similarly, the third column means spectral view (section 5, B-0), which sees GNN as learning a function of eigenvalues or frequency response functions.The cell at the intersection of the second row and second column means that this category of GNNs can be treated as a linear function of adjacency matrix, or say, its node aggregation function is linear.The other intersection cells follow a similar schema.Note that categories within the same row have the same format and function.For instance, in the second row, A-1 and B-1 share the format of a linear function, but their parameters need not be identical.

SPATIAL-BASED GNNS (A-0)
In the current literature, spatial approaches such as self-loop, normalization, high-order neighbors, aggregation, and node combination are often explored.We established a new taxonomy for spatialbased GNNs based on these operations, dividing them into three separate groupings as below.

Linear Aggregation (A-1)
A lot of research has gone into understanding the aggregation among first order neighbors (i.e., direct neighbors) [82,87,164,188,208,209].The supervisory signal patterns are revealed by altering the weights for the node and its first order neighbors.The revised node embeddings, Z(  ), can be sub-section 4.3 (A-3) represented in the following way: where   denotes a neighbor of node   , h(•) is their representations, and Φ/Ψ indicate the weight functions.The first item on the right hand side denotes the weighted representation of node   , while the second represents the update from its neighbors.By applying random walk normalization (i.e., dividing neighbors by degree of the current node), Equation 5and its symmetric normalization can be written as: where   represents the degree of node   .Normalization has better generalization capacity, which is not only due to some implicit evidence but also because of a theoretical proof on performance improvement [107].In a simplified configuration, weights for all the neighbors are the same and is a scalar value  , while the weight for self node  is another scalar value.Therefore, they can be rewritten in matrix form as: Equations 7 can be generalized as an identical form: where Ã denotes normalized A, which could be implemented by random walk or symmetric normalization.As shown in Figure 3, the new representation of the current node (in red) is updated as the sum of the previous representations of itself and its neighbors.(A-1) may adjust the weights of the neighbors.The following are a few state-of-the-art approaches that were chosen to demonstrate this schema: As the one state of the art, GCN [114] adds a selfloop to nodes, and applies a renormalization trick which changes degree matrix from D  =  A   to D =  (A + I)   .Specifically, GCN can be written as: where Â = A + I, and Ã is normalized adjacency matrix with self-loop.Therefore, Equation 9 is equivalent to Equation 8 when setting  = 0 and  = 1 with the renormalization trick.Besides, Fig. 3. Illustration of A-1: the current node (red) updates itself with its original representation plus neighbors.
GCN takes the sum of each node and average of its neighbors as new node embeddings.Note that the normalization of GCN is different from the others, but the physical meaning is the same.
4.1.2GraphSAGE.Computing intermediate representations of each node and its neighbors, GraphSAGE [87] applies an aggregation among its neighbors.Take mean aggregator as example, it averages a node with its neighbors, i.e., where h indicates the intermediate representation, and N denotes the neighbor nodes.Equation 10can be written in matrix form after implementing MEAN using symmetric normalization: which is equivalent to Equation 8with  = 1 and  = 1.Note that the key difference between GCN and GraphSAGE is the normalization strategy: the former is symmetric normalization with renormalization trick, and the latter is random walk normalization.

Graph Isomorphism Network (GIN).
Inspired by the Weisfeiler-Lehman (WL) test, GIN [209] developes conditions to maximize the power of GNNs, proposing a simple architecture, Graph Isomorphism Network (GIN).With strong theoretical support, GIN generalizes the WL test and updates node representations as: which is equivalent to Equation 8 with  = 1 +  and  = 1.Note that GIN does not perform any normalization.

Graph Attention Model (GAT)
. GAT [188] applies attention mechanism by adjusting neighbors' weights, instead of using uniform weights in many related works: where   ∈ R  × is a matrix, ⊗ denote element-wise multiplication, and calculated by a forward neural network   (, ) =  (h  , h  ) with a pair of node representations as input.GAT can be treated as learning dynamic weight on the neighbors since their weights are not uniform.MoNet [155] is similar to GAT, since its update follows: where u is a d-dimensional vector of pseudo-coordinates u(, ), and where   are learnable d × d and d × 1 covariance matrix and mean vector of a Gaussian kernel, respectively.Let   =  ( (•)) as a weight function of a pair of node representations representation, then it is also a attention model: These works do not consider updating nodes with their original representations, i.e.,  = 0 and  is replaced with matrix parameter  in Equation 8.However, it is easy to extend them with self node.

Polynomial Aggregation (A-2)
Several research use higher orders of neighbors to integrate deeper structural information [14,62,85,184,199].Because direct neighbors aren't always enough to describe a node's surroundings.
Excessive order, on the other hand, generally averages all node representations, resulting in oversmoothing and a loss of emphasis on the immediate neighborhood [129].Many models are motivated by this to fine-tune the aggregation strategy based on different orders of neighbors.As a result, adequate constraint and order flexibility are essential for node representation.Challenging signals, such as Gabor-like filters, have been shown to have a high order of neighbors [3].Define the shortest distance between node  and  as   (, ), and N (, ) to be the set of nodes  that satisfies   (, ) = , i.e., -order neighbors.Formally, this type of work can be written as: where Ψ ( ) indicates a scalar parameter for all -order neighbors.Setting the same order neighbors to share the same weights, Equation 17 can be rewritten in matrix form: where P  (•) is a polynomial function with order number k. Applying symmetric normalization, Equation 18 can be rewritten in matrix form as: where  =  0 , and A could also be normalized by random walk normalization: Ã = D −1 A. As shown in Figure 4, the new representation of the current node (in red) is updated as the sum of the previous representations of itself, its first and second-order neighbors.Note that the weights among those representations are learnable.Several existing works are analyzed below, showing that they are variants of Equation 18 or 19.

4.2.1
ChebNet.To bridge the gap, spectral convolutional operation is generalized, which requires expensive steps of spectral decomposition and matrix multiplication [69,94].Introducing truncated Chebyshev polynomial for estimating wavelet in graph signal processing, ChebNet [62] embeds a novel neural network layer for the convolution operator.Specifically, ChebNet can be written as: where   (•) denotes the Chebyshev polynomial and   is the Chebyshev coefficient.θ is the coefficient after expansion and reorganization.Since L = I − Ã, we have: which is exactly Equation 19.The predefined  is the order number of Chebyshev polynomial, and a larger K mean higher approximation accuracy in estimating the function of eigenvalues.Equation 21shows that K also can be explained as the highest order of the neighbors.

4.2.2
DeepWalk .Applying random walk, DeepWalk [164] first draws a group of random paths from a graph and applies a skip-gram algorithm to extract node features.Assuming the number of random walk is large enough, the transition probability of random walk on a graph can be represented as: Let the window size of skip-gram be 2 + 1 and the current node is the (t+1)-th one along each sampled random walk path, so the farthest neighbor current node can reach is the first one and the last one.A node and its neighbors are likely to appear in the same random walk path, and the neighbors follow the transition probability (Equation 22) to appear in the same path.Therefore, the updated representation is as follows: where I means that DeepWalk always considers previous representation of the current node as one element, Ã represents the direct neighbors' transition probability, and Ã2 denotes that of the second-order neighbors.It will have at most -order neighbors depending on the predefined length of the random walk (i.e., 2 + 1).

Diffusion convolutional neural networks (DCNN)
. DCNN [14] uses a degree-normalized transition matrix (i.e., renormalized adjacency matrix: Ã = D A) as graph representation, and performs node embedding update as: where Ã * denotes a tensor containing the power series of Ã, and the ⊙ operator represents elementwise multiplication.It can be transformed as:

Scalable Inception Graph Neural Networks (SIGN)
. By generalizing GCN [114], GAT [188] and SGC [199], SIGN [173] constructs a block linear diffusion operators along with non-linearity.For node-wise classification tasks, SIGN has the form: where [•, •, . . ., ] is concatenation, and  denotes the power number.Then, SIGN can be rewritten as: where  (A  XΘ  ) could be treated as refined representation of each order of label aggregation by non-linear function  and fully-connected layer Ω, i.e., A  X. Replacing  with normalized adjacency matrix, it can be rewritten as:

Graph diffusion convolution (GDC)
. GDC [117] defines a generalized graph diffusion via the diffusion matrix: where   are the weighting coefficients with ∞ =0   = 1,   ∈ [0, 1],  is a generalized undirected transition matrix which includes the random walk transition matrix   = A D −1 , and the symmetric transition matrix In the general case, it can be written as : 4.2.6 Node2Vec.Node2Vec [85] defines a second-order random walk to control the balance between BFS (breath-first search) and DFS (depth-first search).Consider a random walk that traversed an edge between node  and , denoted as (, ), and now it resides at node .Then, the transition probabilities to next stop  from node  is defined as: where  (, ) denotes the shortest path between nodes  and . (, )=0 indicates a second-order random walk returns to its source node, (i.e.,  →  → ), while  (, )=1 means that this walk goes to a BFS node, and  (, )=2 to a DFS node.The parameters  and  control the distribution of the three cases.Assuming the random walk is sufficiently sampled, Node2Vec can be rewritten in matrix form: which can be transformed and reorganized as: where transition probabilities Ã = D -1 A is random walk normalized adjacency matrix.

LINE[137]
/ SDNE [192] .These two models consider first order and second-order neighbors as the constraints for learning node embeddings.first order: the nodes representation is forced to be similar to its neighbors, which is equivalent to: Second-order: the pair of nodes are forced to be similar if their neighbors are similar, which is equivalent to make second-order neighbors similar, therefore we can get the second-order connectivity by taking the power of the original adjacency: Then the final learned node embeddings are formulated as: Since LINE uses concatenation between the representations constrained by first and second-order,  = 1; For SDNE,  is pre-defined.

Simple Graph Convolution (SGC).
To reduce the computational overhead, SGC [199] removes non-linear function between neighboring graph convolution layers, and combines graph aggregation in one single layer: where Ã is renormalized adjacency matrix, i.e., Ã = D-1 2 A D-1 2 , where D-1 2 is degree matrix with self loop.Therefore, it can be easily rewritten as: which only has the highest order term.

Rational Aggregation (A-3)
Most works merely consider label propagation from the node to its neighbors (i.e., gathering information from its neighbors) but ignore self-aggregation.Self-aggregation means that labels or attributes can be propagated back to themselves or restart propagating with a certain probability.This reverse behavior can avoid over-smoothing issue [115].Note that Polynomial Aggregation (A-2) may manually change the order number to relieve the over-smoothing issue, but Rational Aggregation (A-3) can do so automatically.Theoretically, rational function approximation is more effective than polynomial and has been researched in machine learning problems [33,165,185].Several works use a rational function on the adjacency matrix to perform self-aggregation, either explicitly or implicitly [25,51,102,115,124,130,143,187].
Because generic label propagation is achieved by multiplying the graph Laplacian, self-aggregation may be achieved by multiplying the inverse graph Laplacian as follows: where P and Q are two different polynomial functions, and the bias of Q is often set to 1 to normalize the coefficients.As shown in Figure 5, the new representations of the current node (in red) are updated as the previous one with probability P, and as that of neighbors with probability (1-P).The difference of A-3 beyond A-2 is that A-3 can avoid over-smoothing issue in an automatic manner [129,232].Over-smoothing issue happens when GNNs go deep, which would drive node features to a stationary point and average all the information from raw node representations.Graph .Illustration of A-3: These current nodes (red) are using the representation that predates this iteration and the surrounding nodes to compute the total.In A-3, the ratio of the original representation remains stable, whereas A-1 dose not control the ratio.
convolution can be described as an optimization problem [130,158,232,241], e.g., (1) minimizing the supervised loss and (2) keeping the local neighborhood similar: (1) supervised loss where  is the controlling weight between the two constraints.The problem has analytical solution: However, because ℎ increases with the number of times graph convolution is done, it is prone to over-smoothing.Over-smoothing is addressed in a variety of ways [50,98,172,173,210], including Rational Aggregation (A-3), which does so by retaining a portion of the original representation no matter how many iterations it does, greatly reducing over-smoothing.

4.
3.1 Auto-Regressive.Label propagation (LP) [22,233,243] is a widely used methodology for graph-based learning.The objective of LP is two-fold: one is to extract embeddings that match with the node label, the other is to become similar to neighboring vertices.The label can be treated as part of node attributes, so we have: which is the closed-form solution and also equivalent to the form of Equation 39, i.e., P = I and Q = (1 + ) I − Ã.

Personalized PageRank (PPNP)
. Obtaining node's representation via teleport (restart), PPNP [30,115,223] keeps the original representation (self-aggregation) X with probability .Therefore, 1- is the probability of performing the normal label propagation: where Ã = D -1 A is random walk normalized adjacency matrix with self-loop.Equation 43 is with a rational function whose numerator is a constant.

ARMA filter.
ARMA [25] filter approximates any desired filter response function with updates as: Note that ARMA filter is an unnormalized version of PPNP.When a+b=1, ARMA becomes PPNP.A partially absorbing random walk is a second-order Markov chain with partial absorption at each state.[201] shows that with proper absorption, the absorption probabilities can well capture the global graph structure.Note that the concept "absorption" in [201] is similar to "teleport" or "restart" in PPNP [115].ParWalks defines the aggregation as: where  is defined as a variable to control the level of absorption,    and   indicate non-negative matrix of pairwise affinities between vertex  and , and degree of vertex , respectively.
where  is redefined as a regularizer in the original paper [201].When  = 1, all nodes follow the same absorbing behavior.Otherwise, each node has an independent absorbing policy.Also, ParWalks model is equivelent to ARMA filter ( =  = 1 2 ) when  = 1 and with normalized Laplacian: The author also discussed the over-smoothing issue: when Λ = I and as  → 0, a ParWalk would converge to the constant distribution 1/, regardless of the starting vertex.4.3.5 RationalNet.To leverage higher order of neighbors, RationalNet [51] proposes a general rational function with a predefined order number, and it is optimized by Remez algorithm.The analytic form is exactly Equation 39.The major difference beyond PPNP or ARMA filter is that RationalNet generalized them, and the order can be any number.

Remark:
The optimization towards rational aggregation (A-3) is exactly the same as the residual learning that was first and widely used in image recognition [91].As shown in the work PPAP [164], the author proposed an iterative algorithm called APPAP, which is where Z  means the intermediate representations at -th layer.This format is exactly the same as residual learning as illustrated in Figure 6: sum of multiple graph convolutions serve as  (), and each time identify input will be added.The slight difference is that Equation 48 has normalized the weights, i.e., (1 − ) +  = 1.Because Rational Aggregation (A-3) includes the inverse of a matrix, it has a high computational cost.Iterative methods are commonly used to efficiently determine the inverse of a matrix [25,115,124].In this subsection, we only demonstrate one layer or iteration of Auto-Regression, PPNP, ARMA, and ParWalks.Multiple iterations will result in a more sophisticated rational function.Higher orders are achieved by several iterations or layers.The main distinction between rational and polynomial aggregation is whether or not the inverse graph Laplacian polynomial exists.In each cycle of rational aggregation, a fixed ratio for the original representation is always reserved, whereas polynomial aggregation does not.On the other hand, calculating the inverse of the graph Laplacian is expensive, making iterative fashion a key object in online learning algorithms [95].By leveraging the concept of conductance, with  as a heat distribution over the vertexes, L( ) indicates the flux induced by  over the graph.Then based on the representer theorem [12,176],  (V  ) = L −1 (L( )) could be interpreted as the heat at each vertex been expressed concerning or derived from the flux through every vertex.Thus, when L sends a heat distribution f over each node to flux through each vertex, L −1 sends some of the fluxes over the graph back to the original heat distribution (i.e., keep part of fluxes itself).Going back to the graph learning application, we first translate our updated "heat distribution" to flux through all of those nodes by calculating P(L( )).M-th degree of P(•) means that each vertex can update M-th neighbors at most.Then using another updated flux in the reverse direction, Q(L( )) −1 will adjust or reduce flux within N-th neighbors.Polynomial aggregation with more layers or a higher degree tends to involve more neighbors, increasing capacity.When utilizing too many layers or degrees, over-smoothing is almost always unavoidable (e.g., all nodes are similar).However, unless all of the layers or degrees are tried, determining the appropriate number of layers or degrees is difficult.The over-smoothing problem is largely overcome by the "sending back" (i.e., teleport) behavior of rational aggregation, in which the out-degree flux is restrained even if excesses of graph convolutional layers or approximation degrees are added.[115].Three groups of spatial methods introduced above (i.e., A-1, A-2, A-3) are strongly connected under generalization and specialization relationship, as shown in Figure 1.Generalization: By adding more neighbors of higher rank, Linear Aggregation (A-1) can be expanded to Polynomial Aggregation (A-2).By adding reverse aggregation, Polynomial Aggregation (A-2) can be advanced to Rational Aggregation (A-3); Specialization: Linear Aggregation (A-1) is a special case of Polynomial Aggregation where the order is set to 1. (A-2).Rational Aggregation (A-3) degenerates into Polynomial Aggregation when reverse aggregation is removed (A-2).

SPECTRAL-BASED GNNS (B-0)
The use of eigen-decomposition and analysis of the weight-adjusting function (i.e., frequency filter function or frequency response function) on eigenvalues of graph matrices are both parts of graph spectral theory.In spectral-based GNNs (B-0), weights are applied to frequency components (eigenvectors) in order to recover the target signal using the filter's output.Accordingly, we propose a new taxonomy for graph neural networks, dividing spectral-based GNNs into three subgroups depending on the types of response filtering functions.In addition, the same set of representative models discussed in Section 4 will be analyzed under spectral view.To facilitate comprehension of the analysis, their spatial and spectral analytical forms are listed in Table 6.The detailed transformation of equations in category B-0 is deferred to the appendix.

Linear Approximation (B-1)
Changing the weights of frequency components in the spectrum domain has been the subject of several research.The filter function's objective is to suit the intended output by adjusting eigenvalues.Many of them have been shown to be low-pass filters [130], which implies that only low-frequency components are highlighted, i.e., the first few eigenvalues are increased, while the rest are decreased.There are several studies that may be understood as changing frequency  component weights during aggregation.A linear g function is used in particular: where u  is the i-th eigenvector, and g is frequency filter function or frequency response function controlled by parameters  , with selected  lowest frequency components.The goal of g is to change the weights of eigenvalues to fit the target output.As shown in Figure 7, B-1 updates the weights of eigenvectors (u 1 , u 2 , u 2 . ..) as g  () which is a linear function.Several state-of-the-art methods introduced in Section 4 are analyzed to provide a better understanding of this scheme.

Remark:
The aforementioned methods apply linear low-pass filtering, and the only difference among them is that the bias is different (i.e., 2 for GCN, 2 for GraphSAGE, and 2+ for GIN).Therefore, we study the influence of bias on the filter function, and define a metric: which indicates the overall proportion change of each eigenvalue after applying the response function.A large adjusted value means that the filtering will enlarge the influence of the corresponding eigenvector.The range of the eigenvalue is in [0, 2) for the normalized Laplacian matrix [55].If let  be larger than or equal to 2, we have: when  is larger or equal than 2, the slope is negative, which means that the filter function is low-pass filtering: as the bias increases, the slope becomes larger, and larger weights are assigned to low-frequency spectral components.Therefore, the bias of all studies in this subsection is larger or equal to 2.

Order of Approximation (B-2)
Considering higher order of frequency, filter function can approximate any smooth filter function, because it is equivalent to applying the polynomial approximation.Therefore, introducing higherorder of frequencies boosts the representation power of filter function in simulating spectral signal.Formally, this type of work can be written as: where g(•) = P  (•) is a polynomial function.As shown in Figure 8, B-2 updates the weights of eigenvectors (u 1 , u 2 , . ..) as P  () which is a polynomial function.
Remark: Polynomial approximation, in theory, gets more accurate as the order grows [6,56,162,166,186].It's worth noting that Linear Approximation (B-1) can be thought of as a polynomial approximation of order 1.We look into polynomial approximation on the () function, comparing and contrasting all of the cases in Polynomial Approximation (B-2).Because it is difficult for any straight line to suit a jump signal, as shown in Figure 9a, linear functions cannot accurately approximate ().The situation improves dramatically when polynomial approximation is used, as demonstrated in Figure 9b.The variance will be greatly decreased if the order of the polynomial function is increased (Figure 9c).To recapitulate, higher-order polynomial approximation is more accurate than lower-order polynomial approximation, but it comes at the expense of increased computational complexity.Node2Vec/LINE/SDNE with an order of 2 have lesser approximation power than those with more than 2 layers/orders because the latter's order is predefined and can be as large as possible (e.g., ChebNet [62], DeepWalk [164], Diffusion CNN [14], Simple Graph Convolution [199]).

Approximation Type (B-3)
Despite its widespread use and experimental success, polynomial approximation only works when applied to a smooth spectral signal.Real-world signals, on the other hand, cannot be guaranteed to be smooth.As a result, the rational approximation is employed to improve the accuracy of non-smooth signal modeling.An example of a rational kernel-based technique is as follows: where g(•) = P  (•) Q  (•) is a rational function, and P, Q are independent polynomial functions.Spectral methods process graph as a signal in the frequency domain.As shown in Figure 10, B-3 updates the weights of eigenvectors (u 1 , u 2 , . ..) as g  () which is a rational function.
Remark: When the function to approximate contains discontinuities, rational function has overwhelming advantage over the polynomials or linear functions.Figure 11 illustrates the difference between rational and polynomial approximation.Theoretically, rational approximation only needs exponentially less orders than that of polynomial functions [51].

THEORETICAL ANALYSIS
In terms of volume, spatial-based approaches outnumber spectrum-based methods in the literature [2,37,204,230,234], owing to the following reasons: (1) Spectral-based methods have a much higher computing overhead than spatial-based methods, and spectral methods are less intuitive than spatial methods.(2) Spatial-based approaches are convenient for model construction and scalability.However, from a spatial and spectral perspective, there is a trade-off; neither has a major advantage over the other.This section outlines numerous viewpoints that demonstrate the merits and limitations of such views.

Uncertainty Principle: Global v.s. Local Perspectives
Spectral-based approaches decompose data into orthogonal frequency components and examine graph filtering from the spectral domain with a global perspective.Each frequency indicates a global basis: low-frequency components emphasize local weights with little variation, whereas high-frequency components are linked to significant variance in neighborhood.In other words, the Laplacian spectrum reflects topological properties: the first few eigenvalues are related with substantial community structure, whilst the last few eigenvalues indicate the graph's bipartiteness [60,61].A typical low-pass filtering function for eigenvalues is shown on the left of Figure 12, which raises small eigenvalues while decreasing adjusted values for large eigenvalues.Only low-frequency components are maintained in this scenario, and neighbors have little variance.Filtering patterns from the local neighborhood are characterized by spatial-based approaches.Most GNNs assume homophily among neighbors, so signals traversing across the neighborhood is smooth or with little variation, which is exactly the same concept as low-pass filtering.The relationship between low-pass filtering in the spectral domain (left) and its effects in the spatial domain (middle and right) is depicted in Figure 12 [60].It appears at first glance that global and local viewpoints differ greatly, but on closer inspection, they depict the same signal in very different ways: filtering in the spectral domain that does low-pass/high-pass filtering is analogous to learning which neighbors are similar/dissimilar.While it is true that the two types of observations yield similar results, this does not entail that they are identical.It is impossible to know an unknown quantity's value with absolute confidence in quantum mechanics because of the Heisenberg's uncertainty principle [73].Specifically, where Δ  and Δ  denote time spread and frequency spread, respectively.Signal concentration can also be impacted by the concentration of time and frequency.A graph representing the trade-off between a signal's localization on a graph and in its spectral domain is created, which is influenced by the uncertainty principle of quantum mechanics [5].A lower bound on the product of the two spreads is obtained by quantifying the spreads in the vertex and the spectral domain of a graph signal .This result suggests that applying spatial-based GNNs results in accuracy loss in the spectral domain, while using spectral-based GNNs results in accuracy loss in the spatial domain.In sum: (1) Global and local perspectives are strong connected: Global observation is the process of generalizing details in local settings, while localized understanding provides details of the global picture.
(2) Global and local perspectives outperform each other in their own domains: Clear local context tends to muddy the big picture, but a focus on global context diminishes the smaller picture.GNNs are able to expand the range of options with the addition of other preexisting works, The high space complexity of spectral approaches is associated with the massive memory storage required to load the full graph.If memory is sufficient, the full graph can be covered by using spatial methods.However, for a smaller graph, you can choose to cover it using samplings such as sampling of regions or paths.Stability: To create accurate, consistent results, spatial analysis methods need to apply iterative algorithms, therefore the outcomes will vary.Eigen-decomposition is a unique feature of spectral approaches if no same eigenvalues.Guidance of Choosing Spatial and Spectral Methods.The user can choose between spatial or spectral GNNs depending on their previously described properties.Distributed and Online Learning: spatial method is easily converted to distributed learning [135,178], whereas spectral is difficult to transfer.Even if it is possible to approximate spectrum technique using a neural network model [177] and then use distributed learning on a neural network [190], new nodes and edges must be retrained from scratch.Alternatively, the spatial technique can effectively manage online learning with streaming data [79].Global View: The spectral method may provide a global perspective that the spatial method lacks.In situations where the group form is not spherical, the spatial method may disregard this influence and continue to follow the circular shape as a prospective group [15,90].This can be remedied by employing clustering before to the spatial technique [179].This also renders spatial methods more locally interpretable and obscures their global perspective.

Comparison between Linear, Polynomial and Rational Methods
Linear methods (A-1 and B-1) have a time complexity of O ( 2  ) due to the matrix multiplication of A X. Accordingly, polynomial and rational method are analyzed in Table 8 where K is the order number.To compare their expressive power, the convergence rate on challenging jump signal is employed as a benchmark [51] (a smooth signal cannot distinguish them).As shown in Table 8, rational methods (A-3 and B-3) converge exponentially faster than linear methods (A-1/B-1), and polynomial methods (A-2/B-2) converge linearly faster than linear methods (A-1/B-1).Therefore, there is a trade-off between the expressive power and computational efficiency.linear methods (A-1/B-1) have the best efficiency but only capture the linear relationship.Rational methods (A-3/B-3) consume the most considerable overhead but could tackle more challenging signals.Experiments are undertaken to highlight our theoretical analysis of spatial and spectral approaches by comparing the differences between the three underlying groups.We chose one typical technique for linear filter [114], polynomial filter [62], and rational filter [25].Note that we only distinguish them in the function of A or Λ, keeping all the other configurations the same.The dataset includes representative homophily and heterophily datasets [43,75,105,132,144,214].The evaluation code is released2 .The implementation is based on the official Pytorch Geometric [1].Each model on each dataset is evaluated 50 times, and the results are averaged.
As shown in Figure 13 (Left), there is no significant difference in classification accuracy in the homophily dataset between the three models, with the exception of ChebNet, which performs marginally better in PubMed and encounters an out-of-memory error in physics dataset.ChebNet exhibits inferior accuracy on the Computers dataset, whereas GCN is significantly superior.Figure 13 (Right) illustrates that ARMA consistently outperforms the others when performing the same task on the heterophily dataset, while ChebNet consistently outperforms GCN.This demonstrates the benefits of the advanced filter function, as theoretically analyzed in Section 3.2.In terms of runtime, as shown in Figure 14, ChebNet is often beyond log 1, while GCN never goes beyond log 1. Rational method involve more matrix operation as ChebNet does, so rational method sometimes goes beyond log 1, but due to optimization in implementation (iterative algorithm), which makee it a little faster than ChebNet, but still consistently slower than GCN.In terms of runtime, as shown in Figure 14, ChebNet frequently exceeds log 1, whereas GCN never does.Rational method involves more matrix operations than ChebNet, so it sometimes exceeds log 1.However, due to optimization in implementation (iterative algorithm), the rational method is slightly faster than ChebNet but consistently slower than GCN.This verifies our analysis in Section 6.2.

EXEMPLIFY THE PROPOSED FRAMEWORK
Over-smoothing and large-scale difficulties are two of the most difficult issues for existing GNNs, and numerous recent publications have proposed various approaches to address them.We'll show in this section that all of the enhancements are still covered by our framework.

Sampling Point of View
The sampling mechanism is used as a spatial method for managing large graphs.Subgraph sampling and random walk are popular approaches.
7.1.1Subgraph Sampling.For an early work, GraphSAGE [87] applies uniform node sampling for each batch, which is equivalent to subgraph sampling.The likelihood of transfer therefore follows random normalization (i.e., Ã = D −1 A), which makes it part of Linear Aggregation (A-1).
In the majority of follow-up works, the same methodology is used [227]: (1) build a local graph convolution for the input graph.(2) sample nodes in each layer, and (3) optimize parameters in graph convolution.Steps (2) and (3) proceed iteratively to update the weights via stochastic gradient descent [46,48,77,100,218,227].
To avoid the recursive neighborhood expansion, FastGCN [46] treats graph convolutions as integral transformation of embedding functions and proposes Monte Carlo approach to estimate the integral.FastGCN employs importance sampling independently for each layer and reduce variance cutting down the number of sampling nodes to constant size for all layers, exponentially shrinking the computational cost.FastGCN is proved to be importance sampling, which is better than uniform sampling, but still suffers from unstable learning when no neighbors is selected for one node and activation is zero.To avoid taxing calculation of activation, Stochastic GCN [48] further uses the historical activation in the previous layer to avoid redundant re-evaluation.With adaptive sampling, nodes on subsequent layers are sampled in order to speed up GraphSAGE and FastGCN [100].Learnable graph convolutional layer (LGCL) [77] selects a fixed number of neighboring nodes for each feature based on value ranking, and transform graph into 1-D data which is compatible with normal convolution networks.Similarly, A scalable GCN samples a fixed number of nodes, with different sampling policy called frontier sampling (FS).FS maintains a constant size frontier set consisting of several vertices which is randomly popped out with a degree based probability distribution [226].Cluster-GCN [53] samples a community of nodes determined by a graph clustering algorithm, and compute the graph convolution within each community.

Random Walk.
To derive node-level representations with word2vec [151][152][153], various random walk algorithms are proposed [85,164,184,192,217,227].Paths are viewed as complete sentences, and nodes are viewed as individual words.Transition probability among nodes approximates to a random walk normalized adjacency matrix if enough random walks or uniform sampling have been performed on the paths.
As analyzed in previous section, DeepWalk [164] draws a number of random paths from the graph, which makes the transfer probability of random walk is, i.e., Ã = D −1 A. Let the window size of skip-gram be 2 + 1 and the index of current node is  + 1.Therefore, the updated representation is as Z = 1  +1 P( Ã) X (Equation 23).One popular word2vec configuration, i.e., skip-gram with Negative Sampling (SGNS), assumes a corpus of words w and their context c.Following the work by Levy and Goldberg [125], SGNS is implicitly factorizing: where || and  represents edge number and step size, respectively.Therefore, the target matrix to decompose is still a polynomials of A. Node2Vec [85] defines a 2nd order random walk to control the balance between Breath First Search (BFS) and Depth First Search (DFS).Assuming the random walk is sufficiently sampled, Node2Vec's second order can be rewritten to decompose matrix [167]: where Node2Vec is demonstrated to be polynomial methods.LINE [184] and SDNE [192] learn the node representations within the first-and second-order neighbors, which can be treated as unconstrained version of Node2Vec: log P( Ã) − log() = log || Ã − log .
GraphSAINT [227] employs multiple sampling polices but the best one is random walk.Pin-SAGE [223] improves the efficiency of GraphSAGE [87] by taking the top several neighbors with highest normalized visit counts.
Remark: Sampling methods, as a spatial methodology, seek both variance reduction and efficiency.Subgraph and random walk are equal with enough samples since they traverse the entire network with the transition probability associated with graph connection.A-3 and B-3 do not, however, include any sample methods, mainly due to their higher computational complexity.Space complexity may be improved when sampling methods are used, but there is no guarantee that the time complexity will be much reduced because of the enormous number of steps may be needed before convergence.

Over-smoothing Point of View
We carve out two conditions under which neighborhood aggregation is not helpful: (1) when a node's neighbors are highly dissimilar and (2) when a node's embedding is already similar to that of its neighbors.Most GNNs perform poorly when stacking many layers, which is called the over-smoothing issue.Many related works aiming to solve the over-smoothing issue [28,44,50,99,126,127,138,154,160,172,210,212,232,235,240] can be reduced to one category of our proposed framework.H2GCN [240] proposed a method that combines direct neighbors with higher-order, which is equivalent to Polynomial Aggregation (A-2).Deep GCN [126,127,212] developed a model with a residue module, dense connection, and dilated aggregation, which learns the weights of all different orders of neighbors.This is equivalent to Polynomial Aggregation (A-2).GPR [54] generalize PageRank and found the equivalence of GPR and Polynomial Approximation (B-2).JKNet [210] also follows the same residue methodology as Deep GCN.DAGNN [138] stacks multiple layers which uses different orders of propagation with the learnable weights, which makes it belong to Polynomial Aggregation (A-2).PairNorm [232] presents a two-step method that includes centering and re-scaling, which mitigates the over-smoothing from graph convolution.Therefore, PairNorm is equivalent to Rational Propagation (A-3), since re-scaling is similar to do propagation and restart at the same time.[28] design an adaptive method to dynamically adjust the weights between low-frequency and high-frequency components, resulting in two peaks in the spectral domain.This could also be modeled by Rational Aggregation (A-3) with its accuracy in jump signals.DropEdge [99,172] randomly drops a certain number of edges to avoid over-smoothing, which can be categorized as Rational Aggregation (A-3) since dropping edge prevents the propagation and thereby provides a probability of keeping the original values of nodes.GCNII [50] applies initial residual which combines the smoothed representation with an initial residual connection to the first layer, and identity mapping, which adds an identity matrix to the weight matrix.Initial residual is a trick that PPAP [115] uses, which enables itself to retard the over-smoothing by keeping partial previous representations.Identity mapping further remains one original representation to slow down the spreading of over-smoothing propagation.Remark: A-1 lacks state-of-the-art methods, implying it is vulnerable to over-smoothing.Applying A-1 many times with learnable weights for different ordering equates to A-2.So A-2 might balance low (raw representations) and high orders (smoothed representations).However, too many A-1 operations may result in over-smoothing.So, for A-2, precise order configuration is required.No matter the number of orders or layers, A-3 reserves a proportion of the final representation as raw representation, making itself robust to over-smoothing.

LIMIT, OPEN CHALLENGES AND CONCLUSION
In this paper, we present a unifying paradigm for comprehending GNNs created under various processes.Our study shows that the subcategories are closely related via generalization and specialization links within their domains, as well as equivalence ties across domains.We show the framework's generalization power by reformulating existing GNNs models.As introduced in the sections above, spatial methods are designed by various ideas, and they can be interpreted well by the unified spectral theory.Therefore, we will discuss the potential that spectral theory, as a unfied theoretical framework, may also extend to emerging directions.
Despite the an increasing number of emerging GNNs models [68] made in recent years, spectral methods so far has been intensively studied in node-level graph convolution only, leaving the other graph learning problem uncovered.In recent years, graph learning has been successfully extended to various tasks such as (sub)graph-level tasks, combinatorial optimization, explainability, domain application (brain, PDE solver, circuit, molecule, protein), generative graph, graph transformer, contrast learning, heterogeneous graph [18,86,142,159,222].However, a unified framework is lacking, which is due to the underlying theories from spectral graph theory and graph signal processing applied in node-level graph convolution application by the first few pioneer work [62,89,114], but little attention has been paid to the theoretical study on the emerging topics rather than node-level convolution.Therefore, most new topics in graph machine learning is based on intuitive design or isolated theory, lacking a unified framework to compare and understand these separate learning models.In this section, we will list the possibility that all the selected topics can be looked at through a unified framework.
Theoretical Understanding -The vast majority of recent research has examined spatial and spectral approaches individually, and even from a theoretical standpoint.[111,194,209].However, it is unclear how these two distinct interpretations can be connected.This survey demonstrates that the theoretical advantage of rational function over the others may be proved, however it is currently unclear how the learning approach can be optimally structured to execute this advantage.Using the partial differential equation [183], in which the diffusion equation and wave equation are analogous to polynomial and rational filters, provides an alternate viewpoint to integrate spatial and spectral views.Directed Graph -Most contemporary work, especially spectral methods, only handle undirected graphs, due to symmetric graph matrix is readily available with off-the-shelf techniques.Many related works integrate or sum up asymmetric adjacency matrices from bi-direction into symmetric matrix, which avoids decomposing asymmetric matrix directly.It is possible to decompose asymmetric matrix by some techniques such as the Jordan norm [108], asymmetric matrices can be used to express properties such as graph and filter complexity.Directed graphs and their decomposition can also be achievable in a certain type of geometric space called a graph manifold.As shown in section 2.4 of [123], spectral filters can be defined on directed graphs represented by non-symmetric adjacency matrices with Hermitian transpose.Dynamic Graph -Existing work model with GNNs and RNNs is built using graph convolutional networks and recurrent neural networks, which lack transparency.Due to the limited expressive power of RNNs, the task is constrained to prediction, and it is unable to perform long-term sequence processing well.Graph dynamics has a large number of valuable tasks, such as inferring the structure of a graph, constructing a joint dynamic of structure and attributes, and exploring the connection between structure and mass flow.It is possible to use graph spectra to detect the patterns in dynamics of graph [36,147,181].Due to the challenge of the long-dependency of a path, the spectral method can also provide a potential way to model trajectory prediction, as the spectra of trajectory implicitly reveal the relationship with the whole graph [40].Higher-Order Interaction and Combinatorial Optimization -Most current work falls into firstorder relationship at node-level.For example, the most spatial method is to learn the relationship between the current node and neighbors, but higher-order interaction is either ignored or implicitly included.Existing explainable learning also focuses on the neighbors' identification [225].Also, the existing work pays the most attention to neighbors, but remote connection or higher-order relationship with the other nodes receives little attention.Hypergraphs provide a possibility to model the combinatorial effect [16,72,221] Hodge Laplacian [134,171] and simplicial complex [19,23,27] provide more theoretical tools for modeling this combinatorial effect.Multilayer Network -In numerous realistic biological and engineering systems, however, the units can be interconnected and interdependent via multiple interdependent and heterogeneous networks.Failure of interdependent nodes between linked networks may result in cascading failures inside and across the networks.Similar interactions exist in many cyber-physical systems, where the spread of misinformation about infectious diseases via social media can result in risky daily plans at the group level, resulting in an epidemic outbreak.Such interconnected networks can be represented by multilayer networks that produce new degrees of freedom via coupling interactions.Such "new physics" is prevalent in multilayer systems, but they are still poorly understood [8,26,59,65].Graph neural network research on this crucial area is scarce [83].One popular technique to generalize principles from monolayer networks to multilayer networks is to "flatten" adjacency tensors into matrices (called "supra-adjacency matrice"), and spectral theory is available and worth researching.

Fig. 1 .
Fig. 1.Illustration of major graph neural operations and their relationship.Spatial and spectral methods are divided into three groups, respectively.Groups A-1, A-2, and A-3 are strongly-correlated by generalization and specialization, so are groups B-1, B-3, and B-3.The equivalence among them is marked in the same color.

Fig. 4 .
Fig. 4. Illustration of A-2: This current node (red) is using its original representation plus its first and secondorder neighbors to update itself.

Fig. 5
Fig.5.Illustration of A-3: These current nodes (red) are using the representation that predates this iteration and the surrounding nodes to compute the total.In A-3, the ratio of the original representation remains stable, whereas A-1 dose not control the ratio.

Fig. 8 .
Fig. 8. Illustration of B-2: A polynomial function P maps the eigenvalues to new values.

Fig. 10 .
Fig. 10.Illustration of B-3: A rational function maps the eigenvalues to new values.

Fig. 12 .
Fig. 12. Connection between the Spatial and Spectral Perspective

Table 1 .
Comparing this study with previous studies

Table 3 .
Representations for graph topology

Table 4 .
Chronological list of notable GNNs in spatial and spectral domains

Table 5 .
Structure of section 4 and 5.

Table 6 .
Summary of Representative GNNs

Table 7 .
[29,133,189,228,240,245]patial (A-0) and Spectral (B-0) Methods gap between global and local views or between spectral and spatial information, to improve the expressive potential of GNNs[29,133,189,228,240,245].According to the information above, we can state that no model can be flawless from a global or local perspective.It is only possible to have a proper trade-off between.As shown in Table7, four aspects are compared: Methodology: Spatial approaches describe local regions while working bottom-up, and identify global patterns using a graph frequency approach.On the other hand, spectral methods work top-down, beginning with a graph and ending with a global observation.Computation: To use spatial approaches, one has to carry out a number of steps on their local region before convergence is achieved.With spectral approaches, you can get a critical component with a single-step computation.Space Complexity:

Table 8 .
Comparison on Time Complexity and Expressive Power