Low-bit Quantization for Deep Graph Neural Networks with Smoothness-aware Message Propagation

Graph Neural Network (GNN) training and inference involve significant challenges of scalability with respect to both model sizes and number of layers, resulting in degradation of efficiency and accuracy for large and deep GNNs. We present an end-to-end solution that aims to address these challenges for efficient GNNs in resource-constrained environments while avoiding the oversmoothing problem in deep GNNs. We introduce a quantization-based approach for all stages of GNNs, from message passing in training to node classification, compressing the model and enabling efficient processing. The proposed GNN quantizer learns quantization ranges and reduces the model size with comparable accuracy even under low-bit quantization. To scale with the number of layers, we devise a message propagation mechanism in training that controls layer-wise changes of similarities between neighboring nodes. This objective is incorporated into a Lagrangian function with constraints, and a differential multiplier method is utilized to iteratively find optimal embeddings. This mitigates oversmoothing and suppresses the quantization error to a bound. Significant improvements are demonstrated over state-of-the-art quantization methods and deep GNN approaches in both full-precision and quantized models. The proposed quantizer demonstrates superior performance in INT2 configurations across all stages of GNNs, achieving a notable level of accuracy, whereas existing quantization approaches fail to reach satisfactory accuracy levels. Finally, inference with INT2 and INT4 representations exhibits speedups of 5.11× and 4.70× compared to the full-precision counterparts, respectively.


INTRODUCTION
Data analytics and machine learning on large graphs encompass a wide array of applications, including recommender systems, social networks, web analysis, and computational biochemistry. Some tasks within this scope include node classification, community detection, link prediction, reachability analysis, and influence optimization. Recently, Graph Neural Networks (GNNs) have shown to be effective for learning over graphs [59]. GNNs utilize an iterative process, aggregating features from neighboring nodes through learnable parameters, thus generating rich and informative embeddings.
The versatility of GNNs often comes at a price: elevated memory and computation demands. This poses challenges when scaling up to larger graphs and deeper models. Large-scale graphs naturally increase the storage costs and the neighborhood size during the aggregation phase. While deeper models, with more iterative layers, add computational strain, they do capture intricate relationships by broadening the nodes' receptive fields. To counter these, recently, quantization approaches were developed to compress both the model and graph, aiming to reduce storage, computation, and power requirements for inference workloads [11,16,51,67].
Quantization is the process of mapping continuous numerical values into smaller sized representations (e.g., using 8 bits). There are a variety of quantization methods developed for data-intensive tasks, ranging from multi-dimensional indexing for range and similarity queries [6,14,17,52,56,57,61] to processing convolutional and recurrent neural networks [33,55,60]. For GNNs, quantization is useful in many practical settings, such as resource-efficient representation learning, reducing energy and communication in sensors, IoT and mobile devices, on-device and embedded learning, managing data/models in distributed and edge computing, and recommender systems, which commonly employ graph neural networks.
Unlike conventional applications of quantization, GNNs present unique challenges due to their intrinsic characteristics, which are not effectively addressed by current methods. (i) The process of neighborhood aggregation in GNNs can lead to significant variance in high in-degree node embeddings, thereby exacerbating the quantization error, especially in low-bit cases [51]. (ii) As GNNs deepen, they tend to experience the "oversmoothing" issue, where each embedding loses its discriminative information due to repeated, unregulated message passing [19]. It is important to understand whether this problem remains or is aggravated with the introduction of model quantization. Thus, while reducing GNN size and enabling compressed processing are pivotal for performance efficiency, addressing oversmoothing is crucial to ensure accuracy, especially in deeper models.
While recent studies [27,51,67] have delved into GNN quantization, the problem is far from being solved and there is no effective solution for low-bit quantization that scales to deeper GNNs. Our paper underscores this challenge, revealing that state-of-the-art GNN quantization methods undergo significant degradation at low bit counts (INT4 and INT2). This is more pronounced in deeper GNNs, due to accumulated layer-by-layer quantization errors. We aim to address these intricacies and develop an end-to-end solution.
Our solution involves a quantizer that learns the quantization ranges (QLR) along with a skewness-aware bitwise truncation (BT*) mechanism. Additionally, we introduce a smoothness-aware message propagation scheme (SMP) to counter the oversmoothing issue in quantized models. This quantizer determines an optimized, data-aware learnable range grounded in the input data distribution, thereby minimizing model redundancy. It is shown to retain its effectiveness with low-bit representations, which makes it apt for large deep GNNs. The skewness-aware truncation embedded within the quantizer improves the accuracy, particularly in low-bit (INT2) scenarios. Our message propagation scheme aims to mitigate oversmoothing in deep GNNs by constraining the layer-wise shifts in similarities among neighboring nodes. Furthermore, we prove that by using SMP, the quantization error can be suppressed to a bound. Finally, we demonstrate the efficiency and accuracy of our solution through node classification accuracy on quantized GNN models.
Experimental results demonstrate improvements over the state-of-the-art approaches across various performance measures and workloads. Specifically, our quantizer (QLR) demonstrates remarkable advancements in low-bit quantization, outperforming existing quantization methods while resulting in reduced model sizes. For deeper GNNs, our SMP method delivers more accurate classification compared to other deep GNN approaches, both in full-precision and quantized versions. The low-bit quantized SMP, using QLR, achieves greater improvement over alternative deep quantized GNN approaches with the help of the quantization error bound with SMP. BT* improves node classification accuracy on large datasets with INT2 representation, making it comparable to INT8 accuracy. We also show that the INT2 quantization model can yield an inference speedup of 5.11× compared to the full-precision model.

RELATED WORK
Quantization has been commonly employed for neural network (NN) models [20]. NN training is bottlenecked by high memory requirements to handle large data involving intermediate results and feature maps [2]. NNs can be trained with low precision using dynamic fixed point arithmetic [9].
Quantization for neural networks can be performed during or after training. The post-training approaches quantize weights or activations of neural networks on a pre-trained full-precision model [4,24]. Their low-bit quantization performance incurs significant accuracy degradation. Quantization-aware training aims to avoid this performance degradation [5,12]. A useful technique is to expose errors from the quantization operation to the forward pass at model training and use a straight-through estimator (STE) to compute the gradients [5]. Banner et al. [3] provide a theoretical analysis showing considerable room for quantization under a Gaussian weight assumption, leading to 8-bit DNNs with comparable accuracy. The success of quantization has led to binary NNs (BNN) drastically reducing computation and memory requirements using hardware-supported bitwise operations with strong precision performances [28]. The efficacy of high-order bit representations, involving bitwise truncation applied to 32-bit word embeddings, has been demonstrated in previous studies [8].
GNN quantization has started to receive attention in recent years. Tailor et al. [51] propose quantization-aware training for GNNs, where high in-degree nodes are selected for full-precision training while all other nodes are converted to INT8/INT4. This can achieve reasonable accuracy, especially on INT8 models. Huang et al. [27] employ product quantization to compress input data but do not address the more challenging task of quantizing parameters. A recent GNN quantization approach [67] addresses low-bit representation of the weights and input features by learning parameter vectors whose sizes equal the weight dimension and the number of input nodes, respectively, while leaving the core message propagation unquantized. However, this approach necessitates the learning of parameters that scale proportionally with the number of input nodes, resulting in considerable storage and space overheads. Neural Architecture Search (NAS) has been used to span possible quantization levels, suggesting INT4 weights and INT8 activations as an effective strategy for GNNs [64]. Recent studies adapt binary NN methods for GNNs [1,54], offering a trade-off between time/space efficiency and classification accuracy. These methods typically either need an additional teacher model for knowledge distillation or learn binary weights for each layer's input message, which require higher storage and computational load than a typical quantization-based approach.
Towards addressing oversmoothing in deep GNNs, Liu et al. [35] propose the Elastic Graph Neural Network with long-range information propagation using ℓ1- and ℓ2-based graph smoothing. APPNP [19] addresses oversmoothing with a propagation scheme based on an approximation of personalized PageRank. Zhu et al. [66] propose low-pass and high-pass filtering kernels which have empirically reduced the effect of oversmoothing. DropEdge [42] aims to address oversmoothing by dropping a number of edges, which can be interpreted both as a data augmentation method generating random deformed graphs and as a message passing reducer sparsifying edge connections. PairNorm [63] quantifies oversmoothing and proposes a two-step center-and-scale normalization layer to prevent nodes from converging to similar representations. Compared to enforcing local smoothness, our method constrains the layer-wise message propagation to counteract oversmoothing, which achieves performance improvements over the prior approaches, as also demonstrated in our experiments.

PRELIMINARIES AND ANALYSIS
We first provide the technical background, covering quantization for GNNs, analysis of quantization errors, and the oversmoothing problem in GNNs.

GNN Basics
Let G = (V, E, X) denote a graph with node set V (|V| = n), edge set E, and initial node feature matrix X. H^l = [h_1^l, ..., h_n^l]^⊤ is the node feature (embedding) matrix for layer l ∈ {1, ..., L}, where L represents the number of layers, and h_i^l ∈ R^{d_l} is the feature vector for node v_i ∈ V, with initial H^0 = X. The adjacency matrix of G is a binary matrix A ∈ R^{n×n}, where A(i, j) = 1 if the edge between nodes v_i and v_j exists ((v_i, v_j) ∈ E), and 0 otherwise.
GNNs comprise a sequence of layers with three main functions for each layer: message, aggregate and update. This framework is generally called Message Passing NNs (MPNN) [21]. Each message, which is a flow of data from a node's neighbors, is aggregated and joined with the existing embedding to form a new one for the respective node, as given in Equation 1, where N_i denotes the neighboring nodes of node v_i:

h_i^l = update(h_i^{l−1}, aggregate({message(h_j^{l−1}) : v_j ∈ N_i})).   (1)

The feed-forward iteration starts with the initial embeddings, h_i^0 = x_i.
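As a concrete illustration, one message/aggregate/update iteration can be sketched as follows. This is a minimal NumPy sketch with mean aggregation and a concatenate-then-transform update; these particular choices are illustrative, not the definition of any specific architecture discussed here.

```python
import numpy as np

def mpnn_layer(H, adj, W):
    """One message-passing layer: aggregate neighbor messages, then update.

    H   : (n, d) node embeddings from the previous layer
    adj : adj[i] is the list of neighbor indices of node i
    W   : (2d, d_out) learnable weight matrix for the update step
    """
    n, d = H.shape
    msgs = np.zeros_like(H)
    for i in range(n):
        if adj[i]:
            # message = neighbor embedding; aggregate = mean over neighbors
            msgs[i] = H[adj[i]].mean(axis=0)
    # update: join the existing embedding with the aggregated message
    return np.maximum(np.concatenate([H, msgs], axis=1) @ W, 0.0)  # ReLU
```

Stacking L such layers, starting from H^0 = X, broadens each node's receptive field by one hop per layer.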
Various GNN architectures have been proposed in the literature, essentially varying the message, aggregate and update functions [59]. We consider the popular GCN (Graph Convolutional Network) architecture, whose update function is given in Equation 2 with activation function σ and learnable weight matrix W^l [32]:

H^l = σ(D̂^{−1/2} Â D̂^{−1/2} H^{l−1} W^l),   (2)

where Â = A + I is the adjacency matrix with self-loops and D̂ is its diagonal degree matrix.
Recently, a different perspective on common GNN models was proposed by Ma et al. [36], where the authors unified different GNN models, such as GCN, GAT, PPNP, and APPNP, by posing them as solutions to the graph signal denoising problem

argmin_H ‖H − X‖_F^2 + λ tr(H^⊤ L H),

where L is the normalized Laplacian matrix and λ controls the strength of the smoothness prior. Following this, the EMP (Elastic Message Passing) [35] method was proposed, enabling ℓ1-based smoothing constraints on GNNs.

Challenges with Deeper GNNs
While in traditional ML, deeper models can extract more powerful representations, for GNNs this inherently leads to several major challenges. First, as the depth increases, GNNs demand exponentially more computations and larger storage to be managed and processed, which makes their deployment on resource-constrained platforms more challenging. We seek to design an inference-friendly quantizer, i.e., performing inference directly on quantized elements with high accuracy. Second, deeper GNNs suffer from the oversmoothing problem, where node representations converge to indistinguishable embeddings, degrading the accuracy of downstream tasks. It was shown that GCN exponentially loses its expressive power for node classification tasks in many practical cases [39].
There are some proposals towards mitigating the oversmoothing problem for full-precision models [19,34,35,42,63], as discussed in Section 2, including DropEdge, PairNorm, APPNP and EMP. Our experiments confirm that DropEdge and PairNorm are particularly ineffective for low-bit quantization. These methods do not consider the smoothness of message propagation amongst layers, resulting in accuracy drops and unrestricted quantization error, especially in low-bit cases. In contrast, we seek layer-wise smoothness by enforcing constraints at message propagation, restricting the quantization error, and denoising the message passing procedure, leading to enhanced accuracy in low-bit quantization.

Quantization Basics
Quantization is the process of mapping continuous data, e.g., parameters, weights and activations of neural networks, to smaller sized representations. In the scope of our analysis, we denote U (e.g., H^l) as a high-precision tensor-valued random variable with probability density function f_U(u). A tensor is commonly quantized between its maximum and minimum observed values [29]. Considering observed values u ∈ [α, β] and the corresponding b-bit quantized values u_q ∈ [α_q, β_q], the quantization function is given by

u_q = ⌊u/s⌉ + z_p,   (5)

where s = (β − α)/(β_q − α_q) is the scale, ⌊·⌉ denotes the round function, and z_p = ⌊(β α_q − α β_q)/(β − α)⌉ is the zero point. The corresponding de-quantization function is

û = (u_q − z_p) · s.   (6)

The range [α, β] is usually partitioned into 2^b equal interval regions with a quantization step Δ = (β − α)/2^b. When the de-quantized value in Equation 6 is represented as Û, the mean squared error (MSE) between U and Û is given by

E[(U − Û)^2] = ∫_{−∞}^{α} (u − α)^2 f_U(u) du + ∫_{β}^{+∞} (u − β)^2 f_U(u) du + ∫_{α}^{β} (u − û)^2 f_U(u) du.   (7)

The MSE consists of three terms. The first two terms are overload distortion caused by clipping the values of U beyond [α, β]. The third term is granular distortion led by the quantization step Δ. For any u ∈ [α, β], its granular distortion is in [0, Δ^2/4]. Therefore, it can be reduced by setting an appropriate Δ based on the distribution of U. This becomes particularly critical in GNN quantization, as GNNs show large variance at aggregated values [51].
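For reference, such a quantize/de-quantize pair can be sketched as follows; this is a generic uniform affine quantizer over an observed range [α, β], not a specific implementation from this work. For in-range values, the round-trip error stays within half a quantization step.

```python
import numpy as np

def quantize(u, alpha, beta, bits=8):
    """Uniform affine quantization of u, clipped to the range [alpha, beta]."""
    q_lo, q_hi = 0, 2 ** bits - 1              # unsigned b-bit code range
    s = (beta - alpha) / (q_hi - q_lo)         # scale (quantization step)
    zp = int(round(q_lo - alpha / s))          # zero point
    u_q = np.clip(np.round(u / s) + zp, q_lo, q_hi)
    return u_q.astype(np.int64), s, zp

def dequantize(u_q, s, zp):
    """Map integer codes back to the real line."""
    return (u_q - zp) * s
```

Values outside [alpha, beta] are clipped (overload distortion), while in-range values incur only rounding (granular distortion).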

Challenges with GNN Quantization
Compared to quantizing CNNs, GNNs involve more types of elements to be quantized, with complex interdependencies. These elements include inputs of each layer, weights, messages between nodes, inputs and outputs of the aggregation stage, and outputs of the update stage. Since the variance of the updated features after propagation in GNNs is high due to the varying number of neighbors [51], it is particularly challenging to design a low-bit uniform quantizer. Tailor et al. [51] use percentiles to manually decide the quantization range [α, β], and a weighting parameter to perform a weighted average of the statistics of tensors during training. We empirically observe that the accuracy is highly sensitive to the setting of the percentiles and this weighting parameter, which increases the difficulty of obtaining accurate results, especially using a low number of bits. Zhu et al. [67] learn the quantization step size for each node of the input features and each dimension of the weights, respectively. However, as it does not quantize the message propagation part, the resulting model size and computations are significantly larger. Moreover, learning parameters per node yields a higher model parameterization and limits its inductive capabilities, including mini-batch training. It is akin to applying n instances of learned step size [15], where n is the number of nodes in the graph. To address these challenges, we introduce a quantizer with learnable ranges (QLR) which determines the quantization range and is also friendly for mini-batch training on large datasets.

LOW-BIT QUANTIZATION FOR GRAPH NEURAL NETWORKS
This section describes our solution for quantization with learnable ranges (QLR) and a skewness-aware bitwise truncation (BT*) that captures the underlying data distribution to preserve accuracy with low-bit representations.

Quantization with Learnable Range
GNNs involve various components such as layer activations, weights, messages, and inputs/outputs of the aggregation and update stages. We aim to quantize all of the aforementioned components to reduce the model size while maintaining high accuracy during inference.
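To make the mechanism concrete before the formal treatment, below is a minimal, hypothetical sketch of a quantizer whose range is a learnable parameter, together with a straight-through-estimator (STE) gradient for that parameter. It uses a symmetric per-tensor range [−β, β] for brevity; the actual QLR formulation described in this section learns the range jointly with the task loss and is not restricted to this simplification.

```python
import numpy as np

def qlr_forward(u, beta, bits=2):
    """Fake-quantize u with a learnable symmetric range [-beta, beta]."""
    q_max = 2 ** (bits - 1) - 1        # e.g. 1 for INT2, 7 for INT4
    s = beta / q_max                   # scale follows the learned range
    u_q = np.clip(np.round(u / s), -q_max, q_max)
    return u_q * s                     # de-quantized output

def qlr_grad_beta(u, beta, bits=2):
    """STE gradient of the output w.r.t. the range parameter beta.

    Inside the range, only the rounding residual depends on the scale;
    at saturation the output tracks +/-beta directly, so the gradient
    is sign(u).
    """
    q_max = 2 ** (bits - 1) - 1
    v = u * q_max / beta               # u / s
    inside = np.abs(v) <= q_max
    return np.where(inside, (np.round(v) - v) / q_max, np.sign(u))
```

During training, beta would be updated by gradient descent on the task loss through this gradient, shrinking or expanding the range to trade clipping (overload) against rounding (granular) error.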
According to the quantization error analysis in Section 3, the quantization range directly determines both the overload and the granular distortion, so we make it learnable. With the learnable quantization range [α, β], the quantization function can be updated as

u_q = ⌊u/s̃⌉ + z_p,   (8)

where s̃ = (β − α)/(β_q − α_q) is the updated, learnable scale, while the zero point stays the same, z_p = ⌊(β α_q − α β_q)/(β − α)⌉. The de-quantization function is modified accordingly, û = (u_q − z_p) · s̃. To optimize β at the backward propagation, the Straight-Through Estimator [5] can be used to calculate the gradient of β, treating the round operation as the identity function. QLR learns a scale relative to the quantization range of observed values and allocates the limited quantization budget to the remaining observed data points while accounting for the final task. Notably, this is different from learned step size quantization (LSQ) [15], which optimizes the step size over the full observed values. We have empirically observed that LSQ tends to be highly sensitive to the learning rate. This means that achieving satisfactory accuracy often requires an exhaustive search for the proper hyperparameters. The challenges with LSQ are further amplified because, in GNNs, the value ranges can differ significantly across layers, leading to uneven convergence rates between them.

Quantization error analysis for QLR. Given a value u ∈ [α, β], its quantization error can be written as

ε(u) = s̃ · (u/s̃ − ⌊u/s̃⌉).

It comes from two sources, s̃ and u/s̃ − ⌊u/s̃⌉, which are the quantization level and the distortion caused by the rounding operation, respectively. Figure 1 shows the distortion error and the total quantization error.

A basic approach for INT2 can be simply keeping the most (two) significant bits of a higher-precision output (e.g., INT8, INT4), as

u_{b2←b1} = ⌊u_{b1}/s_0⌉,

where b1 and b2 (b1 ≥ b2) are the numbers of bits used for quantization, b2←b1 means the b1-bit quantized representation being truncated into b2 bits, u_{b1} denotes the b1-bit representation obtained with Equation 8, and s_0 is the scale for truncating the low-significant bits depending on b1 and b2. s_0 can be obtained as

s_0 = Q_{b1}/Q_{b2},

where Q_{b1} and Q_{b2} are the quantization levels for b1-bit and b2-bit quantization, respectively. While such a formulation implicitly assumes the uniformity of U, for GNNs this can significantly vary depending on the graph topology. Measures such as kurtosis and skewness [22] can be employed to better capture information about the normality and symmetry of the distribution, respectively. While prior methods assumed that neural network activations follow close-to-normal distributions, we have empirically observed that, for GNNs, these have relatively large kurtosis and are rather asymmetrical in low-bit quantization (Figure 2). All of this brings further challenges in employing bitwise truncation (BT) for GNNs.
We formulate a data-aware truncation mechanism that accounts for the skewness of the input data. Skewness-aware BT (BT*), defined in Equation 14, adjusts the truncation scale s_0 according to the skewness κ_s of the input tensor U, and can thereby capture the abnormal distribution of quantized elements even under low bits; the κ_s values for BT* fall in the (−1.0, 1.0) range. As a result, using the skewness κ_s in the bitwise truncation process, as provided in Equation 14, maintains the symmetry of the quantized elements while ensuring their normality.
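The two ingredients of this subsection — keeping the high-order bits of a higher-precision code, and measuring the skewness that BT* conditions the truncation on — can be sketched as follows. The exact skewness adjustment of Equation 14 is not reproduced here; this only illustrates the building blocks.

```python
import numpy as np

def truncate_bits(u_q, b1, b2):
    """Plain BT: keep the b2 most significant of b1 bits (unsigned codes)."""
    assert b1 >= b2
    return u_q >> (b1 - b2)

def skewness(x):
    """Sample skewness of a tensor's values (third standardized moment)."""
    x = np.asarray(x, dtype=np.float64).ravel()
    mu, sd = x.mean(), x.std()
    return 0.0 if sd == 0 else float(((x - mu) ** 3).mean() / sd ** 3)
```

A symmetric distribution has skewness 0; heavily one-sided GNN activations yield large positive or negative values, which plain truncation ignores but BT* accounts for.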

LAYER-WISE SMOOTHNESS-AWARE MESSAGE PROPAGATION
In GNN learning, each node's feature consists of a true signal, which relates to its class, and a noise component. The essence of message passing is to increase the signal-to-noise ratio by adaptively aggregating node features. However, unexpected or out-of-distribution features from a neighboring node, possibly of a different class, can adversely affect the goal of enhancing the signal-to-noise ratio. In the asymptotic case, aggregating features from different classes can cause a blending of true features, resulting in oversmoothing. Layer-wise smoothness, which preserves locality between layers of a GNN, can be helpful in achieving deeper GNNs [35,36]. In Section 4.1, we introduced a quantizer that reduces the quantization error by learning an optimal quantization range. We also need to ensure its efficiency with respect to increasing model depth. From our empirical analyses, it is evident that the observed quantization range (β − α) for low-bit settings expands as the number of layers grows (Figure 3). Based on Equation 7, an expanded quantization range directly influences the error. This suggests that quantization may further compromise the accuracy of deeper models, which already suffer from oversmoothing. This potential degradation of accuracy is also reflected in Figure 3, where we measure it on GCN with INT2 quantization.
Motivated by the above observations, we devise a layer-wise smoothness approach that brings forth two primary benefits. Firstly, it facilitates smooth message propagation, thereby mitigating the problem of oversmoothing. Secondly, it helps to address the challenge of obtaining satisfactory low-bit representations, which is caused by substantial and abrupt updates during message propagation. Outline: In this section, we present our Smoothness-aware Message Propagation (SMP) solution, which aims to reduce the oversmoothing effect and suppress the quantization error to a bound. We first quantify the layer-wise smoothness and analyze the local smoothness of existing GNNs at message propagation. We then present the SMP mechanism that smooths the message propagation with a graph denoising approach. After transforming the optimization problem into a Lagrangian function, we develop an optimal solution involving a differential multiplier method (BDMM). We also prove the existence of a quantization error bound for quantized SMP. The results presented in Figure 3 (GCN+SMP) confirm that SMP can also help improve the general GCN in INT2 quantization and deeper layer settings.

Layer-wise Smoothness
We quantify the smoothness objective in Definition 5.1 by measuring the layer-wise local smoothness during message propagation between consecutive GNN layers.

Definition 5.1 (Layer-wise Smoothness). Given a graph G = (V, E, X), the l-th layer-wise smoothness S_l is the change, with degree normalization, of the similarity of connected nodes ∀(v_i, v_j) ∈ E from layer l−1 to layer l.
The layer-wise smoothness can be formulated as

S_l = Σ_{(v_i,v_j)∈E} ‖(h_i^l/√(d_i+1) − h_j^l/√(d_j+1)) − (h_i^{l−1}/√(d_i+1) − h_j^{l−1}/√(d_j+1))‖_2^2.

Specifically, S_l can also be represented as

S_l = tr((H^l − H^{l−1})^⊤ L (H^l − H^{l−1})),

where L represents the normalized Laplacian matrix. The term tr(H^⊤ L H) is the Laplacian regularization that makes H smooth over graph G; similarly, the l-th layer-wise smoothness tr((H^l − H^{l−1})^⊤ L (H^l − H^{l−1})) can be explained as smoothing the changes from layer l−1 to l over G.
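The matrix form of S_l is straightforward to compute; a small NumPy sketch with dense matrices, assuming the GCN-style self-loop normalization for the Laplacian:

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} (A + I) D^{-1/2}, using self-loops (GCN convention)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.eye(A.shape[0]) - d_inv_sqrt @ A_hat @ d_inv_sqrt

def layer_smoothness(H_l, H_prev, L):
    """S_l = tr((H^l - H^{l-1})^T L (H^l - H^{l-1}))."""
    D = H_l - H_prev
    return float(np.trace(D.T @ L @ D))
```

S_l is zero when consecutive layers produce identical embeddings and grows with the graph-weighted magnitude of layer-to-layer changes; since L is positive semi-definite, S_l is always non-negative.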

Smoothness-aware Message Propagation
SMP is designed to guide the training process to achieve local smoothness at the message propagation of each layer, by utilizing the smoothness measure presented in Definition 5.1. Intuitively, SMP aims to avoid drastic correlation/similarity changes for connected nodes to achieve local smoothness at each message update.
We formulate the SMP objective based on the graph denoising formulation (at each layer l ∈ {1, ..., L}) with degree normalization:

argmin_H ‖H − X‖_F^2 + λ tr(H^⊤ L H)   s.t.   S_l ≤ ε_0 |E|.   (17)

The optimization objective aims to find an optimal H* which we assume to be the correct feature embedding for the particular graph. We impose three different priors to extract this optimal embedding. The first term minimizes the distance to the original feature matrix (X); the second term imposes neighborhood similarity in Equation 17. These two objectives have been used in different methods in the literature. In SMP, we impose a new constraint (with S_l) which limits the change of embeddings between layers of the GNN and makes smooth transitions at each message passing iteration. ε_0 is the threshold controlling the allowed variation between the correlations/similarities of connected nodes between layers, and |E| is the number of edges in the graph.
This formulation shows that the constraint aims to mitigate the abrupt changes in the relations between connected nodes due to possibly interfering signals coming from neighboring nodes. Alternatively, S_l can be configured to capture the changes of the smoothness with different distance measures, e.g., the ℓ1 norm.
The Lagrange function for the objective at layer l has the following form:

L(H, λ_c, ζ) = ‖H − X‖_F^2 + λ tr(H^⊤ L H) + λ_c (S_l − ε_0|E| + ζ^2),   (18)

where ζ is a slack variable and λ_c is the Lagrangian multiplier. Equation 18 is differentiable with respect to H, ζ, and λ_c. However, the Lagrangian multiplier method does not directly work with gradient descent optimization, and deriving the optimal solution from the KKT (Karush-Kuhn-Tucker) conditions becomes cumbersome when the constraints are complex, as in the SMP case. Therefore, the optimal solution to Equation 18 can be derived with the basic differential multiplier method (BDMM) [41], which has been proved to optimize Lagrange multipliers in conjunction with the objective argument in a sequential manner. The BDMM updates are given below:

H ← H − η_H ∂L/∂H,   ζ ← ζ − η_ζ ∂L/∂ζ,   λ_c ← λ_c + η_λ ∂L/∂λ_c.   (19)

When we calculate the respective gradients in Equation 18 and incorporate them into Equation 19, we reach the formulation in Equation 20, where η_H, η_ζ, and η_λ are the respective step sizes.
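A toy sketch of the BDMM update pattern on a generic scalar problem min f(h) s.t. g(h) ≤ 0, with the inequality turned into an equality via a slack variable ζ (the step sizes and the projection of the multiplier are illustrative choices, not the paper's settings):

```python
def bdmm(f_grad, g, g_grad, h0, steps=4000, eta_h=0.01, eta_z=0.01, eta_l=0.01):
    """Basic differential multiplier method for: min f(h)  s.t.  g(h) <= 0.

    Lagrangian: f(h) + lam * (g(h) + z**2). Gradient descent on the primal
    variables h and z is interleaved with gradient ascent on the multiplier.
    """
    h, z, lam = h0, 1.0, 0.0
    for _ in range(steps):
        h = h - eta_h * (f_grad(h) + lam * g_grad(h))  # descend on h
        z = z - eta_z * (2.0 * lam * z)                # descend on slack
        lam = lam + eta_l * (g(h) + z * z)             # ascend on multiplier
        lam = max(lam, 0.0)                            # keep multiplier >= 0
    return h
```

For example, minimizing (h − 2)^2 subject to h ≤ 1 drives h to the constraint boundary h = 1. SMP applies the same alternating updates to H, ζ, and the multiplier of Equation 18.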
Variation of S_l for existing GNNs. To provide an intuitive understanding of layer-wise smoothness, we measure S_l for SMP and existing GNNs to quantitatively show how the measure varies across different GNN solutions. We compute it on 10-layer GCN, SMP, and several existing deep GNN solutions, including DropEdge, APPNP, EMP, and PairNorm. Figure 4 shows the average layer-wise smoothness S̄ of the 10-layer GNNs at each epoch. The GCN, without any safeguard mechanism against oversmoothing, creates extremely large layer-wise variations (higher S_l) when compared to the deep GNN solutions. This example illustrates that deep GNN methods mitigating oversmoothing are effective at controlling and increasing layer-wise smoothness (decreasing S_l) when compared with the general GCN. Additionally, the S̄ of SMP drops continuously when compared with other comparable deep GNNs. These experiments show that the iterative solution in Equation 20 is effective in controlling the change between layers by enforcing both node-wise smoothness and layer-wise smoothness.

Quantization error bound for SMP. Let H^{l,q} denote the quantized representation of H^l. Accordingly, for the quantized SMP, the smoothness constraint can be written as S_{l,q} = tr((H^{l,q} − H^{l−1,q})^⊤ L (H^{l,q} − H^{l−1,q})). We can prove that the quantization error is smaller than a bound, as provided in Lemma 1, which underlines the superiority of SMP in terms of quantization.

Lemma 1. For the l-th layer representation H^l, the quantization error of SMP is bounded.

Proof (sketch). The Laplacian matrix L is eigendecomposable, i.e., L = UΛU^⊤, where U is an orthogonal matrix (UU^⊤ = I). S_{l,q} can then be represented as S_{l,q} = ‖Λ^{1/2} U^⊤ (H^{l,q} − H^{l−1,q})‖_F^2, where the eigenvalues λ_i of the normalized Laplacian satisfy 0 ≤ λ_i ≤ 2 for 1 ≤ i ≤ n. Combined with the constraint S_{l,q} ≤ ε_0 |E| enforced during propagation, this bounds the layer-wise change of the quantized representations and hence the accumulated quantization error.

EXPERIMENTS
This section presents our experiments on benchmark datasets that illustrate the effectiveness of the QLR and QLR-with-BT* quantizers under low-bit settings. We also compare SMP with comparable deep GNN baselines, highlighting the capability of SMP in addressing the oversmoothing issue.

Experimental Setup
Datasets and Baselines. Our experiments are performed on five datasets, Cora, PubMed, CiteSeer [47], CS [49] and Reddit [23], in a semi-supervised node classification setting. The statistics of the datasets are summarized in Table 1. We start by comparing QLR against two state-of-the-art GNN quantizers, Degree-Quant [51] and Aggregate-Quant [67], on GCN [32].
We present the average accuracy and standard deviation over 10 random data splits for Cora and CiteSeer, and 5 for PubMed, CS and Reddit. For the Reddit dataset, owing to its size, we have employed mini-batch training with a batch_size of 20000. All of the experiments are based on PyTorch [40] and PyTorch Geometric [18]. The experiments are run on Ubuntu 20.04 with 64GB RAM.

Comparison with different quantizers
We compare QLR with the state-of-the-art GNN quantization solutions, Degree-Quant and Aggregate-Quant. Results are summarized in Table 2.
We notice that Aggregate-Quant by default maintains a fixed quantization level of INT4 for weights, while using fewer bits for input features. Moreover, it does not quantize the message-passing blocks of GCN, whereas QLR and Degree-Quant quantize all the elements equally. Hence, for fairness, we also add quantizers for its message-passing blocks and remove the INT4 constraint on its model weights. It is also important to note that Aggregate-Quant maintains a step size parameter for each node, which can be viewed as an extension of learned step size quantization (LSQ) [15] and makes it highly inflexible for inductive tasks. Due to that, it does not support mini-batch training, as the topology of the input graph changes with each mini-batch training iteration.
We observe that QLR significantly outperforms its competitors irrespective of the quantization level. The approach of optimizing the quantization range in the backward pass makes QLR more robust and effective, especially in low-bit cases. Aggregate-Quant demonstrates superior performance for CiteSeer when applied to INT8 quantization, which can be attributed to its significantly larger parameter size. However, its accuracy in low-bit cases degrades significantly when the message-passing blocks are also quantized fairly. As for Degree-Quant, while it can achieve comparable performance on INT8, it cannot generate the expected performance with INT4 and INT2 quantization on Reddit, due to its mask sampling strategy and low quantization level. It is noteworthy that QLR preserves its accuracy even for INT2 quantization across all datasets, while the alternatives fail to reach comparable accuracy. Additionally, QLR even outperforms the full precision (FP) model in INT8 quantization in many cases, showcasing its effectiveness as a noise filter for GNNs.
In Table 3, we report the model sizes of different quantization approaches with varying quantization levels and numbers of hidden units. Due to space limitations, we only present the model sizes on the CS dataset. As there is native 8-bit support, for a fixed number of hidden units, the sizes of INT8 with QLR and Degree-Quant are consistently reduced to approximately one-fourth of the FP counterpart. For smaller bit widths, we pack INT2 and INT4 values similar to the process described in [30]. The size of QLR is slightly larger than that of Degree-Quant due to the storage of the learned range parameters. Given the superior accuracy of QLR in low-bit settings, this slight increase in model size is negligible in comparison. Overall, with QLR and Degree-Quant, the INT2 and INT4 model sizes are significantly smaller than their FP counterparts, with reductions of 16× and 8×, respectively. However, the size of Aggregate-Quant is 2-6 times that of its counterparts, largely due to the dimension and per-node nature of its parameters.
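The sub-byte packing behind the INT2 model sizes can be sketched as follows; this is an illustrative layout (four 2-bit codes per byte, lowest bits first), not necessarily the exact scheme of [30].

```python
import numpy as np

def pack_int2(codes):
    """Pack 2-bit codes (values 0..3) into bytes, four codes per byte."""
    codes = np.asarray(codes, dtype=np.uint8)
    pad = (-len(codes)) % 4                    # pad to a multiple of four
    codes = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)])
    c = codes.reshape(-1, 4)
    packed = c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)
    return packed.astype(np.uint8)

def unpack_int2(packed, n):
    """Recover the first n 2-bit codes from the packed byte array."""
    p = np.asarray(packed, dtype=np.uint8)
    return np.stack([(p >> k) & 3 for k in (0, 2, 4, 6)], axis=1).ravel()[:n]
```

The packed buffer occupies ceil(n/4) bytes for n values, a 16× reduction versus float32 storage of the same tensor.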

Comparisons with existing deep GNNs
6.3.1 Node classification with existing deep GNNs. We compare SMP with the existing deep GNNs in terms of both full-precision (FP) and quantized models using QLR. Table 4 presents the classification accuracy results using 10-layer GNNs. Notably, SMP consistently outperforms the alternative methods on Cora, CiteSeer and CS with FP models, and is only slightly below APPNP on PubMed. SMP improves over EMP by enforcing smoothness at layer-wise message propagation during training and inference.
For quantized models, although INT8 achieves accuracy close to FP for all methods, the INT4 performance of DropEdge and PairNorm drops significantly, rendering them uncompetitive in some cases, and they run out of memory (OOM) on larger datasets. This is primarily due to the complexity of the "backbone" models [42]. For example, PairNorm, employing a GCN backbone, trains a weight matrix for each layer. Likewise, DropEdge utilizes more intricate backbones such as GCN, ResGCN [25], and IncepGCN [50], and introduces connection perturbations at every layer. Consequently, these factors lead to higher computational and storage requirements, which escalate further with model depth. In contrast, EMP, APPNP, and SMP employ much simpler architectures that involve only two weight matrices prior to GNN propagation. As a result, these models have more moderate and scalable requirements, making them more amenable to deep GNN quantization.
SMP also consistently improves low-bit performance, enabling more stable training compared to the other methods. This can be explained by its smooth, narrow-ranged representations across layers, which enable QLR to identify more precise low-bit representations. The empirical results are also in line with the quantization error upper bound for quantized SMP proved in LEMMA 1. We also note that QLR can improve EMP and APPNP to achieve reasonable accuracy with graceful degradation even under INT2 quantization, which highlights the importance of optimizing the quantization range for informative representations. Figure 6 presents the full training process of SMP and EMP with variations of our quantization approach on the CS dataset. While SMP and EMP show unstable performance with INT2 representations generated by the basic quantizer, applying skewness-aware BT (BT*) yields improvements that are close to those of INT8 quantization. Furthermore, SMP is more robust than EMP, owing to the quantization error bound that holds for SMP.

Inference Speedup
In Table 5, we report the inference times associated with varying quantization levels for the 2-layer GCN and SMP architectures, respectively. These model inferences are conducted on the Reddit dataset. The ↑ signifies the inference speedup compared with the FP model. To realize quantized GNN speedups across different quantization levels, we leverage the recent Tensor Core-based approach QGCT [58], applied to both GCN and SMP. We observe notable speedups of 5.11× and 6.44× with SMP and GCN, respectively, for low-bit representation (INT2), compared to their FP counterparts. The speedup for SMP is slightly lower than that of GCN, attributable to the additional computational overhead of SMP. Remarkably, with the same number of layers, SMP shows superior accuracy relative to GCN, as exemplified in Table 2 and Figure 5 on the CS dataset with two layers. Specifically, the INT8 and INT4 accuracy of SMP outperforms GCN by approximately 2%, while SMP in INT2 mode demonstrates a performance advantage over GCN of up to 13.5%.

CONCLUSION
We have introduced an end-to-end solution towards scalable deep GNNs, comprising an efficient quantization scheme with learnable ranges and skewness-aware bitwise truncation, together with a smoothness-aware message propagation (SMP) mechanism for efficiently training and managing large, deep GNNs. The solution reduces the model size while maintaining classification accuracy even in low-bit representations. The message passing block in training is enforced to have layer-wise smoothness, constraining the changes between neighboring nodes. We formulate this as an additional constraint in a graph denoising optimization function and solve it via a Lagrangian function with the iterative BDMM algorithm. It mitigates the oversmoothing problem in GNNs and avoids the performance degradation encountered in low-bit quantization-aware training. We also provide an upper bound on the error of the quantized SMP algorithm. Experiments show that the proposed solution achieves significant improvements over state-of-the-art approaches, yielding model sizes an order of magnitude smaller than the full-precision (FP) model with comparable accuracy, and mitigating the oversmoothing problem on benchmark datasets.

Figure 1: Quantization error for the aggregate output of the Cora dataset, across varying scales in [0.05, 1.0]. Note that the range parameters affect the error in different directions across quantization levels from INT2 to INT8. Specifically, with an optimized range, the distortion in INT2 can be reduced to a scale similar to that of INT4 and INT8, as shown in Figure 1(d). Hence, a learnable range can optimize the quantization interval, reducing the total error even in extreme low-bit representations.
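The trend in this figure can be reproduced qualitatively with a basic uniform quantizer: for a fixed bit width, too small a clip range inflates clipping error, while a better-chosen range sharply reduces the total distortion. The sweep below is a synthetic sketch on Gaussian data, not the paper's measurements.

```python
import numpy as np

def quant_mse(x, scale, bits):
    """Mean squared error of a uniform symmetric quantizer with clip range [-scale, scale]."""
    qmax = 2 ** (bits - 1) - 1
    step = scale / qmax
    xq = np.round(np.clip(x, -scale, scale) / step) * step
    return float(np.mean((x - xq) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                   # stand-in for an aggregate output
scales = np.linspace(0.05, 1.0, 20)
err_int2 = [quant_mse(x, s, bits=2) for s in scales]
# For INT2, the best scale in the sweep cuts the error by a large factor
# relative to the smallest scale, mirroring the effect of learning the range.
```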

Figure 2: Kurtosis and skewness of different datasets at each epoch

4.2 Skewness-aware Bitwise Truncation

Figure 2 illustrates the kurtosis and skewness of the message passing and aggregate output blocks for 10-layer SMP, detailed later in Section 5, using INT2, INT2-8 (BT), and INT2-8* (BT*) quantization across two datasets. We note that the kurtosis and skewness of the normal distribution are 3 and 0, respectively. Therefore, κ̂ = |κ − 3| can be used to measure the normality of a tensor, where a smaller κ̂ indicates a distribution closer to normal. The average kurtosis of BT* remains consistently smaller throughout the epochs compared to INT2 and BT, which indicates a robust training process. Similar trends are observed for the skewness.
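The normality measure described above can be computed directly from tensor moments; a minimal sketch (the helper name is ours):

```python
import numpy as np

def normality_gap(x):
    """Return (skewness, |kurtosis - 3|) of a tensor.

    The kurtosis of a normal distribution is 3 and its skewness is 0, so a
    smaller gap indicates a distribution closer to normal.
    """
    z = (x - x.mean()) / x.std()
    skew = float(np.mean(z ** 3))
    kurt = float(np.mean(z ** 4))
    return skew, abs(kurt - 3.0)

rng = np.random.default_rng(0)
skew, gap = normality_gap(rng.normal(size=100_000))
# Both skew and gap should be near zero for a large Gaussian sample.
```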

Figure 3: Effect of layer-wise smoothness on the aggregate output quantization range in INT2 quantization

Definition 5.1 (Layer-wise Smoothness). Given a graph G = (V, E, X), the ℓ-th layer-wise smoothness is the change between connected nodes ∀(v_i, v_j) ∈ E, with degree normalization, from layer ℓ−1 to layer ℓ.

Figure 4: Average layer-wise smoothness (S) against different epochs for GNNs

SMP Contribution to Quantization. The quantization error for the ℓ-th layer representation can be expressed as ε_ℓ = ∥H^(ℓ) − H_q^(ℓ)∥²₂, where H_q^(ℓ) is the quantized representation of H^(ℓ). Accordingly, for the quantized SMP, the smoothness constraint can be written as S_{q,ℓ} = tr((H_q^(ℓ) − H_q^(ℓ−1))ᵀ L (H_q^(ℓ) − H_q^(ℓ−1))). We can prove that ε_ℓ is smaller than a bound, as provided in LEMMA 1, which underlines the superiority of SMP in terms of quantization.
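The smoothness constraint reduces to a quadratic form in the graph Laplacian, which equals the sum over edges of squared differences of the layer-to-layer change at the two endpoints. A small sketch (the 3-node path graph and helper name are ours):

```python
import numpy as np

def layerwise_smoothness(H_l, H_prev, L):
    """tr((H_l - H_prev)^T L (H_l - H_prev)): the Laplacian quadratic form
    of the change between two consecutive layer representations."""
    D = H_l - H_prev
    return float(np.trace(D.T @ L @ D))

# 3-node path graph: Laplacian L = Degree - Adjacency
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

H_prev = np.zeros((3, 1))
H_l = np.array([[1.0], [2.0], [3.0]])     # change of 1, 2, 3 at the three nodes
s = layerwise_smoothness(H_l, H_prev, L)  # (1-2)^2 + (2-3)^2 = 2
```

Constraining this quantity per layer keeps neighboring nodes from drifting apart (or collapsing together) too quickly, which is precisely what SMP enforces during propagation.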

Figure 5: Results of SMP and EMP with varying layers

Figure 6: Full training process of SMP and EMP with variations of the quantization approach on the CS dataset

Table 1: Statistics of benchmark datasets

Table 3: Model size (MB) with different numbers of hidden units for different quantization methods

Under Degree-Quant's mask sampling, mini-batches generate different representations of the same node across batch training, which curbs the overall accuracy of the representation. However, QLR can directly optimize the learnable quantization range based on observations of subgraphs, and hence reduce the overall quantization error. These results further confirm that optimizing the quantization range in QLR enables better preservation of accuracy in low-bit representations.

Table 4: Classification accuracy of deep GNN methods (%) on benchmark datasets (* denotes BT*)

For simplicity, we narrow the search space to lr ∈ {0.005, 0.008, 0.01, 0.015}, wd = 5e−4, lr = 5e−4, wd = 1e−5. We note that SMP-FP and SMP-INT8 outperform EMP-FP and EMP-INT8 in most cases with varying margins. The improvements can be further enhanced by tuning the parameters over the wider search space listed in Section 6.1. We also note that SMP-INT4 consistently outperforms EMP-INT4 in nearly all cases across different numbers of layers (except 2-4 layers on PubMed and 2 layers on CS). SMP-INT2 achieves relatively high accuracy compared with EMP-INT2, which underlines the benefit of the smoothness constraint of SMP for extreme low-bit quantization.

6.3.3 Effect of Bitwise Truncation on GNN quantization. For INT2 quantization in Figure 5, BT (INT2-8) and BT* (INT2-8*) outperform INT2 quantization with the basic QLR in most cases, while the accuracy of EMP-INT2-8* on CiteSeer is lower (by around 0.1%-2.5%) than that of INT2 at 6-12 layers. Similar results are observed on PubMed for SMP-INT2-8* at 6-8 layers, with margins of 0.1%-1.3%. Comparing INT2-8 and INT2-8* under the same circumstances, the accuracy of INT2-8* is significantly higher than that of INT2-8 in most cases (except on CiteSeer, where INT2-8* is slightly lower than INT2-8 for SMP at 6-8 layers and for EMP at 8 layers). INT2-8* demonstrates an advantage over INT2-8 on the larger datasets, i.e., PubMed and CS, where the accuracy of INT2-8* is close to that of INT4.