Beyond the Additive Nodes' Convolutions: a Study on High-Order Multiplicative Integration

Graph Convolutional Neural Networks (GCNs) compute representations of graph nodes by exploiting convolution operators based on some neighborhood aggregation scheme. These operators are defined by using several stacked Graph Convolutional (GC) layers. They are usually defined as additive building blocks that fuse multiple information streams. However, when considering information integration in sequences, the flow of gradient has been shown to be more robust by adopting the Multiplicative Integration (MI) technique. Because of that, it is worth investigating the impact of MI in Graph Neural Networks. We propose three different GC layers that exploit MI to improve various aspects of the neighborhood aggregation scheme. We report both a theoretical and empirical comparison of our proposals with respect to the most common GC operators for the graph classification task.


INTRODUCTION
Recently, there has been an increased interest in machine learning models that deal with graph-structured data, such as kernel methods and neural networks. Graph Neural Networks (GNNs) define a neural architecture that follows the graph topology. From the neurons associated with a vertex and its neighbors, a hidden representation is computed for the same vertex in the following network layer. For every hidden layer of the GNN, a new transformation is performed. The transformations are determined by relying on the definition of a convolution operator in the graph domain. Graph Convolutions (GC) are generally based on a neighborhood aggregation scheme (aggregate) [8] considering, for each node, only its direct neighbors, and a combine operation that merges the representation of a node with the aggregation of its neighbors. The neighborhood aggregation schemes implemented by the various GCs proposed in literature usually exploit an additive building block [36].
In the structured domain, particularly in sequential learning, a different procedure of information integration has been studied: the Multiplicative Integration (MI) [36]. The idea is that, instead of utilizing the sum operation to join the information conveyed by the various elements that compose the recurrent model equation, MI exploits the Hadamard product. Without introducing any extra parameters, the authors leverage second-order interactions between features, i.e., relationships or dependencies that exist between pairs of features within the dataset. Unlike first-order interactions, which involve individual features in isolation, second-order interactions consider how two features jointly influence the output or prediction. One of the first applications of MI on sequential domains was proposed by Goudreau et al. [9], who introduced the Second-Order Single-Layer Recurrent Neural Networks (Second-order SLRNN).
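To make the contrast concrete, the two integration schemes for a recurrent transition can be written side by side. This is only a sketch in the spirit of the formulation discussed in [36]; the symbols $\mathbf{x}_t$ (input at step $t$), $\mathbf{h}_{t-1}$ (previous hidden state), $\mathbf{W}$, $\mathbf{U}$, $\mathbf{b}$ and the activation $\phi$ are notation assumed here for illustration.

```latex
\begin{align}
\text{additive:} \quad
  & \mathbf{h}_t = \phi\left(\mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{h}_{t-1} + \mathbf{b}\right) \\
\text{multiplicative:} \quad
  & \mathbf{h}_t = \phi\left(\mathbf{W}\mathbf{x}_t \odot \mathbf{U}\mathbf{h}_{t-1} + \mathbf{b}\right)
\end{align}
```

Both forms use the same parameters. In the second one, each output unit depends on products of pairs of input and hidden features, which is the second-order interaction described above.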
The most common model for sequences that adopts the MI is the LSTM [14] (or variants of it, e.g., GRU [3]). This model employs the MI to implement a gating mechanism to manage long-term temporal dependencies. An enhanced version of the LSTM (and earlier, of the RNN) that exploits the MI also to define the recurrent mechanism is the multiplicative LSTM/RNN [23,32]. These models use the Hadamard product to combine the projection of the current time step with the hidden states that come from the previous time step. The idea of using MI to manage a gating mechanism and to combine the information flow from different temporal domains is also used in many other models, such as the Highway Network [39]. Another model that exploits a similar technique is the HyperNetwork [11]. The HyperNetwork dynamically generates the weights of a network using another (smaller) network. In particular, the recurrent version (the HyperLSTM) generates a multiplicative bias that drives the generation of dynamical weights. A similar approach that belongs to the Bayesian framework was proposed by Krugers et al. [24].
In Graph Neural Networks, the aggregation/combination operation of a GC shares some critical mechanisms with the time-based aggregation mechanism used by Recurrent Neural Networks. Indeed, in literature, Multiplicative Integration is shown to be particularly convenient to aggregate contextual information that comes from different sources [19]. Inspired by this similarity, in this paper, we explore how MI can be applied to define novel GCs.
Some recently proposed GNN approaches do exploit some form of multiplicative mechanism, for instance implementing gating mechanisms [2,29,33,34] or hypernetwork-like models [2,12,28]. However, none of them adopts the MI paradigm concerning nodes' neighborhoods, and comparing such models with additive graph convolutions does not show a consistent performance improvement.
Inspired by the promising results obtained by Multiplicative Integration in sequential domains, we explore its application in GNNs. To the best of our knowledge, this is the first paper that explores the application of MI inside a GC operator. We propose three definitions of MI-based GC operators that stem from the commonly used and very effective GraphConv operator [27]. These three operators are defined with the aim of exploring how the MI can be embedded into a graph convolution to obtain a second-order GC operator. Such second-order interactions between features can capture more intricate patterns and relationships in the data, enabling us to go beyond traditional first-order feature analysis. We empirically evaluate the proposed MI-GNNs on eight commonly adopted graph classification benchmarks. The experiments show how MI, applied in the aggregation and/or combination step, allows us to uncover hidden dependencies contributing to improved model performances. We compare the proposed methods with the most common additive graph convolutional operators. In particular, we analyze the results regarding the accuracy and computational time required for training. The results highlight how the use of MI can help obtain improved performance in terms of accuracy and speed of convergence. We apply rigorous statistical hypothesis testing to assess the statistical significance of the observed improvements. Considering that the application of MI also influences the form and the flow of the gradient of the GNN, we analyze how gradient propagation differs between multiplicative and additive GCs.

GRAPH NEURAL NETWORKS
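Let $G = (V, E, \mathbf{X})$ be a graph, where $V = \{v_1, \ldots, v_n\}$ denotes the set of nodes of the graph, $E \subseteq V \times V$ is the set of edges and $\mathbf{X} \in \mathbb{R}^{n \times s}$ is a multivariate signal on the graph nodes with the $i$-th row $\mathbf{x}_{v_i}$ representing the attributes of $v_i$. We define $\mathbf{A} \in \mathbb{R}^{n \times n}$ as the adjacency matrix of the graph, with elements $a_{ij} = 1 \iff (v_i, v_j) \in E$. With $\mathcal{N}(v)$, we denote the set of nodes adjacent to node $v$.
A Graph Neural Network (GNN) is a model that exploits the structure of the graph and the information embedded in the feature vectors of each node in order to learn a representation $\mathbf{h}_v \in \mathbb{R}^{m}$ for each vertex $v \in V$. In modern GNN models, the computation of $\mathbf{h}_v$ can be divided into two main steps: aggregate and combine. We can define aggregation and combination by using two functions, $\mathcal{A}$ and $\mathcal{C}$, respectively: $$\mathbf{h}_v = \mathcal{C}\left(\mathbf{x}_v, \mathcal{A}\left(\{\mathbf{x}_u : u \in \mathcal{N}(v)\}\right)\right).$$ It is possible to extend the range of the considered neighborhood by iteratively performing aggregation and combination for $k$ iterations. In this way, we obtain a hidden representation $\mathbf{h}_v^{(i)}$ of the node $v$ that contains information about the structure and the neighbors that are at a distance $i$ from $v$: $$\mathbf{h}_v^{(i)} = \mathcal{C}^{(i)}\left(\mathbf{h}_v^{(i-1)}, \mathcal{A}^{(i)}\left(\{\mathbf{h}_u^{(i-1)} : u \in \mathcal{N}(v)\}\right)\right), \quad \mathbf{h}_v^{(i)} \in \mathbb{R}^{m_i},$$ where $m_i$ is the size of the convolutional layer $i$. We thus obtain a deep GNN of $k$ layers. The choice of aggregation function $\mathcal{A}$ and combination function $\mathcal{C}$ defines the type of Graph Convolution (GC) adopted by the GNN [21,26].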
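The aggregate/combine scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the paper: the function names `sum_aggregate`, `add_combine`, and `message_passing`, and the toy path graph, are assumptions introduced here.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def sum_aggregate(neighbor_states: Sequence[Vector], dim: int) -> Vector:
    """Aggregation A: element-wise sum, invariant to the neighbor order."""
    total = [0.0] * dim
    for state in neighbor_states:
        for j, value in enumerate(state):
            total[j] += value
    return total


def add_combine(own: Vector, aggregated: Vector) -> Vector:
    """Toy combination C: element-wise sum of node state and message."""
    return [a + b for a, b in zip(own, aggregated)]


def message_passing(
    features: Dict[int, Vector],
    neighbors: Dict[int, List[int]],
    num_layers: int,
    aggregate: Callable[[Sequence[Vector], int], Vector] = sum_aggregate,
    combine: Callable[[Vector, Vector], Vector] = add_combine,
) -> Dict[int, Vector]:
    """Apply aggregate and combine for num_layers iterations."""
    states = {v: list(x) for v, x in features.items()}
    for _ in range(num_layers):
        new_states = {}
        for v, h_v in states.items():
            messages = [states[u] for u in neighbors[v]]
            new_states[v] = combine(h_v, aggregate(messages, len(h_v)))
        states = new_states
    return states


# Hypothetical path graph 0 - 1 - 2 with a signal only on node 0.
features = {0: [1.0], 1: [0.0], 2: [0.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
one_hop = message_passing(features, neighbors, num_layers=1)
two_hop = message_passing(features, neighbors, num_layers=2)
```

In this toy graph, node 2 receives information from node 0 only after the second iteration, which is the sense in which stacking layers extends the considered neighborhood.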
In the last few years, several different GCs have been proposed, and most of them share the same computational building block that exploits a form of additive combination function. Generalizing, this building block (Eq. (2)) is defined in terms of a nonlinear activation function $\sigma(\cdot)$ and a function $F$ that is usually a linear projection. Moreover, it is important to notice that the aggregation function $\mathcal{A}$ is commonly based on the sum of the neighbor embeddings. This commutative operation allows it to be invariant with respect to the order of the neighbors.
A very common formulation is the GraphConv [27], which is based on the 1-dimensional Weisfeiler-Leman graph invariant (1-WL) [10]. The GraphConv is defined as follows: $$\mathbf{h}_v^{(i)} = \sigma\left(\mathbf{W}\mathbf{h}_v^{(i-1)} + \mathbf{W}_\Sigma \sum_{u \in \mathcal{N}(v)} \mathbf{h}_u^{(i-1)} + \mathbf{b}\right),$$ where $\mathbf{W}, \mathbf{W}_\Sigma \in \mathbb{R}^{m_i \times m_{i-1}}$ (with $m_0 = s$) and $\mathbf{b} \in \mathbb{R}^{m_i}$ are the network learnable parameters. Here and in the following sections, we denote the parameters with subscripts referring to their related function.
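As a concrete reading of the additive GraphConv update for a single node, the following dependency-free sketch may help. It assumes the usual form with one projection for the node, one for the summed neighbors, and a shared bias; the helper names (`matvec`, `graph_conv_node`) and the numeric values are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = List[Vector]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Dense matrix-vector product."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in weights]


def graph_conv_node(
    h_v: Vector,
    neighbor_states: Sequence[Vector],
    W: Matrix,
    W_sum: Matrix,
    b: Vector,
    activation: Callable[[float], float] = math.tanh,
) -> Vector:
    """Additive update: activation(W h_v + W_sum * sum_u h_u + b)."""
    if neighbor_states:
        aggregated = [sum(column) for column in zip(*neighbor_states)]
    else:
        aggregated = [0.0] * len(h_v)
    self_term = matvec(W, h_v)
    neighbor_term = matvec(W_sum, aggregated)
    return [
        activation(s + n + bias)
        for s, n, bias in zip(self_term, neighbor_term, b)
    ]


# Hypothetical values; the identity activation keeps the arithmetic visible.
identity = lambda z: z
W = [[1.0, 0.0], [0.0, 1.0]]
W_sum = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]
out = graph_conv_node([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], W, W_sum, b, identity)
out_permuted = graph_conv_node([1.0, 2.0], [[0.0, 1.0], [1.0, 0.0]], W, W_sum, b, identity)
```

Because the neighbors are summed before the projection, permuting them leaves the output unchanged.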

MULTIPLICATIVE INTEGRATION AND GRAPH CONVOLUTION
The GC operators defined in literature generally follow the same general structure defined in Eq. (2), where $\mathcal{A}$ is defined as an additive block. This work explores how the Hadamard product can help perform aggregation and combination in GC. We decided to explore the application of Multiplicative Integration on GC because the aggregation and the combination blocks basically integrate contextual information from different sources, and MI is an effective methodology to perform this task [19]. The advantage of using Hadamard products to combine information flow in recurrent neural networks was discussed in [36]. Moreover, using MI allows the definition of second-order GCs: the resulting model defines a multiplicative interaction between all the neurons that compose two contiguous GC layers, consequently influencing the model's gradient. Using second-order layers to learn about structured domains has already shown promising results, particularly considering sequential domains. In more detail, in [9], the authors study the improvement in representational ability achieved by utilizing a second-order RNN where the recurrent neurons of a single layer are multiplied by each other. A similar approach is exploited in defining the mLSTM [23], where a multiplicative block is used to differentiate the recurrent transition function as a function of the input. Another advantage worth mentioning that arises from using the MI paradigm is that the model has better gradient properties, i.e., the gradient propagates more strongly from one layer to the other.
In the following part of this section, we present three different possibilities to extend the GraphConv operator by Multiplicative Integration.

MI-GNN
The first proposed version of MI-GNN (named MI-GNN-v1) exploits MI to integrate the information from the current node with the one from its neighborhood. This is obtained by replacing the sum of the additive version of the GraphConv with the Hadamard product between the current node projection and the aggregation of its neighbors (Eq. 4). The two bias terms $\mathbf{b}$ and $\mathbf{b}_\Sigma$ are inserted to obtain a more expressive formulation. In fact, by distributing the product over the sum and rearranging the terms, we get an equivalent equation whose first two terms constitute a scaled version of the regular additive interaction between $\mathbf{h}_v^{(i-1)}$ and the embeddings of its neighbors. The scaling is dynamic since, in addition to the constant bias terms, the scaling factor also depends on the cardinality of the neighborhood. Moreover, the weights regulating the multiplicative integration between these two components, defined by the third term, are obtained by combining the two weight matrices $\mathbf{W}$ and $\mathbf{W}_\Sigma$ of the convolution, with no increase in the number of parameters with respect to the additive GraphConv, except for the additional bias.
To further enhance the expressiveness of the MI-GNN, at the cost of increasing the number of convolution parameters, the additive and multiplicative building blocks can be combined explicitly, each with its own weights. We named this variation MI-GNN-v2 (Eq. 6).
The third MI variation of the GraphConv that we propose, MI-GNN-v3, exploits the Hadamard product to define both the combination and aggregation steps of the GC operator. This formulation makes the interaction among all nodes involved in the convolution uniform, implementing a global gating mechanism while preserving the sharing scheme of the parameters used in GraphConv.
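The following sketch illustrates the idea behind MI-GNN-v1 and the expansion into additive and multiplicative terms. It is written under an explicit assumption: the update is taken to be $(\mathbf{W}\mathbf{h}_v + \mathbf{b}) \odot (\mathbf{W}_\Sigma \sum_u \mathbf{h}_u + \mathbf{b}_\Sigma)$ before the activation. The exact placement of the bias terms in the paper's equation is not recoverable from the text, so this is one plausible reading, and all function names and values are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Dense matrix-vector product."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in weights]


def mi_combine(h_v, aggregated, W, W_sum, b, b_sum) -> Vector:
    """Assumed MI form: (W h_v + b) * (W_sum agg + b_sum), element-wise."""
    node_part = [p + bias for p, bias in zip(matvec(W, h_v), b)]
    neigh_part = [q + bias for q, bias in zip(matvec(W_sum, aggregated), b_sum)]
    return [p * q for p, q in zip(node_part, neigh_part)]


def mi_expanded(h_v, aggregated, W, W_sum, b, b_sum) -> Vector:
    """Same quantity after distributing the product over the sums."""
    wh = matvec(W, h_v)
    ws = matvec(W_sum, aggregated)
    return [
        bs * a        # scaled node projection (additive-like term)
        + bb * c      # scaled neighborhood projection (additive-like term)
        + a * c       # second-order, purely multiplicative term
        + bb * bs     # constant term
        for a, c, bb, bs in zip(wh, ws, b, b_sum)
    ]


# Hypothetical parameters chosen so that the arithmetic is easy to follow.
h_v = [1.0, 2.0]
aggregated = [3.0, 1.0]          # sum of the neighbor embeddings
W = [[1.0, 2.0], [0.0, 1.0]]
W_sum = [[0.5, 0.0], [1.0, 1.0]]
b = [1.0, -1.0]
b_sum = [2.0, 0.5]

compact = mi_combine(h_v, aggregated, W, W_sum, b, b_sum)
expanded = mi_expanded(h_v, aggregated, W, W_sum, b, b_sum)
```

Under this assumption the two functions compute the same vector, and no parameter is added with respect to the additive layer apart from the second bias.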
The use of MI as a combination mechanism, however, involves multiplying several node embeddings (projected using the same shared weights), which can lead to numerical stability issues in case of extremely small (close to 0) or large (significantly higher than 1) values. This can make the training phase unstable. To solve this issue while maintaining a multiplicative integration approach as an aggregation mechanism, we propose to transform the product among the neighbors into a sum by exploiting the logarithm function jointly with the ReLU function, so that the co-domain of $\mathcal{A}(\cdot)$ is limited to a set of values that ensure a more stable training phase. In the resulting formulation, $\epsilon$ is a small positive constant that prevents the input to the logarithm from being 0. Notice that when $\epsilon = 1$, we get as output only positive values, avoiding negative values with large magnitude. We also applied the ReLU function to the projection of the current node embedding ($\mathbf{W}\mathbf{h}_v$) because we want to ensure that the new embedding will be a positive tensor. This is critical since it will be used as input to a logarithm in the subsequent GC layer (if it exists, i.e., if $i < k$).
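The product-to-sum transformation mentioned above can be illustrated as follows. This sketch only shows the numerical trick (a product of positive values equals the exponential of the sum of their logarithms); it does not reproduce the paper's full MI-GNN-v3 update, and the function names and sample values are assumptions.

```python
import math
from typing import List, Sequence

Vector = List[float]


def relu(x: float) -> float:
    return max(0.0, x)


def log_aggregate(neighbor_states: Sequence[Vector], eps: float = 1.0) -> Vector:
    """Sum over neighbors of log(ReLU(h_u) + eps), element-wise."""
    dim = len(neighbor_states[0])
    return [
        sum(math.log(relu(state[j]) + eps) for state in neighbor_states)
        for j in range(dim)
    ]


def product_aggregate(neighbor_states: Sequence[Vector], eps: float = 1.0) -> Vector:
    """Direct product over neighbors of (ReLU(h_u) + eps), element-wise."""
    dim = len(neighbor_states[0])
    out = [1.0] * dim
    for state in neighbor_states:
        for j in range(dim):
            out[j] *= relu(state[j]) + eps
    return out


# Hypothetical neighbor embeddings, including a negative entry.
neighbor_states = [[2.0, -1.0], [0.5, 3.0]]
log_sum = log_aggregate(neighbor_states, eps=1.0)
direct = product_aggregate(neighbor_states, eps=1.0)
```

With `eps = 1.0` every term inside the logarithm is at least 1, so each entry of `log_sum` is non-negative, and `exp(log_sum)` matches the direct product while avoiding the repeated multiplications that cause overflow or underflow.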

RELATED WORKS
Several models inspired by the graph convolution idea have been proposed recently. Some of them exploit the multiplicative operator to implement a gating mechanism. In particular, the Graph Attention Network (GAT) [34] uses multiplication in defining a convolution operator based on masked self-attention. The idea is to replace the adjacency matrix in the convolution with a matrix of attention weights for multiple heads. While it may be more complex to train, GAT allows assigning a different weight to each neighbor of a node; thus, it is a very expressive graph convolution.
A similar gating approach is proposed in [25], where the authors propose an extension of Graph Neural Networks to produce sequences in output. They define a propagation model reminiscent of the GRU cell definition. In the resulting convolutional operator, the aggregation is still defined using an additive block. In contrast, a projection of the node embeddings is used to perform GRU-like updates of the embedding that mix the information from the other nodes and the previous timestep.
The recent work of Koishekenov and Bekkers [22] explores how best to combine features to condition GNNs on additional information. Their "strong conditioning" is the Hadamard product between the weighted adjacency and feature matrices that replace the layers in an MLP. Differently from MI-GNNs, these three discussed gating convolutions [22,25,33] do not exploit multiplicative integration to aggregate the embeddings of the various nodes in message-passing graph neural networks.
A different way to exploit the multiplicative operator in defining the GC is proposed in [2], where the author proposes the GNN-FiLM. This model exploits a hyper-network that, given the node embedding, outputs two coefficients. One of the resulting coefficients is multiplied by the weights of the GCN, while the other is summed to the affine transformation of the node embeddings. Similar to the previously discussed gated GCs, in GNN-FiLM, the aggregation and the combination steps are based on a summation between node representations.
Hua et al. have proposed node pooling based on invariant multiplication [16], where they define a GNN layer that can be seen as the composition of a linear layer with a weight matrix, a multiplicative pooling layer, and another linear map. This layer is used as both the aggregation and update steps of a GNN; thus, it does not explicitly leverage the graph structure, but it relies purely on higher-order feature interactions.
Finally, in [28], the authors propose two novel convolutions based on the hyper-networks framework. Both operators dynamically adjust the GC parameters based on the input node; to do so, the model uses a hyper-network generating a vector having a different parameter for each node. This vector is then multiplied by the GC's weights to adjust them. Similar to the GNN-FiLM, the multiplicative blocks are used to adapt the weights for the current input without explicitly defining a multiplicative interaction among the embeddings of the nodes and their neighborhood.

EXPERIMENTAL SETUP AND RESULTS
Our experimental assessment aimed to empirically verify whether multiplication can be used to define graph convolutions with competitive performance. Specifically, our experimental results show that representative methods in literature exploiting multiplicative operators, i.e., GAT and GNN-FiLM, do not show a performance advantage over addition-based GCNs. We then explore the use of MI in graph neural networks, evaluating the proposals presented in Section 3, along with four widely adopted baselines, on seven graph classification benchmark datasets. We start by discussing the considered GNN architectures and the adopted model selection procedure in Section 5.1. Then, we describe the adopted datasets in Section 5.2. We then present the comparison in terms of classification accuracy in Section 5.3, where we also empirically study the impact of adopting multiplicative integration in terms of training time and speed of convergence. Finally, we validate the statistical significance of the difference between the results obtained by the various considered models in Section 5.4.

GNN Architecture and Model Selection
We considered as baselines two commonly adopted GC operators, GCN [21] and GraphConv. Moreover, we also experiment with two powerful convolutions for graphs that exploit the multiplicative operation differently from the MI-GNN: GAT and GNN-FiLM.
We handled the experiments and validated all the models' hyperparameters adopting the GraphGym [38] framework. Specifically, we started from their findings on the GNN architecture design space by setting a common baseline configuration for all the experiments based on the work of [38]. We set the PReLU as activation function $\sigma$ [13], batch normalization [18] for each layer, and among layers, we adopted the SKIP-CAT scheme [17]. The training is carried out with the ADAM optimizer [20], a cosine learning rate schedule (starting from 0.01 and annealed to 0, no restarting), and $5 \times 10^{-4}$ L2 weight decay for regularization. The batch size is set to 32 for all the datasets and we let every experiment run for 400 epochs. We used the libraries PyTorch 2.0.0 and PyTorch Geometric 2.3.0, and our experiments have been carried out on a computing cluster equipped with NVIDIA RTX A5000 GPUs. Each tested network consists of Multilayer Perceptron (MLP) layers before and after the GC operator layers. This particular architectural setting is the one suggested in GraphGym. The number of these layers and their hidden units are hyperparameters. In our evaluation, we consider graph classification tasks; therefore, all the considered models have a global pooling layer to compute a graph-level representation given the node embeddings. The pooling layer is defined by concatenating the global mean, max, and sum aggregations [5] of the node embeddings, where $k$ denotes the number of GC layers.
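The readout described above (concatenation of global mean, max, and sum) can be sketched for a single graph as follows. The function name `global_readout` and the toy embeddings are assumptions for illustration; which layer's embeddings feed the readout in the actual models is not specified here.

```python
from typing import List, Sequence

Vector = List[float]


def global_readout(node_embeddings: Sequence[Vector]) -> Vector:
    """Concatenate element-wise mean, max, and sum over all nodes of a graph."""
    num_nodes = len(node_embeddings)
    columns = list(zip(*node_embeddings))      # one tuple per feature
    total = [sum(column) for column in columns]
    mean = [t / num_nodes for t in total]
    maximum = [max(column) for column in columns]
    return mean + maximum + total


# Hypothetical graph with two nodes and two features per node.
graph_repr = global_readout([[1.0, 4.0], [3.0, 0.0]])
```

The output has three times the embedding size, and, like each of its three parts, it does not depend on the order in which the nodes are listed.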
Each dataset is split into train/validation/test sets according to an [80%, 10%, 10%] random split. Every configuration is run 3 times, and we take the average of all the evaluation metrics (accuracy, time, etc.) measured on the test set at the best epoch in validation. The random generator seed is set to the same value at the beginning of each run, thus ensuring that the dataset splits are equal for each model and making their comparison more robust and fair. We performed a full grid search over all the hyper-parameter combinations reported in Table 1, resulting in 96 configurations tested for each of the 7 datasets and each of the 7 GNN layer types. The layers for GraphConv, GCNConv, GATConv, and FiLMConv are taken from the PyTorch Geometric library.
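A seeded 80/10/10 split of the kind described above could look like the sketch below. It is not the GraphGym splitting code; `split_indices` and its parameters are hypothetical names, and it only illustrates why fixing the seed gives every model the same partition.

```python
import random
from typing import List, Tuple


def split_indices(
    num_graphs: int, seed: int, train_pct: int = 80, val_pct: int = 10
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle graph indices with a fixed seed and cut them into three parts."""
    indices = list(range(num_graphs))
    random.Random(seed).shuffle(indices)       # local generator, reproducible
    n_train = num_graphs * train_pct // 100
    n_val = num_graphs * val_pct // 100
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]       # remainder goes to the test set
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(100, seed=0)
```

Calling the function again with the same seed returns the same three lists, so all layer types are trained and evaluated on identical partitions.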

Datasets
All the considered methods were empirically validated on seven commonly adopted graph classification benchmarks. Namely, we used four datasets modeling bio-informatics problems: NCI1 [35], PROTEINS [1], D&D [7] and ENZYMES [1]. NCI1 involves chemical compounds represented by their molecular graph, where node labels represent the atom type, and bonds correspond to edges. In NCI1, the graphs represent anti-cancer screens for cell lung cancer. The remaining datasets, PROTEINS, D&D and ENZYMES, involve graphs that represent proteins. Amino acids are represented by nodes, and edges connect amino acids that are less than 6Å apart in the protein. All the prediction tasks are binary classification tasks, except for the ENZYMES dataset, where the task is a multi-class classification (six classes). We further considered three large social graph datasets: COLLAB, IMDB-B, IMDB-M [37]. In COLLAB, each graph represents a collaboration network of a corresponding researcher with other researchers from three fields of physics. The task consists of predicting which of the three physics fields the researcher belongs to. IMDB-B and IMDB-M are composed of graphs derived from actors/actresses who played in different movies collected on IMDB, together with the movie genre information. Each graph has a target that represents the movie genre. IMDB-B models a binary classification task, while IMDB-M contains graphs belonging to three classes. In contrast to the bio-informatics datasets, the nodes contained in the social datasets do not have any associated label, and therefore, only the graph topology is considered.

Results and Discussion
The results of our experiments are presented in Table 2. We can start by noticing that the two existing multiplication-based methods (GNN-FiLM and GAT) are not the best-performing methods on any of the considered datasets. This observation reinforces our intuition that research is still required in multiplication-based graph convolutions to achieve competitive performance. As for the proposed methods, at least one version of the proposed MI-GNNs is the best-performing method in six of the eight datasets.
The GCN performs well only in the COLLAB dataset, which is the dataset that has the highest edges/graph ratio. We argue that the normalized adjacency matrix of the GCN is more robust for such cases, whereas the other multiplicative or additive operators are penalized by the higher average degree. The GraphConv has the highest accuracy only for the NCI1 dataset; however, MI-GNN-v1 and -v2 perform similarly. MI-GNN-v1 and MI-GNN-v2 are the best-performing methods on two datasets each. MI-GNN-v3 performs on par with MI-GNN-v1 on IMDB-B (but with a higher variance) and is the best-performing method on IMDB-M. Additionally, we analyzed whether the improved performances of our implementation of the MI-GNNs also translate into shorter training times. In Figure 1, we display each dataset's accuracy and total training time until the best validation epoch is reached. In the plot, we also report the Pareto frontier, i.e., the set of all Pareto-efficient points. In our case, those points refer to the methods for which no improvement in one dimension (either accuracy or training efficiency) is possible without losing performance on the other dimension. Such points can be informally interpreted as the methods with the best time/accuracy trade-off. We can see that the MI-GNNs are often present in the Pareto front, meaning that not only do they achieve the best performances most of the time, but they also require less or comparable time than the baselines. Notice that on many datasets, some methods are extremely efficient (e.g., GCN on COLLAB, IMDB-M, and PROTEINS) but perform very poorly. Even though those points are part of the Pareto front because of the low training times, they are not interesting solutions because of their degraded predictive performance. We want to mention that GCN turns out to be the slowest method because, by default, it does not store the normalized adjacency matrix. For this reason, its training could be tweaked and sped up; however, its predictive performances would not be altered.
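The Pareto front used in this analysis can be computed with a simple dominance check. The sketch below assumes each method is summarized by a (training time, accuracy) pair, with lower time and higher accuracy preferred; the method names and numbers are placeholders, not results from the paper.

```python
from typing import Dict, List, Tuple


def pareto_front(results: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the methods not dominated in (lower time, higher accuracy)."""
    front = []
    for name, (time, accuracy) in results.items():
        dominated = any(
            other_time <= time
            and other_acc >= accuracy
            and (other_time < time or other_acc > accuracy)
            for other, (other_time, other_acc) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Placeholder values: (training time in seconds, accuracy).
results = {
    "model_a": (10.0, 0.70),
    "model_b": (20.0, 0.80),
    "model_c": (25.0, 0.75),   # slower and less accurate than model_b
    "model_d": (5.0, 0.50),    # very fast but inaccurate
}
front = pareto_front(results)
```

Here `model_d` stays on the front only because it is the fastest, which mirrors the remark above that some Pareto-efficient points are not interesting solutions in practice.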

Statistical Significance of the Results
Inspired by the analysis of [6], we investigated the performances of our proposed models beyond a simple but naive maximum-accuracy benchmark. Indeed, when comparing multiple classifiers on multiple datasets, one should apply rigorous statistical hypothesis testing before assessing whether the improvement is statistically significant. Additionally, when testing multiple hypotheses simultaneously, multiplicity issues arise and one should adopt the proper corrections.
The Friedman test [30] is a non-parametric test that makes no assumptions about the distribution and variances of the samples. For each dataset and for each configuration, it ranks the accuracies of all the models. Then, it computes a statistic $\chi^2_F$ under the null hypothesis, which states that all the models are equivalent and their ranks should be random. In our case, this test gives a $p$-value $< 0.01$, so the null hypothesis is rejected, and we can proceed with a post-hoc test to tell which algorithms perform the best. To calculate the statistical significance of the pairwise comparisons between the models, we used the Conover post-hoc test for unreplicated blocked data [4], where the $p$-values are adjusted with the step-down method using the Sidak corrections [31]. Other common $p$-value adjustments yielded equivalent outcomes. The outcome of this analysis is presented with the critical difference (CD) diagram in Fig. 2. This plot displays the averages of the normalized ranks of the models among all the configurations. On the x-axis, 1 would stand for a model that always scores better; on the contrary, a model at 0 would always be the last one in the rankings. The groups that could not be statistically deemed different by the Conover test are linked by a horizontal crossbar. We can see that GraphConv, MI-GNN-v1, and -v2 are significantly better ranked than all the other models. While the MI-GNN-v1 has the highest average rank of 0.71, indicating that it tends to be the best model more frequently for a given configuration and dataset, there is insufficient statistical evidence to confirm this conclusion. Therefore, conducting experiments on additional datasets would be necessary to endorse this assessment. Nevertheless, this shows that MI-GNN-v1 is a valid alternative to GraphConv, and all the parameters and settings being equal, merely replacing the additive term with the Hadamard product can lead to improved performances. Moreover, the CD diagram shows us that MI-GNN-v3 is, on average, in the middle of the rankings despite achieving the best accuracy in PROTEINS and ENZYMES. This suggests that, when evaluating new machine learning models, it is crucial to go beyond the maximum-accuracy rationale whenever possible: only by testing whether the new model performs statistically better than the baselines for multiple configurations can one ensure that such improvements are significant and applicable across different scenarios.
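The ranking step and the Friedman statistic can be sketched in plain Python as follows. This is a textbook version without tie correction and without the Conover post-hoc test; it uses the convention that rank 1 is the best model, which differs from the normalized ranks shown in Fig. 2. The accuracy table is made up for illustration.

```python
from typing import List, Sequence, Tuple


def rank_row(scores: Sequence[float]) -> List[float]:
    """Rank the models in one block (1 = highest score, ties share the mean rank)."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for t in range(pos, end + 1):
            ranks[order[t]] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(table: Sequence[Sequence[float]]) -> Tuple[List[float], float]:
    """Mean rank per model and Friedman chi-square (no tie correction).

    Each row of `table` is one block (dataset/configuration pair),
    each column is one model.
    """
    n_blocks, n_models = len(table), len(table[0])
    rank_sums = [0.0] * n_models
    for row in table:
        for j, rank in enumerate(rank_row(row)):
            rank_sums[j] += rank
    mean_ranks = [s / n_blocks for s in rank_sums]
    chi2 = (12 * n_blocks / (n_models * (n_models + 1))) * (
        sum(r * r for r in mean_ranks) - n_models * (n_models + 1) ** 2 / 4
    )
    return mean_ranks, chi2


# Made-up accuracies: 3 blocks (rows) x 3 models (columns).
table = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.70, 0.75, 0.65],
]
mean_ranks, chi2 = friedman_statistic(table)
```

The statistic is then compared against a chi-square distribution with (number of models minus one) degrees of freedom to obtain the $p$-value; that step and the post-hoc correction are left to a statistics library.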

Open Graph Benchmark
To further validate our approach, we additionally evaluated the performances of the MI-GNNs on the molhiv dataset belonging to the Open Graph Benchmark (OGB) [15]. This dataset is made of 41 127 graphs, an order of magnitude more than the ones previously considered, and it follows its pre-defined pre-processing and evaluation pipeline. Due to its complexity and time constraints, we restricted our experimental setup to a smaller hyper-parameter grid and fewer models. For these reasons, it cannot be analyzed along with the other datasets and the procedure described in the previous section.

GRADIENT ANALYSIS
Pursuing the goal of characterizing the strengths and the weaknesses of the application of multiplicative integration in GC operators, we theoretically analyze the gradient of the various proposed versions of MI-GNN and we compare them with the gradient of the most similar additive model, the GraphConv. We denote as $\mathbf{H}^{(i)} \in \mathbb{R}^{m_i \times n}$ the matrix of all the nodes' embeddings at layer $i$, and as $\mathbf{H}^{(i+1)} \in \mathbb{R}^{m_{i+1} \times n}$ the same matrix at the following layer $i+1$. Comparing the derivatives of the GraphConv and of the MI-GNN-v1 w.r.t. the weights, it is worth noticing how the use of MI instead of an additive block leads the gradient with respect to $\mathbf{W}$ to be influenced by $\mathbf{W}_\Sigma$ and vice-versa. Moreover, in the MI-GNN, the gradient of the weights that multiply the current node $v$ also depends on the adjacency matrix $\mathbf{A}$, making it possible to carry information about the neighbors when learning $\mathbf{W}$. This enriched gradient for the first weight matrix of Eq. 4 implies that the neighborhood will directly influence the projection of the node $v$. We can notice a similar effect comparing the gradient of the GraphConv and the MI-GNN-v1 w.r.t. the node embeddings of the previous layer.
For what concerns the MI-GNN-v2, we have three weight matrices (see Eq. 6). The gradients of the hidden representations w.r.t. $\mathbf{W}$ and $\mathbf{W}_\Sigma$ are the same as reported in Eq. 12, so the considerations made for the MI-GNN-v1 also hold for this second version. Interestingly, the gradient w.r.t. $\mathbf{W}_\odot$ keeps the same capability of conveying information about the current node and the neighborhood as in the case of the other two weight matrices. Differently from the gradient of the hidden representation with respect to the previously considered $\mathbf{W}$ and $\mathbf{W}_\Sigma$ parameter matrices, in this case, the gradient is not influenced by the other weights of the model.
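The coupling between the two weight matrices can be seen in a short per-node derivation. This is a sketch under stated assumptions rather than the paper's matrix-form result: it assumes the pre-activation of MI-GNN-v1 for node $v$ has the form $\mathbf{y}_v = (\mathbf{W}\mathbf{h}_v + \mathbf{b}) \odot (\mathbf{W}_\Sigma \mathbf{s}_v + \mathbf{b}_\Sigma)$ with $\mathbf{s}_v = \sum_{u \in \mathcal{N}(v)} \mathbf{h}_u$, it writes $\mathbf{z}_v$ for the GraphConv pre-activation, and it omits the Jacobian of $\sigma$. The symbols $\mathbf{y}_v$, $\mathbf{z}_v$, $\mathbf{s}_v$ and the index letters are introduced here for illustration.

```latex
\begin{align}
% GraphConv: z_v = W h_v + W_\Sigma s_v + b
\frac{\partial z_{v,a}}{\partial W_{cd}} &= \delta_{ac}\, h_{v,d},
&
\frac{\partial z_{v,a}}{\partial (W_\Sigma)_{cd}} &= \delta_{ac}\, s_{v,d}, \\
% Assumed MI-GNN-v1 form: y_v = (W h_v + b) \odot (W_\Sigma s_v + b_\Sigma)
\frac{\partial y_{v,a}}{\partial W_{cd}}
  &= \delta_{ac}\, h_{v,d}\, \left(\mathbf{W}_\Sigma \mathbf{s}_v + \mathbf{b}_\Sigma\right)_a,
&
\frac{\partial y_{v,a}}{\partial (W_\Sigma)_{cd}}
  &= \delta_{ac}\, s_{v,d}\, \left(\mathbf{W}\mathbf{h}_v + \mathbf{b}\right)_a .
\end{align}
```

In the additive case each derivative involves only its own input, whereas in the multiplicative case the derivative w.r.t. $\mathbf{W}$ is scaled by the neighborhood projection (which contains $\mathbf{W}_\Sigma$ and, through $\mathbf{s}_v$, the adjacency structure), and vice-versa.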
Considering the MI-GNN-v3, the derivative w.r.t. the previous layer is highly influenced by the ReLU function applied to the projection of the node $v$ and the projection of the neighborhood. Recall that the ReLU function is used to avoid instability during the training. Let us start by considering the gradient of the hidden representation of a layer w.r.t. the representation of the layer before, where $\Theta(\cdot)$ denotes the derivative of the ReLU. Unlike the other versions of MI-GNN, this one also considers the log of the ReLU activation of the neighbors' embeddings and the ReLU derivative applied to the embedding of the current node $v$. In particular, for $\mathbf{W}_{\log}$, this behavior ensures that the gradient always has a positive value.
We are committed to ensuring transparency and reproducibility in our research. To facilitate this, all the detailed step-by-step computations, along with the relevant code and experimental analysis, are made openly accessible and available online.

CONCLUSIONS AND FUTURE WORKS
This paper analyzed how Multiplicative Integration (MI) can be exploited in defining graph convolution operators. We proposed three variants that explore different ways to apply MI on graph-structured data, which we dub MI-GNN. We empirically evaluated the three versions of MI-GNN on eight benchmark classification datasets, adopting a fair experimental setting and analyzing the results with a solid statistical setting. The experimental assessment showed competitive results of MI-GNNs in terms of accuracy, while MI-GNNs tend to show a reduced training time compared to competing models.
We also analyzed how the use of MI influences the training with a theoretical study of the gradients, which showed the capability of the MI-GNN models to convey structural information when computing the gradients to update the weights of the model.
Moreover, the gradient flow of the MI-GNN was influenced by the interaction between the different weights of the GC, showing the capability to convey more information compared to the widely adopted GraphConv.
In the future, we plan to study how MI can be exploited to design novel architectures for learning in graph domains. In particular, we will explore the application of MI in defining pooling layers to compute a graph-level representation of the input. Moreover, we also plan to explore whether the richer gradient flow of the MI-GNN can be helpful in graph continual learning settings, where it is particularly useful to consider the dynamic information of the neighbors (that changes over time) during training. Finally, we intend to extend the applicability of our approach beyond graph classification tasks and explore its effectiveness in the domain of node classification.

Figure 1: Distribution of the training times of the best performing models w.r.t. their accuracy. The MI-GNNs are marked with a cross, and the other baselines are marked with a rounded point. The light-blue area (left side) helps to identify the Pareto front. For ogbg-molhiv, the AUC is reported on the vertical axis.

Figure 2: Critical Difference diagram of the average score ranks.

Table 2: Accuracy and standard deviation, in percentages, on the test set for the best-validated models on all the datasets. The best performances are highlighted in boldface.
Table 3 reports the best AUC on the test set, training time, and architecture for the model with the highest AUC score on the validation set. MI-GNN-v2 has the best performance and requires fewer parameters; this supports our intuition that MI is able to grasp relevant complex interactions among nodes. It is worth noticing that such an AUC score would place our model in the top 20 of the leaderboard for this dataset.
$\delta$ denotes the Kronecker delta function. In the following, we use Einstein's notation of summation over repeated indices: $A_{ij} = \sum_k B_{ik} C_{kj} = B_{ik} C_{kj}$. For the GraphConv, the derivatives w.r.t. the weights are computed omitting the Jacobian of $\sigma$ and the bias term. $\Theta(\cdot)$ is the derivative of the ReLU. If we consider the gradient w.r.t. the weights $\mathbf{W}$, $\mathbf{W}_{\log}$, we can notice that the gradient takes into account the interaction between them, as well as the adjacency matrix, through a term of the form $\mathbf{H} \cdot \log \mathrm{ReLU}(\mathbf{H})\, \mathbf{A}$.

Table 3: Experimental results and architecture for the best-validated model on the ogbg-molhiv dataset. Columns: GNN Type, Pre-MLP layers, GC-layers, Post-MLP layers, Hidden units, # Param.