Geometric Matrix Completion via Sylvester Multi-Graph Neural Network

Despite the success of Sylvester equation-based methods on various graph mining applications, such as semi-supervised label learning and network alignment, several limitations remain. The Sylvester equation's inability to model non-linear relations and its inflexibility in tuning towards different tasks restrict its performance. In this paper, we propose an end-to-end neural framework, SYMGNN, which consists of a multi-network neural aggregation module and a prior multi-network association incorporation learning module. The proposed framework inherits the key ideas of the Sylvester equation, and meanwhile generalizes it to overcome the aforementioned limitations. Empirical evaluations on real-world datasets show that the instantiations of SYMGNN overall outperform the baselines on the geometric matrix completion task, and its low-rank instantiation further reduces memory consumption by 16.98% on average.


Introduction
The Sylvester equation plays a central role in various applications in applied mathematics [Golub et al., 1979] [Wachspress, 1988], systems and control theory [Benner, 2004], machine learning [Agovic et al., 2011], and graph mining [Li et al., 2021]. In graph mining particularly, the Sylvester equation has shown its applicability in numerous multi-network mining tasks, such as network alignment [Zhang and Tong, 2016], social recommendation [Du et al., 2021], and semi-supervised label learning [Chen et al., 2008].
Despite its succinct mathematical formulation, elegant theoretical properties, and numerous efficient solvers, the Sylvester equation has several limitations when applied to multi-network mining. First, real-world network data contains various heterogeneous features. However, it is nontrivial to directly incorporate these features into the classic Sylvester equation formulation. Second, in the task of multi-network association, the classic Sylvester equation essentially calculates a linear transformation of the observed prior multi-network association matrix. The non-linear relation between the prior knowledge and the final solution cannot be captured by the classic Sylvester equation. Third, the Sylvester equation solver is often separated from the downstream task learning in many graph mining problems, and thus the solution of the Sylvester equation has to be further adapted towards different multi-network mining tasks. For example, in network alignment, the solution matrix of multi-network association is first calculated by the equation; then, a soft/hard alignment method is conducted as an extra post-processing step, such as greedy match [Zhang and Tong, 2016]. The equation cannot be trained or tuned in an end-to-end fashion as modern neural networks can, and consequently the performance of the downstream tasks might be suboptimal. A natural question is: how can we get the best of both the traditional Sylvester equation formulation and neural network models?
In this paper, we propose a multi-graph neural network framework, SYMGNN, in order to generalize the traditional linear Sylvester equation towards an end-to-end neural network model. Specifically, we focus on the geometric matrix completion task, and elucidate two instantiations of the SYMGNN framework. Our proposed approach bears four distinctive advantages compared with both the Sylvester equation and the existing neural models targeted at geometric matrix completion. First, the proposed framework is a general form, able to incorporate network features and flexible enough to be instantiated towards different downstream tasks. Second, the neural design of the model leverages the attention mechanism, so that the proposed model incorporates both within- and cross-network attention. This in turn helps increase the model expressiveness, capture the non-linear relations between input features, and learn compatible node representations across different networks. Third, the instantiations of the proposed framework can be trained in an end-to-end fashion, which directly adapts the solution generation module to the downstream prediction module. Fourth, for geometric matrix completion, two instantiations are provided, based on explicitly learning the multi-network association by 2-dimensional convolution, and on learning low-dimensional representations for the separate input networks, respectively. The low-dimensional instantiation further reduces the model's space complexity.
The notations used throughout the paper are summarized in Table 1. Generally, we use bold uppercase letters to represent matrices, bold lowercase letters to represent vectors, and lowercase or uppercase letters in regular font for scalars.
Before giving the definition of the GNN-based neural Sylvester equation in Section 3, we first provide some preliminaries on the traditional Sylvester equation and Graph Neural Networks, followed by a formal definition of geometric matrix completion.

Preliminaries
A - Sylvester Equation for Multi-network Mining. Given two networks represented as G1 = (A1, F1) and G2 = (A2, F2), and an anchor multi-network association matrix H, which denotes the prior knowledge of the multi-network node associations, the Sylvester equation for multi-network mining is defined as follows [Du and Tong, 2018]:

X = α Ã2 X Ã1^T + (1 − α) H,    (1)

where Ã1 and Ã2 are the symmetrically normalized adjacency matrices of the input networks. X represents the cross-network node association scores which the equation aims to calculate. The scalar α ∈ (0, 1) weights the multi-network association aggregation term (i.e., Ã2 X Ã1^T) against the prior knowledge term (H). Due to the normalization of A1 and A2, the linear system corresponding to Eq. (1) has a positive semi-definite coefficient matrix, which guarantees the existence of a unique solution for Eq. (1). Solving Eq. (1) is often time-consuming. A straightforward iterative method to solve Eq. (1) is the fixed-point iteration. A more efficient method with linear time and space complexity is proposed in [Du and Tong, 2018].
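The fixed-point iteration mentioned above can be sketched in a few lines of numpy. This is an illustrative implementation of Eq. (1), not the efficient solver of [Du and Tong, 2018]; function names are ours:

```python
import numpy as np

def sym_normalize(A):
    # Symmetric normalization: D^{-1/2} A D^{-1/2}.
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def sylvester_fixed_point(A1, A2, H, alpha=0.5, tol=1e-8, max_iter=1000):
    # Iterate X <- alpha * A2~ X A1~^T + (1 - alpha) * H until convergence.
    # H (and hence X) has shape (n2, n1). Since alpha < 1 and the
    # normalized adjacencies have spectral radius <= 1, this contracts.
    A1n, A2n = sym_normalize(A1), sym_normalize(A2)
    X = H.copy()
    for _ in range(max_iter):
        X_new = alpha * (A2n @ X @ A1n.T) + (1 - alpha) * H
        if np.abs(X_new - X).max() < tol:
            return X_new
        X = X_new
    return X
```

Since the update is a contraction with factor at most α, the iteration converges to the unique solution mentioned above.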
The formulation of this equation for multi-network mining (Eq. (1)) enjoys several distinctive advantages, summarized as follows. Firstly, the existence and uniqueness of the solution X can be theoretically guaranteed; furthermore, various efficient and scalable solvers exist for the solution. Secondly, the solution X can be seen as a fixed point of the equation and can be obtained by iteratively evaluating Eq. (1). Compared to existing neural models, which might contain a number of hidden layers, there is no need to save the hidden states/representations. Thirdly, reaching the fixed point is theoretically equivalent to repeating the recurrence implied by Eq. (1) infinitely many times, so the formulation is able to leverage long-range dependencies when solving for X.
However, despite the advantages and effectiveness in various tasks, this formulation generally has several limitations, summarized as follows. Firstly, the numerical features of the nodes cannot be effectively utilized for calculating X. Secondly, X can be seen as a linear transformation of the prior knowledge matrix H; the potential non-linear relationship between them cannot be captured by this formulation. Thirdly, since the equation is not learnable or tunable, the solution X must always be adapted to a target downstream task by another learning model, not in an end-to-end fashion. This might result in suboptimal performance for the downstream task.

B - Graph Neural Networks. Graph Neural Networks (GNN) are powerful deep learning models for network data. The basic idea of a GNN model is to learn node representations via learnable aggregation, in which node features are accumulated and transformed from the neighborhood features. Given a network G = (A, F), where A ∈ R^{n×n} is the adjacency matrix of G, and F ∈ R^{n×d} is the feature matrix with d being the dimension of features, a representative GNN aggregation step at time step t can be written as:

V^{(t+1)} = φ(Ã V^{(t)} W + V^{(t)} Ω),    (2)
where Ã is the normalized adjacency matrix with added self-loops, and W is a learnable parameter matrix for the aggregated features. Different GNN models adopt different feature aggregation mechanisms. The GCN model [Kipf and Welling, 2016] augments the adjacency matrix with self-links and applies renormalization; it also sets Ω = 0. The GAT model [Veličković et al., 2017] utilizes the self-attention mechanism in feature aggregation. The GIN model [Xu et al., 2018] adopts an MLP layer after the aggregation of the hidden representations of nodes to improve the discriminative ability of the GNN model.

C - Convolutional Graph Embedding. Proposed in [Yao et al., 2018], the convolutional graph embedding (CGE) model is a GNN-based single-network embedding model. Different from the traditional GCN [Kipf and Welling, 2016], which simply sums up the hidden representations of all neighbors, the aggregation weights for center nodes and neighbor nodes in CGE are differentiated and learnable. Specifically, the output of the (l + 1)-th CGE layer can be represented as:

V^{(l+1)} = φ((Ã + diag(σ)) V^{(l)} Θ^{(l)}),    (3)

where V^{(l+1)} and V^{(l)} are the node representation matrices of the (l + 1)-th and l-th layers, respectively, σ is the learnable weight vector for the self-connections, Θ^{(l)} is the learnable weight matrix, and φ() is an activation function. Using CGE adds more expressiveness to the model compared with GCN, and we will elaborate on how to leverage it in Section 3.
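As a concrete sketch, one plausible reading of a CGE-style layer — the normalized adjacency augmented by a learnable self-connection vector σ, then a linear map and activation — can be written as follows (an illustrative assumption, not the authors' code):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def cge_layer(A_norm, V, sigma, Theta):
    # CGE-style aggregation with learnable self-connection weights:
    # each node mixes its own features (scaled by sigma) with its
    # neighbors' features, then applies a linear map and activation.
    A_hat = A_norm + np.diag(sigma)   # differentiated center-node weights
    return relu(A_hat @ V @ Theta)
```

With sigma fixed to the self-loop weights used by GCN's renormalization, this reduces to an ordinary GCN layer.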

Geometric Matrix Completion
Different from the traditional matrix completion problem, geometric matrix completion additionally involves two networks which reflect the topological relations between the nodes of the two entity sets. Specifically, the problem is defined as follows.
Problem 1 GEOMETRIC MATRIX COMPLETION. Given: two networks G1 and G2 with node features, and the partially observed multi-network association H of the nodes in G1 and G2; Output: the unobserved entries in H.

Proposed Model
In this section, we elaborate on our proposed framework of multi-graph neural networks. First, we present a general framework of the neural Sylvester equation, followed by two GNN-based instantiations of the general framework targeted at geometric matrix completion. Second, we introduce the training method in detail. Finally, we provide an analysis of the two proposed instantiations in terms of computational complexity.

Sylvester Multi-Graph Neural Network Framework
The goal of the proposed SYMGNN framework is to leverage the advantages of the traditional Sylvester equation while overcoming its limitations. First, if we view the Sylvester equation in Eq. (1) from an iterative perspective, the first term on the right-hand side linearly aggregates the multi-network node association X for the updated X, while the second term incorporates the prior multi-network association message H into the updated X. Second, following the ideas of the traditional Sylvester equation, we identify the two key modules of SYMGNN: (1) the multi-network aggregation learning module; and (2) the prior multi-network association incorporation learning module. The general framework is represented in Eq. (4).
X = φ(α · a_W(F1, F2, Ã1, Ã2) + (1 − α) · b_Θ(F1, F2, H)),    (4)

where a_W() and b_Θ() are two neural modules with parameters W and Θ, the weighting scalar α ∈ [0, 1], and φ() is a non-linear activation function. X is the multi-network association output of the SYMGNN framework, and it can be further fed into a neural network for adapting towards a downstream task in an end-to-end fashion. As we can see, this framework is a neural generalization of the Sylvester equation in Eq. (1): when the neural modules a_W and b_Θ are linear aggregation functions, Eq. (4) degenerates to the classic Sylvester equation, Eq. (1). Numerous instantiations exist for different downstream tasks. Next, we discuss how to instantiate this framework for geometric matrix completion (Problem 1).
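To make the degeneration claim concrete, here is a small sketch (module and function names are ours, not the paper's): with a linear a_W, a constant b_Θ = H, and identity φ, one application of the framework update equals one fixed-point step of Eq. (1).

```python
import numpy as np

def symgnn_step(a_module, b_module, alpha, phi):
    # One generic update of the framework:
    # X = phi(alpha * a_W(X) + (1 - alpha) * b_Theta(X)).
    def step(X):
        return phi(alpha * a_module(X) + (1 - alpha) * b_module(X))
    return step

def make_classic_step(A2n, A1n, H, alpha):
    # Linear instantiation: recovers the classic Sylvester update.
    return symgnn_step(
        a_module=lambda X: A2n @ X @ A1n.T,   # linear aggregation term
        b_module=lambda X: H,                 # prior-association term
        alpha=alpha,
        phi=lambda x: x,                      # identity activation
    )
```

Replacing the two lambdas with learnable GNN modules and φ with a non-linearity yields the neural generalization described above.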

Base Model for Geometric Matrix Completion
Here, we present our base model for the geometric matrix completion problem. To instantiate a_W(F1, F2, A1, A2), we design two parallel layers which adopt a CGE-based aggregation layer and an attention-based aggregation layer, respectively. The motivation is to learn 2-d hidden representations for the multi-network association solution. To achieve this, we design two types of 2-d convolutional non-linear aggregation modules, namely the adjacency matrix-based CGE neural aggregation, and the attention-based neural aggregation. To be specific, given G1 = {A1, F1} and G2 = {A2, F2} with n1 and n2 nodes respectively, we first apply the learnable self-connection parameters from Eq. (3) to A1 and A2 to obtain the updated adjacency matrices; the goal is to adopt the learnable self-connection weights from CGE to improve the expressiveness of the model:

Â1 = Ã1 + diag(σ1),    (5a)
Â2 = Ã2 + diag(σ2),    (5b)

where σ1 and σ2 are the learnable self-connection weights of the first and second networks, respectively. Before applying the multi-network aggregation layers, the node features of G1 and G2 are fed into MLP layers to obtain the hidden features U1 = MLP1(F1) and U2 = MLP2(F2). In the CGE-based multi-network aggregation, the output of the l-th level aggregation is computed by a bilinear product that combines Â2, the hidden features, and Â1^T through learnable parameter matrices, followed by an activation function φ(). In practice, this bilinear form acts as a metric learning approach generalizing the Mahalanobis distance [Yoshida et al., 2021], in order to capture the feature correlation between nodes from the two different networks.
In the attention-based multi-network aggregation, the output is computed analogously with a learnable parameter matrix and the attention score matrices B1 and B2 for G1 and G2, respectively; the attention score of a node pair (i, j) ∈ G1 is calculated via the self-attention schema of Eq. (8). To instantiate b_Θ(F1, F2, H), similar to the first term a_W(F1, F2, Ã1, Ã2), we also adopt two types of parallel aggregation layers. The first is the direct neural aggregation from the prior multi-network association, and the second is via an attention schema. However, since the entries of the prior multi-network association matrix H are often real-valued or multi-class categorical ratings, it is unreasonable to directly use H for cross-network feature aggregation. Thus, the prior multi-network association-based aggregation is only adopted when H contains binary associations.
Putting everything together, as shown in Figure 1, the intermediate multi-network association matrices X1, X2, X3, X4 form the hidden representation tensor X ∈ R^{n1×n2×4} for the multi-network association solution. We apply a fully connected layer to X to obtain the final multi-network association X = bmm(X, W), where W ∈ R^{n1×4×1} is the parameter tensor.
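The fusion step X = bmm(X, W) can be sketched with einsum as a numpy stand-in for a batched matrix multiplication over the n1 dimension (an illustrative sketch, not the authors' code):

```python
import numpy as np

def fuse_channels(X_stack, W):
    # X_stack: (n1, n2, 4) intermediate association channels.
    # W: (n1, 4, 1) per-row fusion weights, as in X = bmm(X, W).
    # Returns the fused (n1, n2) association matrix.
    return np.einsum('ijc,ico->ijo', X_stack, W)[..., 0]
```

With all fusion weights equal to 1/4, this reduces to averaging the four channels.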

Low-rank Model for Geometric Matrix Completion
Instead of conducting the bilinear neural aggregation for the multi-network association directly, we can generate embeddings for the nodes of G1 and G2 separately. Similar to the base model, we consider both the direct neural aggregation from the original network topology and the neural aggregation from the within-network attentions. First, given two networks G1 and G2 with n1 and n2 nodes respectively, the node features are fed into MLP layers to obtain the hidden features U1^(h) and U2^(h), which are then fed into two parallel CGE-based and attention-based neural modules for generating the hidden representations of the node features of the two networks separately. We take U1^(h) as an example; the process for U2^(h) is similar. The updated node hidden representations after an l-layer CGE module follow Eq. (3), where Θ^(l) is the parameter matrix of the l-th layer and φ() is an activation function. After L layers, we obtain U1^(L). The updated node hidden representations after the attention-based neural aggregation module are computed with the attention score matrix B1, calculated via Eq. (8). b_Θ(F1, F2, H) is also instantiated for G1 and G2 separately. Here, we can adopt a similar prior multi-network association-based neural aggregation when the prior H denotes binary or multi-class relations. For H with entries of K classes, we apply K neural networks, in which each neural network aggregates one class of nodes.
where H_i is the prior multi-network association matrix which only contains the entries of the i-th class, and Θ_i^(p) and Θ^(c) are learnable parameters. The V_i^(p) for all classes are then concatenated and fed into an MLP for the node representation. The cross-network attention matrix C is calculated by the same method as in Eq. (9b).
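A minimal sketch of the class-wise prior aggregation described above, under our assumption that each class-specific network aggregates the other network's hidden features through that class's observed entries (the per-class Theta matrices stand in for Θ_i^(p); the shared Θ^(c) and the final MLP are omitted):

```python
import numpy as np

def classwise_prior_aggregation(H, U_other, Thetas, phi=np.tanh):
    # H: (n1, n2) integer rating matrix with classes 1..K (0 = unobserved).
    # U_other: (n2, d) hidden features of the other network's nodes.
    # Thetas: list of K (d, r) per-class parameter matrices (assumed names).
    # Each class-specific network aggregates only that class's entries.
    parts = []
    for i, Theta in enumerate(Thetas, start=1):
        H_i = (H == i).astype(float)           # entries of the i-th class
        parts.append(phi(H_i @ U_other @ Theta))
    return np.concatenate(parts, axis=1)       # concatenated, then an MLP
```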
Putting everything together, we now have four representation matrices for each network. We adopt another fully connected layer to obtain the final representation U1 (and likewise U2). The predicted multi-network association between two nodes is calculated as the dot product of the corresponding row vectors of the resulting representation matrices U1 and U2.

Training
For matrix completion, we adopt the Mean Squared Error (MSE) loss for both instantiations, where the mask matrix M contains entries in {0, 1}, with 1 indicating the positions of the observed prior multi-network associations. For the low-rank instantiation, the dot products of the node representations are used as the final solutions. For the regularization of the model parameters, we adopt weight decay with a decay factor of 0.01, as we find that it performs slightly better than L2 regularization. We use the Adam optimizer, as it overall shows the most stable training.
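The masked MSE objective can be sketched as follows (an illustrative numpy version; in training it would be computed on tensors with gradients):

```python
import numpy as np

def masked_mse(X_pred, H, M):
    # MSE restricted to observed entries: M is a 0/1 mask with 1 at
    # positions where a prior association (rating) is observed.
    return np.sum(M * (X_pred - H) ** 2) / np.sum(M)
```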

Complexity Analysis
For notational simplicity, assume that both input networks contain n nodes and m edges. Suppose the feature dimension is d, the number of observed ratings is m′, and the dimension of the node representations is r < d. For the base model, the major computation lies in the within-network and cross-network attention calculation as well as the aggregation. From Eq. (8), the within-network attention aggregation costs O(n²d). From Eq. (9b), the cross-network attention aggregation also costs O(n²d). Eq. (3) costs O(Lmd + d²n); since usually r, d ≪ m, n, this cost is dominated by the attention-based neural aggregation. The overall time complexity of the base model is O(#iter · n²d), where #iter is the total number of iterations. The space complexity is O(n²) due to the storage of the attention score matrices. Similarly, for the low-rank model, the overall time and space complexities are also O(#iter · n²d) and O(n²). However, for the low-rank model, if we do not apply the within-network and cross-network attention-based neural aggregation, the time and space complexities reduce to O(L(md + nd²) + m′d) and O(m′ + n(d² + r²)), respectively. As we will see in Section 4, this reduces the running time with only a slight performance drop. Since the base model needs to store the intermediate multi-network association matrices, its space complexity cannot be further reduced even if the attention-based neural aggregation is dropped.

Experiments
In this section, we present experimental results on real-world benchmark datasets to show the effectiveness of the proposed models.

Experimental Setting
A - Datasets and Pre-processing. The benchmark datasets used in the experiments are summarized in Table 2. Among them, ML-3K and Flixster have both user-user and item-item interaction networks. Douban only contains a user-user interaction network, and YahooMusic only contains an item-item interaction network; for these two datasets, we use the identity matrix as the adjacency matrix of the missing network. For ML-100K and ML-1M, we construct the user-user and item-item interaction networks by a k-nearest-neighbors search over their features, where k is treated as a hyperparameter of our model. All the datasets include multi-class categorical ratings. For the training/testing split, we use the same partition adopted by existing methods, such as [Yao et al., 2018] [Monti et al., 2017].

B - Baseline Methods. We use five baselines in our comparison, including the traditional Sylvester equation (Sylv.) [Wachspress, 1988], and recent neural network-based and GNN-based methods: IGMC [Zhang and Chen, 2019], GC-MC [Berg et al., 2017], PinSage [Ying et al., 2018], and sRGCNN [Monti et al., 2017].

C - Experimental and Hyperparameter Settings. For the effectiveness comparison, we tune the hyperparameters of the model based on the best performance on the validation set. We use 2-layer GCE and attention aggregation in both instantiations on all datasets except ML-100K and ML-1M, on which the base model uses 3-layer GCE and attention aggregation. For the k-NN method used to generate the social and item-item interaction networks on ML-100K and ML-1M, we use k = 10 for the low-rank model and k = 12 for the base model. The sensitivity of k is further studied in the ablation study. The metric for comparison is the widely adopted root mean squared error (RMSE).
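The k-NN graph construction used for ML-100K and ML-1M can be sketched as follows. The paper does not specify the distance metric, so plain Euclidean distance is an assumption here, as is the symmetrization step:

```python
import numpy as np

def knn_adjacency(F, k):
    # Build a symmetric k-NN adjacency from node features F of shape (n, d),
    # connecting each node to its k nearest neighbors in Euclidean distance.
    n = F.shape[0]
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)  # squared distances
    np.fill_diagonal(d2, np.inf)                         # exclude self-matches
    A = np.zeros((n, n))
    idx = np.argsort(d2, axis=1)[:, :k]                  # k nearest per node
    rows = np.repeat(np.arange(n), k)
    A[rows, idx.ravel()] = 1.0
    return np.maximum(A, A.T)                            # symmetrize
```

The resulting 0/1 adjacency can then be symmetrically normalized before being fed to the aggregation layers.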

Effectiveness Results
The first comparison results are shown in Table 3, as these datasets are the most common benchmarks among all existing methods. The results are reported as the average of five runs. The best performances are shown in bold and the second-best are underlined. As we can observe from the table, the traditional Sylvester equation cannot achieve competitive results compared to the other neural network/GNN-based baselines, which is consistent with our discussion of its limitations: the Sylvester equation can neither effectively incorporate node features nor capture non-linear relations between the observed multi-network association and the solution. Among all the neural network-based methods, the proposed framework with the low-rank instantiation outperforms the rest of the baselines on the Douban, Flixster, and YahooMusic datasets. Flixster (U) represents the dataset using only the user-user interaction network; the performance of the proposed method drops slightly on it, which shows the importance of both the user and item interaction networks in our model. In particular, on the YahooMusic dataset, the proposed method achieves a 7.65% improvement over the best baseline. The average improvement across Douban, Flixster, and YahooMusic is 2.58%, which shows the effectiveness of the proposed models. The comparison results on the ML-100K and ML-1M datasets are shown in Table 4. As we can see, the Sylvester equation is still not competitive with the rest of the methods, while our proposed low-rank instantiation consistently performs the best over all baselines. Among the baselines, GC-MC is the closest to our methods in performance; it contains a graph encoder and a bilinear decoder architecture, which has a similar effect as our proposed GNN-based neural aggregation model. This is consistent with our intuition about the effectiveness of the cross-network feature aggregation.

Parameter Sensitivity
We mainly study the impact of the number of GCE-based neural aggregation layers and of the k value of the k-NN graph construction on the ML-100K dataset. The results are shown in Figures 2 and 3. From Figure 2, we can see that the performance of both models is relatively stable across different numbers of layers. From Figure 3, the original models exhibit more stable performance w.r.t. the k value than the models without attention-based neural aggregation (the dashed lines). This also indicates that the attention-based neural aggregation makes the model less sensitive to the graph-construction hyperparameter when the user-user/item-item interactions are not directly available.

Related work
A - Multi-network Mining. Generally, multi-network mining techniques can be categorized into traditional numerical techniques and recent neural techniques. Among the numerical methods, GT-COPR by Li et al. [Li et al., 2019b] infers multi-relations among the entities across multiple networks via a low-rank tensor-based optimization method. After that, Li et al. [Li et al., 2021] propose an optimization method and a low-rank tensor-based label propagation algorithm for multi-relation inference. Liu et al. propose a cross-network multi-relation association learning method (i.e., CGRL) for the inference of multi-network associations [Liu and Yang, 2016]. The Sylvester equation is widely adopted in solving multi-network mining tasks, such as network alignment [Zhang and Tong, 2018] [Chu et al., 2019], cross-network similarity learning [Li et al., 2019a], and social recommendation [Tang et al., 2013]. Chen et al. propose a Sylvester equation-based solver for semi-supervised multi-label learning problems [Chen et al., 2008]. Du et al. propose a Krylov-subspace-based fast solver (i.e., FASTEN) for the Sylvester equation for various multi-network mining tasks.

B - Graph Neural Network Models. Graph Neural Networks (GNN) include a broad range of deep learning models on network data. Recent advances primarily concentrate on convolutional models, such as GCN [Kipf and Welling, 2016], GAT [Veličković et al., 2017], GIN [Xu et al., 2018], and GraphSAGE [Hamilton et al., 2017]. We briefly review representative GNN models related to multi-network mining tasks and matrix completion. Multi-Graph CNN (MGCNN) by Monti et al. [Monti et al., 2017] is one of the earliest works exploring convolutional models on multi-networks for matrix completion. GC-MC by Berg et al. [Berg et al., 2017] proposes a network-based auto-encoder framework for matrix completion; the model produces latent features of users and items through message passing on the bipartite interaction networks. IGMC by Zhang et al. [Zhang and Chen, 2019] proposes an inductive matrix completion method using a GNN model on bipartite graphs induced from user-item ratings when side information is unavailable. GraphRec by Fan et al. [Fan et al., 2019] jointly capture the interactions and opinions in the user-user and user-item networks and propose a GNN-based framework for recommendation.

Figure 1: The overall illustration of the two instantiations of SYMGNN. (a): the base model; (b): the low-rank model.

Figure 2: RMSE vs. number of layers for GCE aggregation.

Figure 3: RMSE vs. k for the k-NN method in graph construction.

Table 1: Symbols and Definitions.

Table 2: The statistics of the benchmark datasets.

Table 3: RMSE comparison for geometric matrix completion.

Table 4: RMSE comparison on the ML-100K and ML-1M datasets.

The ablation study results are shown in Table 5. 'Base model (G)' and 'Base model (A)' denote the model with only GCE-based neural aggregation and the model with only attention-based neural aggregation, respectively; the low-rank model uses the same abbreviations. The values inside the parentheses denote the maximum GPU memory allocated in one epoch, measured with the same batch size (i.e., 50) for comparison.
As we can see, firstly, the original model performs the best among all variants in terms of RMSE for both the base and low-rank instantiations. Secondly, the models without attention-based neural aggregation overall consume the least GPU memory during training: on average, with only a 1.24% performance drop, they show 16.98% less memory consumption. Furthermore, compared with the baseline performances in Table 4, the variant low-rank model in Table 5 still outperforms all baseline methods.