Predicting the Silent Majority on Graphs: Knowledge Transferable Graph Neural Network

Graphs consisting of vocal nodes ("the vocal minority") and silent nodes ("the silent majority"), namely VS-Graphs, are ubiquitous in the real world. Vocal nodes tend to have abundant features and labels, whereas silent nodes have only incomplete features and rare labels: for example, on Twitter's social network, the descriptions and political tendencies of politicians (vocal) are abundant, while those of ordinary people (silent) are not. Predicting the silent majority remains a crucial yet challenging problem. Most existing message-passing GNNs assume that all nodes belong to the same domain, without considering the missing features and the distribution shift between domains, and therefore handle VS-Graphs poorly. To combat these challenges, we propose the Knowledge Transferable Graph Neural Network (KTGNN), which models distribution shifts during message passing and representation learning by transferring knowledge from vocal nodes to silent nodes. Specifically, we design a domain-adapted "feature completion and message passing" mechanism for node representation learning that preserves the domain difference, followed by a knowledge-transferable classifier based on KL-divergence. Comprehensive experiments on real-world scenarios (i.e., company financial risk assessment and political elections) demonstrate the superior performance of our method. Our source code is publicly available.


INTRODUCTION
Graph-structured data is prevalent in the real world, e.g., social networks [6,61], financial networks [13,78,84], and citation networks [35,52,75]. In many practical scenarios, the collected graph contains incomplete node features and unavailable labels due to reasons such as limited observation capacity, incomplete knowledge, and distribution shift, which we refer to as the data-hungry problem for graphs. Compared with traditional ideal graphs (Fig. 1 (a)), the nodes of such graphs (Fig. 1 (b)) can be divided into two categories according to their degree of data hunger: vocal nodes and silent nodes. We name such a graph a VS-Graph.
VS-Graphs are ubiquitous in the physical world. Take two well-known scenarios, "political election" [1,62,70] and "company financial risk assessment" [5,7,12,23,49,83], as examples. As Fig. 1 (c) illustrates, the politicians (i.e., political celebrities) in the minority and the civilians in the majority form a politician-civilian graph through social connections (e.g., following on Twitter). We can obtain detailed descriptions (attributes) and clear political tendencies (labels) for politicians (vocal nodes), while such information is unavailable for civilians (silent nodes). Meanwhile, predicting the political tendency of the majority is critical for political elections. A similar problem exists in company financial risk assessment in Fig. 1 (d), where the investment relations between listed companies in the minority and unlisted companies in the majority form the graph. Only listed companies publish their financial statements, so only their financial risk status is clear. Unlisted companies are not obligated to disclose financial statements, and it is difficult to infer their risk profile from extremely limited business information. However, assessing the financial risk of the unlisted majority is of great significance for financial security [9,20,31,47]. Overall, the vocal nodes ("the vocal minority") have abundant features and labels.
In contrast, the silent nodes ("the silent majority") have rare labels and incomplete features. VS-Graphs are also common in other real-world scenarios, such as celebrities and ordinary netizens in social networks. Meanwhile, predicting the silent majority on a VS-Graph is important yet challenging. Recently, graph neural networks (GNNs) have achieved state-of-the-art performance on graph-related tasks. Generally, GNNs follow the message-passing mechanism, which iteratively aggregates neighboring nodes' representations to update the central node's representation. However, three problems cause GNNs to fail at predicting the silent majority: 1) a data distribution shift exists between vocal nodes and silent nodes, e.g., listed companies have more assets and cash flow and therefore show a different attribute distribution from unlisted companies, which previous GNNs rarely consider; 2) the feature-missing problem cannot be solved by traditional heuristic feature completion strategies because of this distribution shift; 3) the lack of labels for the silent majority hinders learning an ideal model, and directly training on vocal and silent nodes concurrently yields poor performance on the silent majority due to the distribution shift.
To solve these challenges, we propose the Knowledge Transferable Graph Neural Network (KTGNN), which learns representations for the silent majority by adaptively transferring knowledge from vocal nodes. KTGNN takes domain transfer into consideration throughout the whole learning process, including feature completion, message passing, and the final classifier. Specifically, the Domain-Adapted Feature Complementor (DAFC) and Domain-Adapted Message Passing (DAMP) modules complement the features of silent nodes and conduct message passing between vocal and silent nodes while modeling and preserving the distribution shift. With the learned node representations from the two domains, we propose the Domain Transferable Classifier (DTC), based on KL-divergence minimization, to predict the labels of silent nodes. Different from existing transfer learning methods that aim to learn domain-invariant representations, we transfer the parameters of the classifier from the vocal domain to the silent domain rather than forcing a single classifier to capture domain-invariant representations. Comprehensive experiments on two real-world scenarios (i.e., company financial risk assessment and political elections) show that our method gains significant improvements over SOTA GNNs on silent node prediction.
The contributions of this paper are summarized as follows: • We define the practical and widespread VS-Graph and formulate the new problem of predicting the silent majority, an important and realistic AI application. • We propose KTGNN, which models the distribution shift throughout feature completion, message passing, and classification by transferring knowledge from vocal nodes to silent nodes. • Comprehensive experiments on two real-world scenarios (i.e., company financial risk assessment and political elections) show the superiority of our method, e.g., AUC improves by nearly 6% on company financial risk assessment compared to the current SOTA methods.

PRELIMINARY
In this section, we define the important terminology and concepts used in this paper.

Graph Neural Network
A message-passing GNN layer can be written as $h_v^{(l)} = \mathcal{U}^{(l)}\big(h_v^{(l-1)}, \mathcal{M}^{(l)}(\{h_u^{(l-1)} : u \in \mathcal{N}(v)\})\big)$, where $h_v^{(l)}$ is the representation vector of node $v$ at the $l$-th layer of the GNN, $\mathcal{M}$ denotes the message function that aggregates neighbors' features, and $\mathcal{U}$ denotes the update function, which takes the aggregated neighborhood messages and the central node's feature as inputs. By stacking multiple layers, GNNs can aggregate information from higher-order neighbors.
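As a minimal illustration of this scheme, the sketch below implements one message-passing layer with a mean aggregator as $\mathcal{M}$ and a linear-plus-ReLU update as $\mathcal{U}$. These concrete choices (and all names) are ours for illustration, not the paper's.

```python
import numpy as np

def message_passing_layer(H, adj, W_self, W_neigh):
    """One illustrative GNN layer: M = mean over neighbors, U = linear + ReLU.

    H:       (n, d) node representations at layer l-1
    adj:     (n, n) binary adjacency matrix
    W_self:  (d, d') weight applied to the central node's own feature
    W_neigh: (d, d') weight applied to the aggregated neighbor message
    """
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)   # avoid divide-by-zero
    M = (adj @ H) / deg                                # mean of neighbor features
    return np.maximum(H @ W_self + M @ W_neigh, 0.0)   # ReLU update
```

Stacking two such layers lets each node aggregate information from its 2-hop neighborhood, which is the mechanism KTGNN builds on.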

Problem Definition: Silent Node Classification on the VS-Graph
In this paper, we propose the problem of silent node prediction on VS-Graphs, focusing on node classification. First, we define the VS-Graph: Definition 2.1 (VS-Graph). A VS-Graph is a tuple $\mathcal{G} = (V, E, X, Y, \Psi)$, where $V = V_v \cup V_s$ is the node set, containing all vocal nodes $V_v$ and silent nodes $V_s$; $E = \{e_{ij} = (v_i, v_j) \mid v_i, v_j \in V_v \cup V_s\}$ is the edge set; $X = [X_v; X_s]$ is the attribute matrix of all nodes; $Y = [Y_v; Y_s]$ holds the class labels of all nodes ($y_i = -1$ denotes that the label of $v_i$ is unavailable); and $\Psi: V \to \{v, s\}$ is the population indicator that maps a node to its population (i.e., vocal or silent).
For a VS-Graph, the attributes and labels of vocal nodes (i.e., $X_v$, $Y_v$) are complete, whereas $Y_s$ is rare and $X_s$ is incomplete. Specifically, for a vocal node $v_i \in V_v$, its attribute vector is $x_i = [x_i^o; x_i^u]$, where $x_i^o \in \mathbb{R}^{d_o}$ is the part of the attributes observable for all nodes and $x_i^u \in \mathbb{R}^{d_u}$ is the part unobservable for silent nodes but observable for vocal nodes. For a silent node $v_j \in V_s$, its attribute vector is $x_j = [x_j^o; \mathbf{0}]$, where $x_j^o \in \mathbb{R}^{d_o}$ is the observable part and $\mathbf{0} \in \mathbb{R}^{d_u}$ is the unobservable part, initially filled with zeros in every dimension. The problem of silent node classification on the VS-Graph is then defined as follows: Definition 2.2 (Silent Node Classification on the VS-Graph). Given a VS-Graph $\mathcal{G} = (V, E, X, Y, \Psi)$ where $|V_v| \ll |V_s|$ and the attributes of the two populations (i.e., $X_v$ and $X_s$) belong to different domains with distinct distributions $\mathbb{P}_v \neq \mathbb{P}_s$, the target is to predict the labels $Y_s$ of the silent nodes $V_s$ with the support of a small set of vocal nodes $V_v$ that have complete but out-of-distribution information.
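The definitions above can be captured in a small data container. The sketch below is a hypothetical representation of our own devising (all names are ours); it only fixes the shapes and conventions stated in Definitions 2.1 and 2.2.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class VSGraph:
    """Minimal container for a VS-Graph (field names are illustrative)."""
    edges: list            # list of (i, j) index pairs, the edge set E
    x_obs: np.ndarray      # (n, d_o) observable attributes x^o, all nodes
    x_unobs: np.ndarray    # (n, d_u) unobservable attributes x^u; zeros for silent nodes
    y: np.ndarray          # (n,) labels; -1 marks "label unavailable"
    is_vocal: np.ndarray   # (n,) population indicator Psi: True = vocal, False = silent

    def features(self):
        # x_i = [x_i^o ; x_i^u] for vocal nodes, [x_j^o ; 0] for silent nodes
        return np.concatenate([self.x_obs, self.x_unobs], axis=1)
```

In this setting, `is_vocal.sum()` would be much smaller than `(~is_vocal).sum()`, matching $|V_v| \ll |V_s|$.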

EXPLORATORY ANALYSIS
"The Silent Majority" in Real-world Graphs: To demonstrate that the distribution-shift problem between vocal and silent nodes does exist in the real world, we choose two representative real-world scenarios (i.e., a political social network and a company equity network) and conduct a comprehensive analysis on the two real-world VS-Graphs. The observations show a significant distribution shift between vocal and silent nodes on their shared feature dimensions. Detailed information on the datasets is summarized in Sec. 5.1.

Single-Variable Analysis
We analyze the distributions of single variables on the real-world Company VS-Graph, which includes listed-company nodes and unlisted-company nodes (more details in Sec. 5.1). All node attributes of this dataset come from real-world companies and have practical physical meanings (e.g., register capital, actual capital, staff number), which makes them appropriate for statistical analysis along a single dimension.
Instead of directly calculating the empirical distribution, we visualize the distributions with box plots in a more intuitive manner. We select three important attributes (i.e., register capital, actual capital, staff number) and present box plots of all nodes, distinguished by their labels (i.e., risky or normal company) and populations (i.e., listed or unlisted company), in Fig. 2. We observe a significant distribution shift between listed companies (vocal nodes) and unlisted companies (silent nodes), which confirms our motivation. Since there are hundreds of attributes, we present only three typical attributes correlated with company assets due to space limitations. This single-variable analysis demonstrates a significant domain difference between vocal and silent nodes on individual attribute dimensions. To further demonstrate the domain difference across all attributes, we also visualize them as a scatter plot using the t-SNE algorithm [63]. Specifically, we first project the $d_o$-dimensional observable attributes to 2-dimensional vectors and then present the scatter plot on the 2D plane.
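The box-plot comparison boils down to comparing quartiles of log-scaled attribute values per population. The sketch below reproduces that computation on synthetic numbers of our own invention (the real dataset's values are not public here); it only illustrates how a single-variable shift shows up on the log scale used in Fig. 2.

```python
import numpy as np

def boxplot_stats(values):
    """Quartiles of log-scaled positive attribute values (as on Fig. 2's Y-axis)."""
    logged = np.log(np.asarray(values, dtype=float))
    q1, med, q3 = np.percentile(logged, [25, 50, 75])
    return {"q1": q1, "median": med, "q3": q3}

# Synthetic register-capital values (illustrative only): listed companies tend
# to have far larger capital than unlisted ones, so their box sits visibly
# higher on the log scale -- a single-variable distribution shift.
listed   = [5e8, 1e9, 2e9, 8e8, 3e9]
unlisted = [1e6, 5e6, 2e6, 8e6, 3e6]
shift = boxplot_stats(listed)["median"] - boxplot_stats(unlisted)["median"]
```

A positive `shift` between the population medians is exactly the pattern the box plots in Fig. 2 make visible.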

Multi-Variable Visualization
As shown in Fig. 3, we visualize the two VS-Graph datasets: Twitter (a political social network) and Company (a company equity network).

METHODOLOGY
We propose the Knowledge Transferable Graph Neural Network (KTGNN) to learn effective representations for silent nodes on VS-Graphs.

Model Overview
We first present an overview of KTGNN in Fig. 4. KTGNN can be trained end-to-end on the VS-Graph and is composed of three main components. Before completing features, we first partition the nodes of the VS-Graph into two sets, $V^+$ and $V^-$, where nodes in $V^+$ have complete features and nodes in $V^-$ have incomplete features; initially, $V^+ = V_v$ and $V^- = V_s$. The DAFC module then completes the features of nodes in $V^-$ by iteratively transferring the knowledge of nodes in $V^+$ to $V^-$. After each iteration, the nodes of $V^-$ whose features have been completed are moved into $V^+$. We set a hyper-parameter $K$ as the maximum number of feature-completion iterations. With a suitable $K$, the set of nodes with complete features can cover all nodes on the graph, and we have $V_v \subset V^+ \subseteq V$. Fig. 4-(a) illustrates the DAFC module. In the first iteration, the features of vocal nodes are used to complete the features of the incomplete silent nodes directly connected to them. Given an incomplete silent node $v_i \in \{V_s \cap V^-\}$, its complemented unobservable feature $\hat{x}_i^u$ is calculated (Eq. 2) from the calibrated variables $\tilde{x}^u$ of its vocal neighbors, obtained by eliminating the effects of the domain difference (Eq. 3). Incomplete silent nodes that do not directly connect to any vocal node are complemented from the second iteration onward, until the algorithm converges (the max-iteration $K$ is reached or all silent nodes have been complemented). After iteration 1, DAFC uses complete silent nodes to complement the features of the remaining incomplete silent nodes at each iteration. During this process, the set of complete nodes $V^+$ expands layer by layer, like breadth-first search (BFS), as shown in Eq. 4, where $\alpha(\cdot)$ is the neighbor importance factor and $\sigma$ is an activation function. Note that from iteration 2 onward the node set $\{\mathcal{N}(v_i) \cap V^+\} \subseteq V_s$, because all silent nodes with vocal neighbors have been complemented and merged into $V^+$ after iteration 1 (see Eq. 2); therefore the unobservable features $\hat{x}^u$ in Eq. 4 need not be calibrated by the domain-difference factor. Finally, we obtain the complemented features for all silent nodes. To guarantee that the learned unobservable domain difference (i.e., $\Delta^u$ in Eq. 3) is actually applied during domain-adapted feature completion, we add a Distribution-Consistency Loss $\mathcal{L}_{dc}$ when optimizing the DAFC module.
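The BFS-like expansion of $V^+$ can be sketched as below. This is a deliberately simplified stand-in: the paper's learned neighbor-importance weights and domain-difference calibration (Eq. 3) are replaced by a plain mean over already-complete neighbors, so only the iteration structure, not the learned completion itself, is shown.

```python
import numpy as np

def complete_features_bfs(x_unobs, neighbors, is_vocal, max_iter=2):
    """Simplified DAFC: propagate unobservable features outward from vocal nodes.

    A silent node's missing part is filled with the mean of its already-complete
    neighbors; the complete set V+ then grows layer-by-layer, like BFS, until
    max_iter is reached or no new node can be completed.
    """
    x = x_unobs.copy()
    complete = is_vocal.copy()               # initially V+ = vocal nodes
    for _ in range(max_iter):
        newly = np.zeros_like(complete)
        for i in range(len(x)):
            if complete[i]:
                continue
            done = [j for j in neighbors[i] if complete[j]]
            if done:                         # fill from complete neighbors
                x[i] = x[done].mean(axis=0)
                newly[i] = True
        if not newly.any():                  # converged: nothing new completed
            break
        complete |= newly                    # expand V+ after each iteration
    return x, complete
```

With `max_iter=2`, silent nodes two hops away from any vocal node get completed in the second pass, mirroring the $K=2$ behavior discussed in the hyper-parameter analysis.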

Domain Adapted Message Passing (DAMP)
According to the populations of the source and target nodes of each directed edge, we divide message passing on a VS-Graph into four directions: (1) messages from vocal nodes to silent nodes; (2) messages from silent nodes to vocal nodes; (3) messages from vocal nodes to vocal nodes; (4) messages from silent nodes to silent nodes. Directions (1) and (2) are cross-domain (out-of-distribution) message passing, while directions (3) and (4) are within-domain (in-distribution) message passing. Out-of-distribution messages from cross-domain neighbors should not be passed to central nodes directly; otherwise, the domain difference becomes noise that degrades model performance. To solve this problem, we design a Domain Adapted Message Passing (DAMP) module. For messages from cross-domain neighbors, we first calculate a domain-difference scaling factor and then project the OOD features of the source nodes into the domain of the target nodes, e.g., for edges from vocal nodes to silent nodes, we project the features of the vocal nodes into the silent domain and then pass the projected features to the target silent nodes.
Specifically, the DAMP module considers two factors: the bidirectional domain difference and the neighbor importance. Given an edge $e_{i,j} = (v_i, v_j)$, the message function is defined with $\tilde{h}_i$, the source node feature calibrated by the domain-difference factor, and $\alpha_{\Psi(v_j)}(\cdot)$, the neighbor importance function, where $\sigma$ is an activation function and $\Delta_{i,j}$ is the distribution-shift variable; $\Psi(\cdot)$ is the population indicator (see Definition 2.1); $\bar{X}_v = \mathbb{E}_{v_i \sim V_v}(x_i)$ and $\bar{X}_s = \mathbb{E}_{v_j \sim V_s}(\hat{x}_j)$ are respectively the expectations of the features of vocal and silent nodes, and $\bar{X}_v - \bar{X}_s$ represents the domain difference between vocal and silent nodes according to their complete attributes (the attributes of silent nodes having been complemented by DAFC).
With the DAMP module, we calibrate source nodes to the domain of the target nodes during message passing, which eliminates the noise caused by OOD features. Note that the distribution-shift variable $\Delta_{i,j}$ only takes effect for cross-domain message passing ($V_v \to V_s$ or $V_s \to V_v$); $\Delta_{i,j} = 1$ for within-domain message passing ($V_v \to V_v$ or $V_s \to V_s$). With DAMP, we finally obtain the representations of silent and vocal nodes while preserving their domain difference. Unlike mainstream domain adaptation methods, which project samples from the source and target domains into a common space and preserve only domain-invariant features, DAMP conducts message passing while preserving the domain difference.
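The cross-domain calibration idea can be sketched as follows. The paper learns the distribution-shift variable $\Delta_{i,j}$; this simplified stand-in of ours instead translates a cross-domain source feature by the difference of the two domain means, leaving within-domain messages untouched, just to show where the calibration acts.

```python
import numpy as np

def damp_messages(X, edges, is_vocal):
    """Simplified DAMP: project cross-domain messages into the target domain.

    Cross-domain messages (vocal->silent or silent->vocal) are calibrated by
    the difference of domain means before being passed; within-domain messages
    pass through unchanged (as Delta plays no role there in the paper).
    """
    mu_v = X[is_vocal].mean(axis=0)        # vocal-domain mean
    mu_s = X[~is_vocal].mean(axis=0)       # silent-domain mean
    messages = {}
    for (src, dst) in edges:
        msg = X[src].copy()
        if is_vocal[src] != is_vocal[dst]:   # cross-domain edge: calibrate
            msg += (mu_s - mu_v) if is_vocal[src] else (mu_v - mu_s)
        messages[(src, dst)] = msg
    return messages
```

After this translation, a vocal node's message received by a silent node is centered in the silent domain, so the domain gap no longer leaks into the aggregation as noise.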

Domain Transferable Classifier (DTC)
With DAFC and DAMP, we solve the feature-incompleteness problem and obtain representations of silent and vocal nodes that preserve the domain difference. Because the node representations come from two distinct distributions and silent nodes suffer from the data-hungry problem (see Sec. 3 and Sec. 2.2), we cannot directly train a good classifier for silent nodes. To solve the cross-domain and label-scarcity problems of silent nodes, we design a novel Domain Transferable Classifier (DTC) that transfers knowledge from vocal nodes for silent node classification. Traditional domain adaptation methods usually transfer cross-domain knowledge by constraining the learned representations to retain only domain-invariant information, under the assumption that the distributions of the two domains are similar, which may not hold. Rather than constraining representations, DTC transfers model parameters, so that the knowledge of the optimized source classifier can be carried over to the target domain.
The solid rectangular box in Fig. 4-(c) shows our motivation for DTC. Specifically, a classifier trained only on the source domain underfits the target domain due to the domain shift (blue dotted line), while a classifier trained only on the target domain tends to overfit due to label scarcity (orange dotted line). An ideal classifier lies between these two (green line). Based on this motivation (verified experimentally in Fig. 5), we transfer knowledge from both the source and target classifiers to derive a near-ideal classifier by minimizing the KL divergence between them. As shown in Fig. 4-(c), DTC has three components: the source classifier $f_v$, the target classifier $f_s$, and the cross-domain transformation module $\Phi_{v \to s}$.
To predict the silent majority, the source classifier is trained on the vocal nodes and the target classifier is trained on the silent nodes. The cross-domain transformation module transfers the original source-domain classifier into a newly generated target-domain classifier ($f_{v \to s} = \Phi_{v \to s}(f_v(\cdot))$): it takes the parameters of the source-domain classifier as input and generates the parameters of a new target-domain classifier. Specifically, both $f_v$ and $f_s$ are implemented as a single fully-connected layer with sigmoid activation, and $\Phi_{v \to s}$ is implemented as a multi-layer perceptron with nonlinear transformations, giving it the ability to change the shape of the generated classifier's discriminant boundary.
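The parameter-transfer idea can be sketched as follows: flatten the source classifier's weights, push them through a small nonlinear network, and reshape the output into the weights of a generated target classifier. All shapes and names here are our own illustrative choices; the real $\Phi_{v \to s}$ is trained jointly with the rest of KTGNN.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DTCSketch:
    """Sketch of DTC's transformation module (shapes and names are ours).

    f_v is one fully-connected layer with sigmoid; Phi_{v->s} is a small MLP
    that maps f_v's flattened parameters to those of a generated classifier
    f_{v->s}, which is then applied to silent-node representations.
    """
    def __init__(self, d):
        self.w_v = rng.normal(size=d)       # source classifier weights
        self.b_v = 0.0
        p = d + 1                           # flattened parameter count
        self.W1 = rng.normal(size=(p, p))   # Phi_{v->s}: 2-layer MLP (untrained here)
        self.W2 = rng.normal(size=(p, p))

    def generated_classifier(self):
        theta = np.append(self.w_v, self.b_v)        # flatten f_v's parameters
        theta = self.W2 @ np.tanh(self.W1 @ theta)   # nonlinear transformation
        return theta[:-1], theta[-1]                 # weights, bias of f_{v->s}

    def predict_silent(self, X_s):
        w, b = self.generated_classifier()
        return sigmoid(X_s @ w + b)
```

Because $\Phi_{v \to s}$ is nonlinear, the generated classifier's discriminant boundary need not be a rigid copy of the source one, which is exactly the flexibility the paper motivates.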
The loss function used to optimize DTC and the whole model has two parts: the KL loss $\mathcal{L}_{KL}$ and the classification loss $\mathcal{L}_{cls}$.
The KL loss $\mathcal{L}_{KL}$ realizes knowledge transfer between the source and target domains by constraining the discriminant boundary of the generated classifier $f_{v \to s}$ to lie between those of $f_v$ and $f_s$: $\mathcal{L}_{KL} = \mathrm{KL}(\mathbf{P}_{v \to s}^{v}, \mathbf{P}_{v}) + \mathrm{KL}(\mathbf{P}_{v \to s}^{s}, \mathbf{P}_{s})$ (11), where $\mathbf{P}_v$ and $\mathbf{P}_s$ are the output probabilities of $f_v$ and $f_s$, and $\mathbf{P}_{v \to s}^{v}$, $\mathbf{P}_{v \to s}^{s}$ are the output probabilities of $f_{v \to s}$ on vocal and silent nodes, respectively. The classification loss $\mathcal{L}_{cls}$ is defined with the binary cross-entropy loss over the labeled nodes. Combined with the Distribution-Consistency Loss $\mathcal{L}_{dc}$, the final loss function is $\mathcal{L} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{KL} + \mathcal{L}_{dc}$, where $\lambda$ is a hyper-parameter controlling the weight of $\mathcal{L}_{KL}$; we use $\lambda = 1$ in all our experiments.
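Eq. 11 can be computed directly from the classifiers' output probability matrices. The sketch below is a minimal numpy version (function names are ours); it assumes each row of a matrix is a valid probability distribution over the classes.

```python
import numpy as np

def kl(p, q, eps=1e-9):
    """Mean row-wise KL divergence between two probability matrices."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float((p * np.log(p / q)).sum(axis=1).mean())

def dtc_kl_loss(p_gen_v, p_src_v, p_gen_s, p_tgt_s):
    """Eq. 11 sketch: pull the generated classifier's outputs toward the source
    classifier (on vocal nodes) and the target classifier (on silent nodes),
    so its discriminant boundary lands between the two."""
    return kl(p_gen_v, p_src_v) + kl(p_gen_s, p_tgt_s)
```

Minimizing this sum penalizes the generated classifier for straying too far from either sub-classifier, which is the "between the two boundaries" behavior verified in Fig. 5.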

EXPERIMENTS
In this section, we compare KTGNN with other state-of-the-art methods on two real-world datasets in different scenarios.

Datasets
Basic information on the real-world VS-Graphs is shown in Table 1. More details about the datasets are given in Appendix A.
Company: The Company dataset is a VS-Graph of 10,641 real-world companies in China (provided by TianYanCha); the target is to classify each company into one of two classes (risky/normal). On this VS-Graph, vocal nodes denote listed companies and silent nodes denote unlisted companies. The business information and equity graph of every company are available, while financial statements can be obtained only for listed companies (they are missing for unlisted companies). Edges indicate investment relations between companies.
Twitter: Following the dataset proposed by Xiao et al. [70], we construct the Twitter VS-Graph from data crawled from Twitter; the target is to predict the political tendency (binary classes: democrat or republican) of civilians. On this VS-Graph, vocal nodes denote famous politicians and silent nodes denote ordinary civilians. Tweet information is available for all Twitter users, while the personal description from the user's homepage is available only for politicians (missing for civilians). Edges represent follow relations between Twitter users.
For each dataset (Company and Twitter), we randomly divide the annotated silent nodes into train/valid/test sets with a fixed 60%/20%/20% ratio. We add all annotated vocal nodes to the training set because our target is to classify the silent nodes. The detailed hyper-parameter settings of KTGNN and the baselines are presented in Appendix D.

Main Results
We focus on silent node classification on VS-Graphs and conduct comprehensive experiments on two critical real-world scenarios (i.e., political election and company financial risk assessment).
Results of Silent Node Classification: We select two representative metrics (i.e., F1-Score and AUC) to evaluate model performance on the silent node classification task. Since the baseline models cannot directly handle graphs with partially missing features (the silent nodes of VS-Graphs), we combine the baseline GNNs with heuristic feature completion strategies (e.g., completion by zeros, completion by the mean of neighbors) for a fair comparison with our method, choosing the best strategy for each baseline GNN (results in Table 2). Moreover, the results in Table 3 indicate that the "Mean-of-Neighbors" strategy wins in most cases. However, all these heuristic completion strategies ignore the distribution shift between vocal and silent nodes, and are therefore far less effective than our KTGNN model.
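The two heuristic baselines can be stated precisely in a few lines. The sketch below (names are ours) implements the "0-Completion" and "Mean-of-Neighbors" strategies used to make the baseline GNNs runnable; note that both, unlike DAFC, ignore the vocal/silent distribution shift.

```python
import numpy as np

def heuristic_complete(x_unobs, neighbors, is_vocal, strategy="mean"):
    """Baseline completion strategies for the missing (unobservable) dimensions.

    "zero": leave the missing dimensions as zero vectors.
    "mean": fill each silent node with the mean of its vocal neighbors,
            falling back to zeros when it has none.
    """
    x = x_unobs.copy()
    if strategy == "zero":
        return x
    for i in np.flatnonzero(~is_vocal):
        vocal_nbrs = [j for j in neighbors[i] if is_vocal[j]]
        if vocal_nbrs:
            x[i] = x[vocal_nbrs].mean(axis=0)
    return x
```

Because the mean of vocal neighbors is still drawn from the vocal domain, the completed features remain out-of-distribution for silent nodes, which is why these heuristics lag behind KTGNN in Table 3.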

Ablation Study
Effects of Each Module: To validate the effect of each component of KTGNN, we design variants of KTGNN by removing one module at a time (i.e., DAFC, DAMP, DTC, $\mathcal{L}_{KL}$, or $\mathcal{L}_{dc}$); the results are shown in Table 4. The performance of every KTGNN variant deteriorates to some degree, and the full KTGNN model performs best, which demonstrates the importance of each component.
Effects of KL Loss: We also validate the role of the KL loss $\mathcal{L}_{KL}$ (see Eq. 11). As shown in Fig. 5, the final target classifier generated from the source classifier achieves the highest scores and the lowest loss among the three sub-classifiers of the DTC module (see Sec. 4.4). The KL-loss curves (the components of $\mathcal{L}_{KL}$) indicate that the discriminant boundary of the generated target classifier lies between those of the source and target classifiers, further confirming our motivation.
Effects of Cross-Domain Knowledge Transfer: We further validate our approach on borderline cases (i.e., topological traits that may hinder correct knowledge transfer from vocal nodes to silent nodes) by cutting a portion of such edges. The results in Table 5 indicate that our approach is robust in these borderline cases.

HyperParameter Sensitivity Analysis
We examine the sensitivity of KTGNN's main hyper-parameters: $K$ (the max-iteration of DAFC) and $\lambda$ (the weight of $\mathcal{L}_{KL}$). As shown in Fig. 6, we run experiments on the Company dataset with different values of $K$ and $\lambda$. KTGNN performs best with $K = 2$, which matches the graph property that the 2-hop neighborhoods of vocal nodes cover most silent nodes, and with $\lambda = 1.0$, which again indicates the effectiveness of the KL loss.

Representation Visualization
We visualize the learned representations of KTGNN in Fig. 7.

CONCLUSION
In this paper, we study a novel and widespread problem: silent node classification on VS-Graphs ("predicting the silent majority on graphs"), where silent nodes suffer from severe data-hungry problems (missing features and label scarcity). Correspondingly, we design a novel KTGNN model for silent node classification that adaptively transfers useful knowledge from vocal nodes to silent nodes. Comprehensive experiments on two representative real-world VS-Graphs demonstrate the superiority of our approach.

Figure 1 :
Figure 1: Examples of silent node classification. (a) and (b) show the difference between silent node classification and traditional node classification; (c) and (d) show two real-world VS-Graphs.

Figure 2 :
Figure 2: Box plots of single-variable distributions for the company equity graph dataset. Each box plot represents the conditional distribution of one attribute given the node's label and population, where the Y-axis is the attribute value after a log transform.

Figure 3 :
Figure 3: t-SNE visualization of the raw node features, distinguished by population (vocal→orange, silent→cyan) or label (binary classes: positive→red, negative→blue). Note that we only visualize the observable attributes $X^o$ (the dimensions shared by vocal and silent nodes).
Specifically, Fig. 3 (a) shows the nodes of the different populations in the Twitter dataset (the two colors denote vocal and silent nodes); Fig. 3 (b) shows the vocal nodes (politicians) of different classes (i.e., political parties: democrat and republican) and Fig. 3 (c) shows the silent nodes (civilians) of different classes. For the Company dataset, Fig. 3 (d) shows the nodes of the different populations (the two colors denote vocal and silent nodes); Fig. 3 (e) shows the vocal nodes (listed companies) of different classes (i.e., risky/normal company) and Fig. 3 (f) shows the silent nodes (unlisted companies) of different classes. Fig. 3 (a) and (d) demonstrate that nodes from different populations have distinct distributions, reflected as separate clusters with distinct shapes and positions. From Fig. 3 (b) and (e), we observe that the vocal nodes of different classes have similar distributions, reflected as indistinguishable clusters mixed together; the silent nodes of different classes (Fig. 3 (c) and (f)) show the same phenomenon. All these findings demonstrate a significant distribution shift between vocal and silent nodes on these real-world VS-Graphs.
where $\mathbf{P}_v \in \mathbb{R}^{|V_v| \times |\mathcal{C}|}$ is the output probability of $f_v$, $\mathbf{P}_s \in \mathbb{R}^{|V_s| \times |\mathcal{C}|}$ is the output probability of $f_s$, $\mathbf{P}_{v \to s}^{v} \in \mathbb{R}^{|V_v| \times |\mathcal{C}|}$ and $\mathbf{P}_{v \to s}^{s} \in \mathbb{R}^{|V_s| \times |\mathcal{C}|}$ are the output probabilities of $f_{v \to s}$, and $|\mathcal{C}|$ is the number of classes ($|\mathcal{C}| = 2$ for binary classification tasks).

Figure 5 :
Figure 5: F1 Score and Loss Curve of KTGNN on Company dataset (results of Twitter dataset are shown in appendix).

Figure 6 :
Figure 6: Hyper-parameter analysis of $K$ and $\lambda$ on the Company dataset (see results on the Twitter dataset in Fig. 9 of the Appendix).
Subfigures (a) and (d) show that KTGNN preserves the domain difference between vocal and silent nodes during message passing, and subfigures (b), (c) and (e), (f) show that the representations learned by KTGNN separate the classes better, for both vocal and silent nodes, than the t-SNE visualization of the raw features in Fig. 3 (the representation visualizations of the baseline GNNs are shown in the appendix).

Figure 7 :
Figure 7: t-SNE visualization of the completed node representations learned by KTGNN, distinguished by population (vocal→orange, silent→cyan) or node label (binary classes: positive→red, negative→blue).
where $W_1 \in \mathbb{R}^{d_o \times d}$ and $W_2 \in \mathbb{R}^{2d \times d}$ are learnable parameter matrices and $\phi$ is a tanh activation function; $\Delta^u$ is the transformed domain difference; $\bar{X}_v^o = \mathbb{E}_{v_i \sim V_v}(x_i^o)$ and $\bar{X}_s^o = \mathbb{E}_{v_j \sim V_s}(x_j^o)$ are respectively the expectations of the observable features of vocal and silent nodes, and $\bar{X}_v^o - \bar{X}_s^o$ represents the domain difference between vocal and silent nodes according to their observable attributes. In this part, we aim to learn the unobservable domain difference based on the observable domain difference $\bar{X}_v^o - \bar{X}_s^o$.

Figure 4 annotations: Iteration 1: vocal nodes → silent nodes; Iteration 2: complete silent nodes → incomplete silent nodes; repeat until convergence. Message-passing directions: (1) messages from vocal nodes to silent nodes; (2) messages from silent nodes to vocal nodes; (3) messages from vocal nodes to vocal nodes; (4) messages from silent nodes to silent nodes. Among the four directions, (1) and (2) are cross-domain (out-of-distribution).

Table 1 :
Basic information on the datasets used in this paper.

Table 2 :
Results of silent node classification. All base models except KTGNN are combined with their optimal heuristic feature completion strategy (chosen per baseline GNN). As shown in Table 2, our KTGNN model gains significant improvements in silent node classification on both the Twitter and Company datasets compared with other state-of-the-art GNNs.

Table 3 :
Results of models with different completion strategies on the Company dataset. "None" means we use only the observable attributes (i.e., $X^o$) for both vocal and silent nodes, without completion; "0-Completion" and "Mean-of-Neighbors" use the zero vector and the mean vector of vocal neighbors, respectively, to complete the missing dimensions of silent nodes. Effects of Feature Completion Strategy: For the baseline GNNs, we also analyze the effects of the heuristic feature completion methods on the Company dataset in Table 3 (results on the Twitter dataset are presented in Table 6 in the Appendix).

Table 4 :
Ablation study of KTGNN compared with its variants obtained by removing individual components.