Link Prediction on Latent Heterogeneous Graphs

On graph data, the multitude of node or edge types gives rise to heterogeneous information networks (HINs). To preserve the heterogeneous semantics on HINs, the rich node/edge types become a cornerstone of HIN representation learning. However, in real-world scenarios, type information is often noisy, missing or inaccessible. Assuming no type information is given, we define a so-called latent heterogeneous graph (LHG), which carries latent heterogeneous semantics as the node/edge types cannot be observed. In this paper, we study the challenging and unexplored problem of link prediction on an LHG. As existing approaches depend heavily on type-based information, they are suboptimal or even inapplicable on LHGs. To address the absence of type information, we propose a model named LHGNN, based on the novel idea of semantic embedding at node and path levels, to capture latent semantics on and between nodes. We further design a personalization function to modulate the heterogeneous contexts conditioned on their latent semantics w.r.t. the target node, to enable finer-grained aggregation. Finally, we conduct extensive experiments on four benchmark datasets, and demonstrate the superior performance of LHGNN.


INTRODUCTION
Objects often interact with each other to form graphs, such as the Web and social networks. The prevalence of graph data has catalyzed graph analysis in various disciplines. In particular, link prediction [43] is a fundamental graph analysis task, enabling widespread applications such as friend suggestion in social networks, recommendation in e-commerce graphs, and citation prediction in academic networks. In these real-world applications, the graphs are typically heterogeneous as opposed to homogeneous, also known as Heterogeneous Information Networks (HINs) [39], in which multiple types of nodes and edges exist. For instance, the academic HIN shown in the top half of Fig. 1(a) interconnects nodes of three types, namely, Author (A), Paper (P) and Conference (C), through different types of edges such as "writes/written by" between author and paper nodes and "publishes/published in" between conference and paper nodes. The multitude of node and edge types in HINs implies rich and diverse semantics on and between nodes, which opens up great opportunities for link prediction.
A crucial step for link prediction is to derive features from an input graph. Recent literature focuses on graph representation learning [4,37], which aims to map the nodes on the graph into a low-dimensional space that preserves the graph structures. Various approaches exist, ranging from the earlier shallow embedding models [10,27,32] to more recent message-passing graph neural networks (GNNs) [11,18,34,38]. Representation learning on HINs generally follows the same paradigm, but aims to preserve the heterogeneous semantics in addition to the graph structures in the low-dimensional space. To express heterogeneous semantics, existing work resorts to type-based information, including simple node/edge types (e.g., an author node carries different semantics from a paper node), and type structures like metapaths [30] (e.g., the metapath A-P-A implies two authors are collaborators, whereas A-P-C-P-A implies two authors in the same field; see Sect. 3 for the metapath definition). Among the state-of-the-art heterogeneous GNNs, while hinging on the common operation of message passing, some exploit node/edge types [12,16,22,41] and others employ type structures [9,28,35].
Our problem. The multitude of node or edge types gives rise to rich heterogeneous semantics on HINs, and forms the key thesis of HIN representation learning [39]. However, in many real-world scenarios, type information is often noisy, missing or inaccessible. One reason is that type information does not exist explicitly and has to be deduced. For instance, when extracting entities and their relations from texts to construct a knowledge graph, NLP techniques are widely used to classify the extractions into different types, which can be noisy. Another reason is privacy and security, such that the nodes in a network may partially or fully hide their identities and types. Lastly, even an apparently homogeneous graph, such as a social network consisting only of users and their mutual friendships, could have fine-grained latent types, such as different types of users (e.g., students and professionals) and different types of friendships (e.g., friends, family and colleagues), but we cannot observe such latent types. Formally, we refer to HINs without explicit type information as Latent Heterogeneous Graphs (LHGs), as shown in the bottom half of Fig. 1(a). The key difference on an LHG is that, while different types still exist, the type information is completely inaccessible and cannot be observed by data consumers. It implies that LHGs still carry rich heterogeneous semantics that are crucial to effective representation learning, but the heterogeneity becomes latent and presents a much more challenging scenario, given that types or type structures can no longer be used. In this paper, we investigate this unexplored problem of link prediction on LHGs, which calls for modeling the latent semantics on LHGs, as links are formed out of relational semantics between nodes.

Challenges and insights.
We propose a novel model for link prediction on LHGs, named Latent Heterogeneous Graph Neural Network (LHGNN). Our general idea is to develop a latent heterogeneous message-passing mechanism on an LHG, in order to exploit the latent semantics between nodes for link prediction. More specifically, we must address two major challenges.
First, how can we capture the latent semantics on and between nodes without any type information? In the absence of explicit type information, we resort to semantic embedding at both node and path levels to depict the latent semantics. At the node level, we complement the traditional representation of each node (e.g., h_4 in Fig. 1(b), which we call the primary embedding) with an additional semantic embedding (e.g., s_4). While the primary embedding can be regarded as a blend of content and structure information, the semantic embedding aims to distill the more subtle semantic information (i.e., the latent node type and the relation types a node tends to associate with), which could have been directly expressed if explicit types were available. Subsequently, at the path level, we can learn a semantic path embedding based on the node-level semantic embeddings in a path on an LHG, e.g., s_4, s_1 and s_0 in the path v_4-v_1-v_0 in Fig. 1(b). The path-level semantic embeddings aim to capture the latent high-order relational semantics between nodes, such as the collaborator relation between authors a_4 and a_0 in Fig. 1(a), to mimic the role played by metapaths such as A-P-A.
Second, as the context nodes of a target node often carry latent heterogeneous semantics, how can we differentiate them for finer-grained context aggregation? We propose a learnable personalization function to modulate the original message from each context node. The personalization hinges on the semantic path embedding between each context node and the target node as the key differentiator of the heterogeneous semantics carried by different context nodes. For illustration, refer to the top half of Fig. 1(b), where the green nodes (e.g., v_1, v_4 and v_6) are the context nodes of the doubly-circled target node (e.g., v_0). These context nodes carry latent heterogeneous semantics (e.g., v_1 is a paper written by v_0, v_4 is a collaborator of v_0, and v_6 is a related paper v_0 might be interested in), and thus can be personalized by the semantic path embedding between each context and the target node before aggregating them.
Contributions. In summary, our contributions are three-fold. (1) We investigate a novel problem of link prediction on latent heterogeneous graphs, which differs from traditional HINs due to the absence of type information. (2) We propose a novel model LHGNN based on the key idea of semantic embedding to bridge the gap for representation learning on LHGs. LHGNN is capable of inferring both node- and path-level semantics, in order to personalize the latent heterogeneous contexts for finer-grained message passing within a GNN architecture. (3) Extensive experiments on four real-world datasets demonstrate the superior performance of LHGNN in comparison to the state-of-the-art baselines.

RELATED WORK
Graph neural networks. GNNs have recently become the mainstream for graph representation learning. Modern GNNs typically follow a message-passing scheme, which derives the low-dimensional embedding of a target node by aggregating messages from context nodes. Different schemes of context aggregation have been proposed, ranging from simple mean pooling [11,18] to neural attention [34] and other neural networks [38].
For representation learning on HINs, a plethora of heterogeneous GNNs have been proposed. They depend on type-based information to capture the heterogeneity, just as earlier HIN embedding approaches [5,8,14]. On one hand, many of them leverage simple node/edge types. HetGNN [41] groups random walks by node types, and then applies a bi-LSTM to aggregate node- and type-level messages. HGT [16] employs node- and edge-type dependent parameters to compute the heterogeneous attention over each edge. Simple-HGN [22] extends the edge attention with a learnable edge type embedding, whereas HetSANN [12] employs a type-aware attention layer. On the other hand, high-order type structures such as metapaths [30] or metagraphs [6] have also been used. HAN [35] uses metapaths to build homogeneous neighbor graphs to facilitate node- and semantic-level attention in message aggregation, whereas MAGNN [9] proposes several metapath instance encoders to account for intermediate nodes in a metapath instance. Meta-GNN [28] differentiates context nodes based on metagraphs. Another work, Space4HGNN [45], proposes a unified design space to build heterogeneous GNNs in a modularized manner, which can potentially utilize various type-based information. Despite their effectiveness on HINs, these models cannot be applied to LHGs due to their need for explicit type information.

Knowledge graph embedding.
A knowledge graph consists of a large number of relations between head and tail entities. Translation models [3,19,36] are popular approaches that treat each relation as a translation in some vector space. For example, TransE [3] models a relation as a translation between head and tail entities in the same embedding space, while TransR [19] further maps entities into multiple relation spaces to enhance semantic expressiveness. Separately, RotatE [31] models each relation as a rotation from head to tail entities in complex space, and is able to capture different relation patterns such as symmetry and inversion. These models require the edge type (i.e., relation) as input, which is not available on an LHG. Compared to heterogeneous GNNs, they do not utilize entity features or model multi-hop interactions between entities, which can lead to inferior performance on HINs.
Predicting missing types. A few studies [13,25] aim to predict missing entity types on knowledge graphs. However, a recent work [42] shows that these approaches tend to propagate type prediction errors on the graph, which harms the performance of other tasks like link prediction. Therefore, RPGNN [42] proposes a relation encoder, which is adaptive for each node pair to handle the missing types. However, all of them still require partial type information from a subset of nodes and edges for supervised training, which makes them infeasible in the LHG setting.

PRELIMINARIES
We first review or define the core concepts in this work.
Heterogeneous Information Networks (HIN). An HIN [29] is defined as a graph G = (V, E, T, R, φ, ψ), where V denotes the set of nodes and T denotes the set of node types, E denotes the set of edges and R denotes the set of edge types. Moreover, φ : V → T and ψ : E → R are functions that map each node and edge to their types in T and R, respectively. G is an HIN if |T| + |R| > 2.
On an LHG, the metapaths become latent too, as the node types and edge types are not observable. Generally, any path p = (v_1, v_2, ..., v_{k+1}) on an LHG is an instance of some latent metapath, which carries latent semantics representing an unknown composition of relations between the starting node v_1 and the end node v_{k+1}.
Link prediction on LHG. Given a query node, we rank other nodes by their probability of forming a link with the query. The difference from the conventional task lies in the input graph: here we are given an LHG.

PROPOSED METHOD: LHGNN
In this section, we introduce the proposed method LHGNN for link prediction on latent heterogeneous graphs.

Overall Framework
We start with the overall framework of LHGNN, as presented in Fig. 2. An LHG is given as input, as illustrated in Fig. 2(a), and is fed into an LHGNN layer in Fig. 2(b). Multiple layers can be stacked, and the last layer outputs the node representations, which are further fed into a link encoder for the task of link prediction, as shown in Fig. 2(c). More specifically, the LHGNN layer is our core component, which consists of two sub-modules: a semantic embedding sub-module to learn node-level and path-level latent semantics, and a latent heterogeneous context aggregation sub-module to aggregate messages for the target node. We describe each sub-module and the link prediction task in the following.

Semantic Embedding
Semantic embedding aims to model both node- and path-level latent semantics, as illustrated in Fig. 2(b1).
Node-level semantic embedding. For each node v, alongside its primary embedding h_v, we propose an additional semantic embedding s_v. Similar to node embeddings on homogeneous graphs, the primary embeddings intend to capture the overall content and structure information of nodes. However, on an LHG, the content of a node contains not only concrete topics and preferences, but also subtle semantic information inherent to nodes of each latent type (e.g., the node type, and the potential single- or multi-hop relations a node tends to be part of). Hence, we propose a semantic encoder to locate and distill the subtle semantic information from the primary embeddings to generate semantic embeddings, which will later be employed for link prediction. Note that this is different from disentangled representation learning [1], which can be regarded as disentangling a mixture of latent topics.
Specifically, in the l-th layer, given the primary embeddings from the previous layer (h^{l-1}_{v_1}, h^{l-1}_{v_2}, ...), a semantic encoder generates the corresponding semantic embeddings (s^l_{v_1}, s^l_{v_2}, ...). For each node v, the semantic encoder f^l_s extracts the semantic embedding s^l_v from its primary embedding h^{l-1}_v:

s^l_v = f^l_s(h^{l-1}_v; θ^l_s).

While f^l_s can take many forms, we simply materialize it as a fully connected layer:

s^l_v = σ(W^l_s h^{l-1}_v + b^l_s),

where W^l_s and b^l_s are learnable parameters and σ is an activation function. Since the semantic embedding only distills the subtle semantic information from the primary embedding, it needs much fewer dimensions, i.e., d_s ≪ d_h.

Path-level semantic embedding. A target node is often connected with many context nodes through paths. On an LHG, these paths may carry different latent semantics by virtue of the heterogeneous multi-hop relations between nodes. On an HIN, metapaths have been a popular tool to capture the heterogeneous semantics from different context nodes. For example, in the top half of Fig. 1(a), the HIN consists of three types of nodes. There exist different metapaths that capture different semantics between authors: A-P-A for authors who have directly collaborated, or A-P-C-P-A for authors who are likely in the same field, etc. However, on an LHG, we do not have access to the node types, and thus are unable to define or use any metapath. Thus, we employ a path encoder to fuse the node-level semantic embeddings associated with a path into a path-level embedding. The path-level semantic embeddings attempt to mimic the role of metapaths on an HIN, to capture the latent heterogeneous semantics between nodes. Concretely, we first perform random walks to sample a set of paths. Starting from each target node, we sample M random walks of length L_max, e.g., {p_1, ..., p_4} for v_1 as shown in Fig. 2(b). Then, for each sampled walk, we truncate it into a shorter path of length k ≤ L_max. These paths of varying lengths can capture latent semantics of different ranges, and serve as instances of different latent metapaths.
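To make the two steps above concrete, here is a minimal NumPy sketch of a semantic encoder materialized as a fully connected layer, together with truncated random-walk path sampling. All names (semantic_encoder, sample_paths), the tanh activation, and the toy dimensions are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_s = 64, 8  # semantic dimension d_s is much smaller than primary d_h

# Parameters of the fully connected semantic encoder (one per LHGNN layer).
W_s = rng.normal(scale=0.1, size=(d_s, d_h))
b_s = np.zeros(d_s)

def semantic_encoder(h_prev):
    """Distill a node's semantic embedding s_v from its primary embedding h_v."""
    return np.tanh(W_s @ h_prev + b_s)

def sample_paths(adj, start, M, L_max, rng):
    """Sample M random walks of length L_max from `start`, each truncated to a
    random prefix, yielding paths of varying lengths k <= L_max."""
    paths = []
    for _ in range(M):
        walk = [start]
        for _ in range(L_max):
            walk.append(int(rng.choice(adj[walk[-1]])))
        k = int(rng.integers(1, L_max + 1))  # truncated length
        paths.append(walk[: k + 1])
    return paths

h_v = semantic_input = rng.normal(size=d_h)   # primary embedding from layer l-1
s_v = semantic_encoder(h_v)                   # semantic embedding at layer l

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}       # toy LHG as an adjacency list
paths = sample_paths(adj, start=0, M=4, L_max=3, rng=rng)
```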
Next, for each path p, a path encoder encodes it to generate a semantic path embedding s^l_p. Let P_v denote the set of sampled paths starting from a target node v. If there exists a path p ∈ P_v such that p ends at node u, we call u a context node of v, or simply a context of v. For instance, given the set of paths P_{v_1} = {p_1, ..., p_4} for the target node v_1 in Fig. 2(b1), the contexts are v_4, v_8 and v_7, which carry different semantics for v_1. In the l-th layer, we employ a path encoder f^l_p to embed each path p = (v_1, v_2, ..., v_{k+1}) into a semantic path embedding s^l_p ∈ R^{d_s}, based on the node-level semantic embeddings of the nodes in the path:

s^l_p = f^l_p(s^l_{v_1}, s^l_{v_2}, ..., s^l_{v_{k+1}}; θ^l_p),
where f^l_p can take many forms, ranging from simple pooling to recurrent neural networks [24] and transformers [33]. As the design of the path encoder is not the focus of this paper, we simply implement it as mean pooling, which computes the mean of the node-level semantic embeddings as the path-level embedding.
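The mean-pooling path encoder is then a one-liner. The sketch below (with our own naming) averages the node-level semantic embeddings along a path into the path-level embedding:

```python
import numpy as np

def path_encoder(node_semantics):
    """Mean-pool node-level semantic embeddings along a path into a single
    path-level semantic embedding."""
    return np.mean(np.stack(node_semantics, axis=0), axis=0)

# a toy length-2 path (v1, v2, v3) with 4-dimensional semantic embeddings
s1 = np.array([1.0, 0.0, 0.0, 0.0])
s2 = np.array([0.0, 1.0, 0.0, 0.0])
s3 = np.array([0.0, 0.0, 1.0, 0.0])
s_p = path_encoder([s1, s2, s3])
```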
Remark. It is possible or even likely to have multiple paths between the target and context nodes. Intrinsically, these paths are instances of one or more latent metapaths, which bears a two-fold advantage. First, as each latent metapath depicts a particular semantic relationship between two nodes, having multiple latent metapaths could capture different semantics between the two nodes. This is more general than many heterogeneous GNNs [9,35] on an HIN, which rely on a few handcrafted metapaths. Second, it is much more efficient to sample and process paths than subgraphs [17]. Although a metagraph can express more complex semantics than a metapath [44], the combination of multiple metapaths is a good approximation, especially given the efficiency trade-off.

Latent Heterogeneous Context Aggregation
To derive the primary embedding of the target node, the next step is to perform latent heterogeneous context aggregation. The aggregation generally follows a message-passing GNN architecture, where messages from context nodes are passed to the target node. For example, consider the target node v_1 shown in Fig. 2(b2). The contexts of v_1 include {v_4, v_8, v_7} based on the paths p_1, ..., p_4, and their messages (i.e., embeddings) are aggregated to generate the primary embedding of v_1.
However, on an LHG, the contexts of a target node carry latent heterogeneous semantics. Thus, these heterogeneous contexts should be differentiated before aggregating them, in order to preserve the latent semantics at a fine granularity. Note that on an HIN, the given node/edge types or type structures can be employed as explicit differentiators for the contexts. In contrast, the lack of type information on an LHG prevents the adoption of such explicit differentiators, and we resort to semantic path embeddings as the basis to differentiate the messages from heterogeneous contexts. That is, in each layer, we personalize the message from each context node conditioned on the semantic path embedding between the target and context node. The personalized context messages are finally aggregated to generate the primary embedding of the target node, i.e., the output of the current layer.
Context personalization. Consider a target node v, a context node u, and their connecting path p. In particular, u can be differentiated from other contexts of v given their connecting path p, which is associated with a unique semantic path embedding. Note that there may be more than one path connecting u and v. In that case, we treat u as multiple contexts, one instance for each path, as each path carries different latent semantics and the corresponding context instance needs to be differentiated.
Specifically, in the l-th layer, we personalize the message from u to v with a learnable transformation function τ, which modulates u's original message h^{l-1}_u (i.e., its primary embedding from the previous layer) into a personalized message h^l_{u|p} conditioned on the path p between u and v. That is,

h^l_{u|p} = τ(h^{l-1}_u; θ^l_τ),

where the transformation function τ(·; θ^l_τ) is learnable with parameters θ^l_τ. We implement τ using a layer of feature-wise linear modulation (FiLM) [20,26], which enables the personalization of a message (e.g., h^{l-1}_u) conditioned on arbitrary input (e.g., s^l_p). The FiLM layer learns to perform scaling and shifting operations to modulate the original message:

h^l_{u|p} = (1 + γ^l_p) ⊙ h^{l-1}_u + β^l_p,

where γ^l_p is a scaling vector and β^l_p is a shifting vector, both of which are learnable and specific to the path p. Note that 1 is a vector of ones to center the scaling around one, and ⊙ denotes element-wise multiplication. To make γ^l_p and β^l_p learnable, we materialize them using a fully connected layer, which takes the semantic path embedding s^l_p as input to become conditioned on the path p, as follows:

γ^l_p = W^l_γ s^l_p + b^l_γ,   β^l_p = W^l_β s^l_p + b^l_β,

where W^l_γ, W^l_β, b^l_γ and b^l_β are learnable parameters. The personalized messages are then aggregated into a context embedding for the target node v:

c^l_v = Mean({e^{-α·len(p)} · h^l_{u|p} : p ∈ P_v}),

where len(p) gives the length of the path p, so that e^{-α·len(p)} acts as a weighting scheme to bias toward shorter paths, and α > 0 is a hyperparameter controlling the decay rate. We use mean-pooling as the aggregation function, although other choices such as sum- or max-pooling could also be used.
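A minimal FiLM-style personalization can be sketched as follows. The conditioning layers are linear and bias-free here purely for brevity, which is an assumption on our part:

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_s = 16, 4

# FC layers producing the scaling and shifting vectors from the path embedding
W_gamma = rng.normal(scale=0.1, size=(d_h, d_s))
W_beta = rng.normal(scale=0.1, size=(d_h, d_s))

def film_personalize(h_u, s_p):
    """Modulate the context message: h_{u|p} = (1 + gamma_p) * h_u + beta_p,
    with gamma_p and beta_p conditioned on the path embedding s_p."""
    gamma = W_gamma @ s_p
    beta = W_beta @ s_p
    return (1.0 + gamma) * h_u + beta

h_u = rng.normal(size=d_h)   # context node's primary embedding (layer l-1)
s_p = rng.normal(size=d_s)   # semantic path embedding
h_up = film_personalize(h_u, s_p)
```

With a zero path embedding the modulation reduces to the identity, which is exactly why the scaling is centered around one.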
Note that the self-information of the target node is also aggregated, by defining a self-loop on the target node as a special path. More specifically, given a target node v and its self-loop p_v, we define len(p_v) = 0 and h^l_{v|p_v} = h^{l-1}_v, which means the original message of the target node is included in c^l_v with a weight of 1. Finally, based on the aggregated context embedding, we obtain the primary embedding of node v in the l-th layer:

h^l_v = σ(W^l_o c^l_v + b^l_o),

where W^l_o and b^l_o are learnable parameters and σ is an activation function.
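Putting the pieces together, the decay-weighted aggregation with the self-loop handled as a length-0 path can be sketched as below; the tanh output layer and the value of alpha are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d_h = 16
alpha = 0.5  # decay-rate hyperparameter (value assumed here)

W_o = rng.normal(scale=0.1, size=(d_h, d_h))
b_o = np.zeros(d_h)

def aggregate(h_self, messages, path_lens):
    """Mean-pool personalized messages weighted by exp(-alpha * len(p));
    the target's self-loop has length 0, hence weight exp(0) = 1."""
    msgs = np.stack([h_self] + list(messages))
    lens = np.asarray([0] + list(path_lens), dtype=float)
    weights = np.exp(-alpha * lens)
    c = np.mean(weights[:, None] * msgs, axis=0)  # context embedding c_v
    return np.tanh(W_o @ c + b_o)                 # primary embedding h_v

h_v = rng.normal(size=d_h)
msgs = [rng.normal(size=d_h) for _ in range(3)]
h_new = aggregate(h_v, msgs, path_lens=[1, 2, 2])
```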

Link Prediction
In the following, we discuss the treatment of the link prediction task on an LHG, as illustrated in Fig. 2(c). In particular, we present a link encoder and the loss function.
Link encoder. For link prediction between two nodes, we design a link encoder to capture the potential latent relationships between the two candidates. Given two candidate nodes u and v and their respective semantic embeddings s_u, s_v obtained from the last LHGNN layer, the link encoder is materialized in the form of a recurrent unit to generate a pairwise semantic embedding:

s_{u,v} = σ(W s_u + U s_v + b),

where s_{u,v} ∈ R^{d_h} can be interpreted as an embedding of the latent relationships between the two nodes u and v, and W, U ∈ R^{d_h × d_s}, b ∈ R^{d_h} are learnable parameters. Here d_h and d_s are the numbers of dimensions of the primary and semantic embeddings from the last layer, respectively. Note that s_{u,v} has the same dimension as the node representations, so it can be used as a translation to relate nodes u and v in the loss function in the next part.
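The exact recurrent-unit form is not fully reproduced here, so the sketch below uses a simple gate-free fusion tanh(W s_u + U s_v + b) as a stand-in; the dimensions follow the text (inputs in R^{d_s}, output in R^{d_h}), while the activation is our assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
d_h, d_s = 16, 4

W = rng.normal(scale=0.1, size=(d_h, d_s))
U = rng.normal(scale=0.1, size=(d_h, d_s))
b = np.zeros(d_h)

def link_encoder(s_u, s_v):
    """Fuse two semantic embeddings into a pairwise embedding s_{u,v} in
    R^{d_h}, the same dimension as the primary node embeddings."""
    return np.tanh(W @ s_u + U @ s_v + b)

s_u, s_v = rng.normal(size=d_s), rng.normal(size=d_s)
s_uv = link_encoder(s_u, s_v)
```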
Loss function. We adopt a triplet loss for link prediction. For an edge (u, v) ∈ E, we construct a triplet (u, v, v'), where v' is a negative node randomly sampled from the graph. Inspired by translation models on knowledge graphs [3,19], v can be obtained from u by a translation that approximates the latent relationships between u and v, i.e., h_v ≈ h_u + s_{u,v}. Note that h_v denotes the primary node embedding from the final LHGNN layer. In contrast, since v' is a random node unrelated to u, h_{v'} cannot be approximated by the translation. Thus, given a set of training triplets T, we formulate the following triplet margin loss for the task:

L_task = Σ_{(u,v,v') ∈ T} max(0, d(u, v) − d(u, v') + ε),

where d(u, v) = ∥h_u + s_{u,v} − h_v∥_2 is the Euclidean norm of the translational error, and ε > 0 is the margin hyperparameter.
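The translational distance and the margin loss are straightforward to sketch; the 2-d toy values below are ours, chosen so a perfect translation yields zero positive error:

```python
import numpy as np

def dist(h_u, h_v, s_uv):
    """Translational error d(u, v) = ||h_u + s_{u,v} - h_v||_2."""
    return np.linalg.norm(h_u + s_uv - h_v)

def triplet_margin_loss(h_u, h_v, h_neg, s_pos, s_neg, margin=1.0):
    """Push the positive pair's error below the negative's by at least `margin`."""
    return max(0.0, dist(h_u, h_v, s_pos) - dist(h_u, h_neg, s_neg) + margin)

# toy 2-d check: h_v == h_u + s_{u,v}, so the positive error is exactly 0
h_u = np.array([0.0, 0.0])
s_uv = np.array([1.0, 0.0])
h_v = np.array([1.0, 0.0])
h_far = np.array([5.0, 0.0])      # easy negative: loss is clipped to 0
h_near = np.array([1.5, 0.0])     # hard negative: loss is 0.5
loss_easy = triplet_margin_loss(h_u, h_v, h_far, s_uv, s_uv)
loss_hard = triplet_margin_loss(h_u, h_v, h_near, s_uv, s_uv)
```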
Besides the task loss, we also add constraints on the scaling and shifting in the FiLM layers. During training, the scaling and shifting may become arbitrarily large to overfit the data. To prevent this issue, we restrict the search space with the following loss term on the scaling and shifting vectors:

L_reg = Σ_{l=1}^{ℓ} Σ_{p ∈ P} (∥γ^l_p∥²_2 + ∥β^l_p∥²_2),

where ℓ is the total number of layers and P is the set of all sampled paths. The overall loss is then

L = L_task + λ · L_reg,

where λ > 0 is a hyperparameter to balance the loss terms. We further present the training algorithm for LHGNN in Appendix A, and give a complexity analysis therein.
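The constraint can be sketched as a squared-L2 penalty, which is consistent with "restricting the search space" but is our assumption on the exact norm:

```python
import numpy as np

def film_regularizer(gammas, betas):
    """Squared-L2 penalty over all scaling/shifting vectors (across layers
    and sampled paths), keeping the FiLM modulation close to the identity."""
    return sum(float(np.sum(g ** 2)) for g in gammas) + \
           sum(float(np.sum(b ** 2)) for b in betas)

def total_loss(task_loss, gammas, betas, lam=0.01):
    """Overall objective: task loss plus the weighted FiLM constraint."""
    return task_loss + lam * film_regularizer(gammas, betas)

reg = film_regularizer([np.array([1.0, 2.0])], [np.array([3.0])])  # 1 + 4 + 9
loss = total_loss(1.0, [np.array([1.0, 2.0])], [np.array([3.0])], lam=0.1)
```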

EXPERIMENTS
In this section, we conduct extensive experiments to evaluate the effectiveness of LHGNN on four benchmark datasets.

Experimental Setup
Datasets. We employ four graph datasets, summarized in Table 1. Note that, while all these graphs include node or edge types, we hide such type information to transform them into LHGs.
• FB15k-237 [31] is a refined subset of Freebase [2], a knowledge graph about general facts. It is curated with the most frequently used relations, in which each node is an entity and each edge represents a relation.
• WN18RR [31] is a refined subset of WordNet [23], a knowledge graph capturing lexical relations of vocabulary.
• DBLP [40] is an academic bibliographic network, which includes three types of nodes, i.e., paper (P), author (A) and conference (C), as well as four types of edges, i.e., P-A, A-P, P-C, C-P. The node features are 334-dimensional bag-of-words vectors over the keywords.
• OGB-MAG [15] is a large-scale academic graph. It contains four types of nodes, i.e., paper (P), author (A), institution (I) and field (F), as well as four types of edges, i.e., A-I, A-P, P-P, P-F. The feature of each paper is a 128-dimensional vector generated by word2vec, while the node features of the other types are generated by metapath2vec [5] with the same dimension, following previous work [7]. In our experiments, we randomly sample a subgraph with around 100K entities from the original graph using breadth-first search.
Baselines.For a comprehensive comparison, we employ baselines from three major categories.
Note that the heterogeneous GNNs and translation models require node/edge types as input; to apply them to an LHG, we adopt two strategies: either treating all nodes or edges as one type, or generating pseudo types, which we elaborate later. See Appendix B for more detailed descriptions of the baselines.
Model settings.See Appendix D for the hyperparameters and other settings of the baselines and our method.

Evaluation of Link Prediction
In this part, we evaluate the performance of LHGNN on the main task of link prediction on LHGs.
Settings. For the knowledge graph datasets (FB15k-237 and WN18RR), we use the same training/validation/testing proportions as in previous work [31], as given in Table 1. For the other datasets, we adopt an 80%/10%/10% random split of the links. Note that the training graphs are reconstructed from only the training links.
We adopt ranking-based evaluation metrics for link prediction, namely, NDCG and MAP [21]. In the validation and testing sets, given a ground-truth link (u, v), we randomly sample another 9 nodes which are not linked to u as negative nodes, and form a candidate list together with node v. For evaluation, we rank the 10 candidates based on their scores w.r.t. node u. For our LHGNN, the score for a candidate link (u, v) is computed as −∥h_u + s_{u,v} − h_v∥_2. For the classic and heterogeneous GNN models, we implement the same triplet margin loss, as given by Eq. (10). The only difference is that d(u, v) is defined as ∥h_u − h_v∥_2 in the absence of semantic embeddings. Similarly, their link scoring function is defined as −∥h_u − h_v∥_2 for a candidate link (u, v). Translation models also use the same loss and scoring function as ours, except for replacing our link encoding s_{u,v} with their type-based relation embedding.
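Under this protocol each query has exactly one relevant candidate, in which case AP reduces to 1/rank and DCG to 1/log2(rank + 1); one way to compute the two metrics is sketched below (our own helper names, not the paper's evaluation code):

```python
import numpy as np

def rank_of_truth(scores, truth_idx):
    """1-based rank of the ground-truth candidate under descending scores."""
    order = np.argsort(-np.asarray(scores))
    return int(np.where(order == truth_idx)[0][0]) + 1

def map_ndcg(all_scores, all_truth):
    """With a single relevant item per query: AP = 1/rank,
    DCG = 1/log2(rank + 1); average over queries."""
    ranks = [rank_of_truth(s, t) for s, t in zip(all_scores, all_truth)]
    ap = float(np.mean([1.0 / r for r in ranks]))
    ndcg = float(np.mean([1.0 / np.log2(r + 1) for r in ranks]))
    return ap, ndcg

scores = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]]
truth = [0, 2]   # ranked 1st in the first list, 2nd in the second
m, n = map_ndcg(scores, truth)
```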

Scenarios of comparison.
Since the type information is inaccessible on LHGs, for heterogeneous GNNs and the translation methods, we consider the following scenarios.
The first scenario is to treat all nodes/edges as only one type in the absence of type information.
In the second scenario, we generate pseudo types. For nodes, we resort to the k-means algorithm to cluster nodes into k clusters based on their features, and treat the cluster ID of each node as its type. Since the number of clusters or types k is unknown, we experiment with different values. For each heterogeneous GNN or translation model, we use "X-k" to denote a variant of model X with k pseudo node types, where X is the model name. For instance, HAN-3 means HAN with three pseudo node types. Note that there are no node features in FB15k-237 and WN18RR; to perform clustering on them, we use the node embeddings obtained by running X-1 first. On the other hand, edge types are derived using the Cartesian product of the node types, resulting in k × k pseudo edge types. Finally, for HAN, which requires metapaths, we further construct pseudo metapaths based on the pseudo node types: for each pseudo node type, we employ all metapaths of length two starting and ending at that type. We also note that some previous works [13,25] can predict missing type information. However, they cannot be used to generate the pseudo types, as they still need partial type information from some nodes and edges as supervision.
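The pseudo-type generation can be sketched as follows, with a plain Lloyd's k-means implemented inline for self-containment (the actual experiments may use any standard k-means) and edge types indexed via the Cartesian product of node types:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster ID per node, used as its
    pseudo node type."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def pseudo_edge_type(type_u, type_v, k):
    """Cartesian product of node types -> one of k*k pseudo edge types."""
    return int(type_u) * k + int(type_v)

# two well-separated toy clusters of node features
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
node_types = kmeans(X, k=2)
e_type = pseudo_edge_type(node_types[0], node_types[2], k=2)
```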
Besides, in the third scenario, we also evaluate the heterogeneous GNNs on the complete HINs with all node/edge types given. This assesses whether explicit type information is useful, and how the baselines with full type information compare to our model.

Performance on LHGs.
In the first scenario, none of the methods have access to the node/edge types. We report the results in Table 2, and make the following observations. First, our proposed LHGNN consistently outperforms all the baselines across different metrics and datasets. The results imply that LHGNN can adeptly capture latent semantics between nodes to assist link prediction, even without any type information.
Second, the performance of the classic GNN baselines is consistently competitive with, or even slightly better than, that of the heterogeneous GNNs. This finding is not surprising: while heterogeneous GNNs can be effective on HINs, their performance heavily depends on high-quality type information, which is absent from LHGs.
Third, translation models are usually worse than GNNs, possibly because they do not take advantage of node features, and lack a message-passing mechanism to fully exploit graph structures.
Performance with pseudo types. In the second scenario, we generate pseudo types for the heterogeneous GNNs and translation models, and report their results in Table 3. We observe different outcomes for the different kinds of baselines.
On one hand, translation models generally benefit from the use of pseudo types. Compared to Table 2 without pseudo types, TransE-k achieves an improvement of 13.2% and 8.5% in MAP and NDCG, respectively, while TransR-k improves the two metrics by 5.2% and 3.6%, respectively (numbers are averaged over the four datasets). This demonstrates that even a very crude type estimation (e.g., by k-means clustering) is useful in capturing latent semantics between nodes. Nevertheless, our model LHGNN still outperforms the translation models using pseudo types.
On the other hand, heterogeneous GNNs only achieve marginal improvements with pseudo types, if not worse performance. A potential reason is that pseudo types are noisy, and the message-passing mechanism of GNNs can propagate local errors caused by the noise and further amplify them across the graph. In contrast, the lack of message passing in translation models makes them less susceptible to noise, and the benefit of pseudo types outweighs the effect of the noise.
Overall, while pseudo types can be useful to some extent, they cannot fully reveal the latent semantics between nodes due to the potential noise. Moreover, they require a predetermined number of pseudo types, which is not needed by our model LHGNN.
Performance on complete HINs. The third scenario is designed to further evaluate the importance of type information, and how LHGNN fares against baselines equipped with full type information on HINs. Specifically, we compare the performance of the heterogeneous GNNs on the two datasets DBLP and OGB-MAG, where type information is fully provided. To enhance the link prediction of the heterogeneous GNN models, we adopt a relation-aware decoder [22] to compute the score for a candidate link (u, v) as h_u^T W_r h_v, where W_r ∈ R^{d_h × d_h} is a learnable matrix for each edge type r ∈ R. We report the results in Table 4.
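The relation-aware bilinear score is a one-line computation; a sketch with our own toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)
d_h, num_edge_types = 8, 4

# one learnable bilinear matrix per (explicit) edge type r
W_r = rng.normal(scale=0.1, size=(num_edge_types, d_h, d_h))

def relation_aware_score(h_u, h_v, r):
    """Bilinear decoder score h_u^T W_r h_v for a candidate link of type r."""
    return float(h_u @ W_r[r] @ h_v)

h_u, h_v = rng.normal(size=d_h), rng.normal(size=d_h)
score = relation_aware_score(h_u, h_v, r=2)
```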
We observe that heterogeneous GNNs with full type information consistently outperform their own results without any type information (cf. Table 2). This is not surprising given the rich semantics expressed by explicit types. Moreover, LHGNN achieves comparable or sometimes better results than the heterogeneous GNNs (cf. Table 2), despite not requiring any explicit type information.

Evaluation of Node Type Classification
To evaluate the expressiveness of the semantic embeddings in capturing type information, we further use them for node type classification on DBLP and OGB-MAG, since ground-truth types are accessible on these two datasets. We perform stratified train/test splitting, i.e., for each node type, we use 60% of the nodes for training and 40% for testing. For each node, we concatenate its primary node embedding and semantic embedding, and feed the result into a logistic regression classifier to predict its node type. We choose five competitive baselines, and use their output node embeddings to train a logistic regression classifier on the same split. We employ macro-F1 score and accuracy as the evaluation metrics, and report the results in Table 5. We observe that LHGNN significantly outperforms the other baselines, with only one exception in accuracy on OGB-MAG. Since the node types are imbalanced (e.g., authors account for 77.8% of all nodes on DBLP, and 64.4% on OGB-MAG), accuracy may be skewed by the majority class and is often not a useful indicator of predictive power. The results demonstrate that the semantic embeddings in our model effectively express type information.
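To illustrate why accuracy can be misleading under such imbalance, consider a toy majority-class predictor on a class distribution similar to DBLP's (roughly 78% of nodes in one class); the metric helpers below are our own minimal implementations, not the paper's evaluation code.

```python
import numpy as np

def accuracy(y, p):
    return float(np.mean(y == p))

def macro_f1(y, p):
    """Unweighted mean of per-class F1, so minority classes count equally."""
    f1s = []
    for c in np.unique(y):
        tp = np.sum((p == c) & (y == c))
        fp = np.sum((p == c) & (y != c))
        fn = np.sum((p != c) & (y == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

# 78 of 100 nodes are Authors (class 0); the predictor always says Author.
y = np.array([0] * 78 + [1] * 12 + [2] * 10)
p = np.zeros_like(y)
acc = accuracy(y, p)  # looks strong: 0.78
f1 = macro_f1(y, p)   # low, since two classes are never predicted
```

A degenerate classifier thus scores 0.78 accuracy while its macro-F1 stays below 0.3, which is why we rely on macro-F1 as the primary metric.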

Model Analyses
Ablation study. To evaluate the contribution of each module in LHGNN, we conduct an ablation study comparing against several degenerate variants: (1) no link encoder: we remove the link encoder; (2) no personalization: we remove the personalization for context aggregation; (3) neither: we remove both modules. We present the results in Fig. 3 and make the following observations. First, the performance of LHGNN drops significantly when the link encoder is removed, showing its importance on LHGs. In other words, the learned latent semantics between nodes are effective for link prediction. Second, without the personalization for context aggregation, the performance also declines. This shows that the context nodes have heterogeneous relationships to the target node, and the semantic path embeddings work as intended to personalize the contexts. Third, without both modules, the model usually achieves the worst performance.
Scalability. We sample five subgraphs from the largest dataset, OGB-MAG, with sizes ranging from 20k to 100k nodes. We present the total training time and number of epochs of LHGNN on these subgraphs in Table 6. As the graph grows 5-fold, the total training time to convergence only increases by a factor of 2, since fewer epochs are generally needed for convergence on larger graphs.
Additional studies. We present additional model studies in Appendices E, F and G.

CONCLUSION
In this paper, we investigated the challenging and unexplored setting of latent heterogeneous graphs (LHGs) for the task of link prediction. Existing approaches on heterogeneous graphs depend on explicit type-based information, and thus do not work well on LHGs. To deal with the absence of types, we proposed LHGNN, a novel model for link prediction on an LHG, based on the idea of semantic embedding at both node and path levels, together with a personalized aggregation of latent heterogeneous contexts for target nodes in a fine-grained manner. Finally, extensive experiments on four benchmark datasets demonstrate the superior performance of LHGNN.

APPENDICES

A Algorithm and Complexity
We outline the model training for LHGNN in Algorithm 1.
In line 1, we initialize the model parameters. In line 3, we sample a batch of triplets from the training data. In lines 4-10, we calculate the layer-wise node representations on the training set. Specifically, for each node in each layer, we calculate the semantic embeddings at the node level in line 6 and at the path level in line 8. Next, we personalize the contexts in line 9 and aggregate them in line 10. In lines 11-12, we compute the loss and update the parameters.
We compare the complexity of one layer and one target node in LHGNN against a standard message-passing GNN. In a standard GNN, the aggregation for one node in the ℓ-th layer has complexity O(d̄ · d_h^(ℓ-1) · d_h^(ℓ)), where d_h^(ℓ) is the output dimension of the ℓ-th layer and d̄ is the node degree. In our model, we first need to compute the node- and path-level semantic embeddings. At the node level, the cost is O(d_h^(ℓ-1) · d_s^(ℓ)) based on Eq. (1), where d_s^(ℓ) is the dimension of the semantic embeddings; at the path level, given a node with η paths of maximum length l_max, the cost is O(η · l_max · d_s^(ℓ)) based on Eq. (2). The computation of the scaling and shifting vectors takes O(η · d_h^(ℓ-1) · d_s^(ℓ)) based on Eqs. (5) and (6), and the personalization of the context embeddings takes O(η · d_h^(ℓ-1)) based on Eq. (4). The aggregation and representation update step takes O(d̄ · d_h^(ℓ-1) · d_h^(ℓ)) based on Eqs. (7) and (8). Furthermore, the cost to sample one path of length l_max is O(l_max), so sampling η paths for a target node in one LHGNN layer incurs an overhead of O(η · l_max), which is negligible compared to the aggregation cost, as l_max is small (less than 5 in our experiments). Therefore, the total complexity for one node in one layer is O(η · l_max · d_s^(ℓ) + d̄ · d_h^(ℓ-1) · d_h^(ℓ) + η · d_h^(ℓ-1) · d_s^(ℓ)). As l_max is a small constant and d_s^(ℓ) ≪ d_h^(ℓ-1), the complexity reduces to O(d̄ · d_h^(ℓ-1) · d_h^(ℓ)). Furthermore, we can limit η, the number of sampled paths from each node, to a constant, in which case our model falls in the same complexity class as a standard GNN.
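As a rough, illustrative sanity check of this analysis, the back-of-envelope cost model below plugs in toy values; the grouping of terms and the numbers are our own assumptions, following the symbols in the text (d̄, d_h, d_s, η, l_max).

```python
def lhgnn_layer_cost(deg, d_h_in, d_h_out, d_s, eta, l_max):
    """Per-node, per-layer operation count for LHGNN (illustrative grouping)."""
    semantic = eta * l_max * d_s        # path-level semantic embeddings
    personalize = eta * d_h_in * d_s    # scaling/shifting + personalization
    aggregate = deg * d_h_in * d_h_out  # message aggregation and update
    return semantic + personalize + aggregate

def gnn_layer_cost(deg, d_h_in, d_h_out):
    """Per-node, per-layer cost of a standard message-passing GNN."""
    return deg * d_h_in * d_h_out

# With small d_s and l_max, the aggregation term dominates, so LHGNN stays
# within a small constant factor of a standard GNN layer.
c1 = lhgnn_layer_cost(deg=20, d_h_in=64, d_h_out=64, d_s=10, eta=50, l_max=4)
c2 = gnn_layer_cost(deg=20, d_h_in=64, d_h_out=64)
```

Under these toy settings, the semantic and personalization terms add well under one extra aggregation's worth of work.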

B Details of Baselines
We provide detailed descriptions for the baselines.
• Classic GNNs: GCN [18] aggregates information for the target node by applying mean-pooling over its neighbors. GAT [34] utilizes self-attention to assign different weights to the neighbors of the target node during aggregation. GraphSAGE [11] concatenates the target node with the aggregated information from its neighbors to produce the node embedding. These models treat all nodes and edges as a uniform type and do not attempt to distinguish them.
• Heterogeneous GNNs: HAN [35] uses handcrafted metapaths to decompose an HIN into multiple homogeneous graphs, one for each metapath, and then employs hierarchical attention to learn both node-level and semantic-level importance for aggregation. HGT [16] applies the transformer model, using node- and edge-type parameters to capture the heterogeneity. HGN [22] extends GAT by employing node type information in the calculation of attention scores. These three models require type-based information in their architectures. Given an LHG, we either assume a single node/edge type or employ pseudo types for these methods, as described in the main paper.
• Translation models: TransE [3] models the relation between entities as a translation in the same embedding space. TransR [19] maps entity embeddings into a relation-wise space before the translation. These models require the relation type (i.e., edge type) to be known. Similarly, we assume a single edge type or employ pseudo types for them.
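To make the translation-model scoring concrete, here is a minimal sketch of TransE- and TransR-style scoring functions with toy values; the variable names and numbers are ours.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE: score of triplet (h, r, t); closer to 0 means more plausible."""
    return -float(np.linalg.norm(h + r - t))

def transr_score(h, r, t, M_r):
    """TransR: project entities into the relation space via M_r before translating."""
    return -float(np.linalg.norm(M_r @ h + r - M_r @ t))

h = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])       # toy relation vector
t_good = np.array([1.0, 1.0])  # h + r == t_good: a perfect translation
t_bad = np.array([5.0, 5.0])
M = np.eye(2)                  # identity projection for illustration
```

With an identity projection, TransR coincides with TransE; the relation-specific M_r is what lets TransR separate entities differently per edge type, which is why both models need the edge type (or a pseudo type) to be known.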

C Environment
All experiments are conducted on a workstation with a 12-core CPU, 128GB RAM, and two RTX-A5000 GPUs. We implemented the proposed LHGNN using PyTorch 1.10 and Python 3.8 on Ubuntu 20.04.
For our model, we also employ a two-layer architecture with L2 normalization. We randomly sample 50 paths starting from each target node. We set the dimension of the semantic embeddings to 10, the decay ratio to 0.1, the weight of the scaling and shifting constraint to 0.0001, and the margin to 0.2.

E Number of Layers
We investigate the impact of the number of layers in representative baseline models. The results in Table VII show that adding more layers does not necessarily give better results. Overall, these baselines remain much worse than our proposed LHGNN.

F Impact of Model Size
We investigate the impact of model size (i.e., the number of learnable parameters in a model) on empirical performance. Specifically, we select several representative GNNs from the baselines, and vary the size of each GNN by increasing its hidden dimension. For instance, GCN-32 indicates that the hidden layer has 32 neurons. Table VIII shows the link prediction performance of the GNNs with varying model sizes. Overall, larger models with the same GNN architecture achieve only slight improvements, and LHGNN continues to outperform them despite having a relatively small model. The results imply that the effectiveness of LHGNN comes from its architectural design rather than from stacking more parameters.

G Parameter Sensitivity
To study the impact of model parameters, we showcase two important parameters: the maximum length for path sampling, l_max, and the weight of the scaling and shifting constraint. We present the results in Fig. IV. For sparse datasets with a low average degree, such as WN18RR and DBLP, using a larger maximum path length can generally improve the performance, as it exploits more contextual structures around the target node. For the other two datasets, the performance is generally less affected as the maximum path length increases. For the constraint weight, LHGNN generally achieves the best performance in the interval [0.0001, 0.001] across the four datasets, demonstrating the necessity of this constraint.

H Data Ethics Statement
To evaluate the efficacy of this work, we conducted experiments that only use publicly available datasets, in accordance with their usage terms and conditions, if any. We further declare that no personally identifiable information was used, and that no human or animal subjects were involved in this research.

Figure 1 :
Figure 1: Illustration of our problem and approach. (a) Comparison of HIN and LHG. (b) Key insights of our approach.

and b_*^(ℓ) ∈ R^{d_h^(ℓ-1)} are learnable parameters, and LeakyReLU(·) is the activation function. Note that the parameters of the transformation function in layer ℓ boil down to the parameters of the two fully connected layers, i.e., their weight matrices and bias vectors. Context aggregation. Next, we aggregate the personalized messages from the latent heterogeneous contexts into c^(ℓ) ∈ R^{d_h^(ℓ-1)}, the aggregated context embedding for the target node:

Figure IV :
Figure IV: Impact of parameters.

Table 1 :
Summary of Datasets.

Table 2 :
Evaluation of link prediction on LHGs. Best is bolded and runner-up underlined; OOM means out-of-memory error.

Table 3 :
Evaluation of link prediction on LHGs with pseudo types for heterogeneous GNNs and translation models.


Table 4 :
Evaluation of link prediction on HINs with full access to node/edge types for heterogeneous GNNs. Percentages in parentheses indicate the improvement over their performance on LHGs (cf. Table 2).

Table 5 :
Evaluation of node type classification on LHGs.
A potential reason is that the node- and path-level semantic embeddings in LHGNN can capture latent semantics at a finer granularity, whereas the explicit types on a HIN may be coarse-grained. For example, a typical academic graph has node types such as Author and Paper, but no finer-grained types like Student/Faculty Author or Research/Application Paper.
Algorithm 1 Model Training for LHGNN. Input: a latent heterogeneous graph G = (V, E), training triplets, and a set of random walk paths for each node. Output: model parameters Θ.

Table VII :
Impact of different numbers of layers. In each column, the best is bolded and the runner-up is underlined.

Table VIII :
Impact of model size. |Θ| denotes the number of learnable parameters in a model.