Contrastive Meta-Learning for Few-shot Node Classification

Few-shot node classification, which aims to predict labels for nodes on graphs with only limited labeled nodes as references, is of great significance in real-world graph mining tasks. Particularly, in this paper, we refer to the task of classifying nodes in classes with a few labeled nodes as the few-shot node classification problem. To tackle such a label shortage issue, existing works generally leverage the meta-learning framework, which utilizes a number of episodes to extract transferable knowledge from classes with abundant labeled nodes and generalizes the knowledge to other classes with limited labeled nodes. In essence, the primary aim of few-shot node classification is to learn node embeddings that are generalizable across different classes. To accomplish this, the GNN encoder must be able to distinguish node embeddings between different classes, while also aligning embeddings for nodes in the same class. Thus, in this work, we propose to consider both the intra-class and inter-class generalizability of the model. We create a novel contrastive meta-learning framework on graphs, named COSMIC, with two key designs. First, we propose to enhance the intra-class generalizability by involving a contrastive two-step optimization in each episode to explicitly align node embeddings in the same classes. Second, we strengthen the inter-class generalizability by generating hard node classes via a novel similarity-sensitive mix-up strategy. Extensive experiments on few-shot node classification datasets verify the superiority of our framework over state-of-the-art baselines. Our code is provided at https://github.com/SongW-SW/COSMIC.


INTRODUCTION
The task of node classification aims at learning a model to assign labels to unlabeled nodes on graphs [5,15,21]. In fact, many real-world applications can be formulated as node classification tasks [24,31]. For example, in social media networks [6,26] such as Facebook, where each node represents a user and edges represent friendship relations, a classification model is tasked to predict the preferences and interests of users based on their profiles and user relations. Recently, Graph Neural Networks (GNNs) [35,49] have shown remarkable advantages in learning node representations and predicting node labels based on the learned representations. Nevertheless, GNNs generally require a considerable number of labeled nodes to ensure the quality of the learned node representations [57]. That is, the performance of GNNs severely degrades when the number of labeled nodes is limited. In practice, it often remains difficult to acquire sufficient labeled nodes for each class [7]. For example, GNNs are widely used to classify users in social networks according to various topics [49], yet usually only a few users are known to be associated with a newly formed topic, so the trained GNNs can easily encounter a significant performance drop. Hence, there is a surge of research interest in performing node classification with only limited labeled nodes as references, known as few-shot node classification.
To tackle the few-shot node classification problem, existing works have demonstrated the effectiveness of the meta-learning strategy [5,15,24]. In general, these works first extract transferable knowledge from classes with abundant labeled nodes (i.e., meta-training classes). Then the learned knowledge is generalized to other classes with limited labeled nodes (i.e., meta-test classes). Particularly, these works introduce the concept of an episode [9] for the training phase to episodically emulate each target meta-test task in the evaluation phase. More specifically, in each episode, a meta-task is sampled on the graph to train the GNN model: a few labeled nodes (i.e., the support set) are sampled from the meta-training classes as references for classifying the test nodes (i.e., the query set) sampled from the same classes. By training on multiple episodes, the GNN model learns a shared node embedding space across meta-training and meta-test classes [7,15,24]. In essence, the key to improving the performance of GNNs on meta-test classes with limited labeled nodes is to learn generalizable node embeddings [7,24,32]. In this work, we propose to learn generalizable node embeddings by explicitly considering two facets: intra-class and inter-class generalizability. In particular, intra-class generalizability refers to the ability of a model to align embeddings for nodes in the same class, while inter-class generalizability measures its ability to distinguish node embeddings among different classes. Although these two properties are crucial for learning node embeddings, it is non-trivial to guarantee them due to the distinct challenges posed by complex graph structures.
Concretely, there are two key challenges in learning generalizable node embeddings for few-shot node classification. First, achieving intra-class generalizability is difficult on graphs. This involves learning similar embeddings for nodes within the same class, which is crucial for classifying meta-test classes that are unseen during meta-training. However, existing methods mainly focus on distinguishing node labels in each episode and do not encourage learning similar intra-class node representations. Additionally, neighboring nodes contain crucial contextual structural knowledge for learning similar intra-class node embeddings, but the existing strategy of sampling individual nodes in each episode fails to capture such structural information, resulting in limited intra-class generalizability. Second, inter-class generalizability is not guaranteed on graphs. In few-shot node classification, the model must be able to classify nodes in a variety of unseen meta-test classes. However, the meta-training classes may be insufficient in number or too easy to classify, resulting in a lack of capability to classify various unseen classes and thus low inter-class generalizability.
To tackle the aforementioned challenges regarding the generalizability of learned node embeddings, we propose COSMIC, a novel contrastive meta-learning framework for few-shot node classification. Specifically, our framework tackles the challenges with two essential designs. (1) To enhance the intra-class generalizability of the GNN model, we propose to incorporate graph contrastive learning in each episode. As a prevailing technique in graph representation learning, graph contrastive learning has proven effective in achieving comprehensive node representations [13,53]. Inspired by those works, we propose a two-step optimization in each episode. In the first step, we conduct graph contrastive learning on nodes in the support set to update the GNN model, where we represent each node by a subgraph to incorporate structural knowledge. In the second step, we leverage the updated GNN to classify nodes in the query set and compute the classification loss for further updating the GNN model. In this way, the GNN model is forced to learn similar intra-class node embeddings via the proposed contrastive meta-learning strategy and is further updated on the query set for better intra-class generalizability. (2) To improve inter-class generalizability, we propose a novel similarity-sensitive mix-up strategy to generate additional classes in each episode. In particular, inter-class generalizability, i.e., the capability of distinguishing node embeddings in different meta-test classes, is difficult to learn when the meta-training classes are insufficient or not difficult enough. Thus, we utilize the classes in each episode to generate new hard classes via mix-up, where the mixing ratio is based on the similarity between nodes in different classes. The generated classes are then incorporated into the graph contrastive learning step. In this way, the model is forced to distinguish additional difficult classes, which enhances inter-class generalizability.

In summary, our contributions are:
• We improve the meta-learning strategy for few-shot node classification from the perspective of the intra-class and inter-class generalizability of the learned node embeddings.
• We develop a novel contrastive meta-learning framework that (1) incorporates a two-step optimization based on the proposed contrastive learning strategy to improve intra-class generalizability; (2) leverages the proposed similarity-sensitive mix-up strategy to generate hard classes for enhancing inter-class generalizability.
• We conduct extensive experiments on four benchmark node classification datasets under the few-shot scenario and validate the superiority of our proposed framework.

PRELIMINARIES

Problem Statement
In this section, we provide the formal definition of the few-shot node classification problem. We first denote an input attributed graph as G = (V, E, X), where V is the set of nodes and E is the set of edges. X ∈ R^{|V|×d} denotes the node feature matrix, where d is the feature dimension. Furthermore, the entire set of node classes is denoted as C, which can be divided into two disjoint sets: C_train and C_test, i.e., the sets of meta-training classes and meta-test classes, respectively. It is noteworthy that the number of labeled nodes in C_train is sufficient for meta-training, while it is generally small in C_test [7,15,24,57].
In this way, the studied problem of few-shot node classification is formulated as follows:

Definition 1. Few-shot Node Classification: Given an attributed graph G = (V, E, X) and a meta-task T = {S, Q} sampled from C_test, our goal is to develop a learning model such that, after meta-training on labeled nodes in C_train, the model can accurately predict labels for the nodes in the query set Q, where the only available reference is the limited labeled nodes in the support set S.
Moreover, under the N-way K-shot setting, the support set S consists of exactly K labeled nodes for each of the N classes from C_test, and the query set Q is also sampled from the same N classes. In this scenario, the problem is called an N-way K-shot node classification problem. Essentially, the objective of few-shot node classification is to learn a model that can generalize well to meta-test classes in C_test with only limited labeled nodes as references.
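As a concrete illustration, an N-way K-shot meta-task could be sampled as follows. This is a minimal sketch; the function and variable names are our own, not part of the paper:

```python
import random
from collections import defaultdict

def sample_episode(node_labels, candidate_classes, n_way, k_shot, n_query, rng=random):
    """Sample one N-way K-shot meta-task: K support nodes and n_query
    query nodes for each of N classes drawn from candidate_classes."""
    by_class = defaultdict(list)
    for node, label in node_labels.items():
        by_class[label].append(node)
    support, query = [], []
    for c in rng.sample(candidate_classes, n_way):
        picked = rng.sample(sorted(by_class[c]), k_shot + n_query)
        support += [(v, c) for v in picked[:k_shot]]
        query += [(v, c) for v in picked[k_shot:]]
    return support, query

# Toy setting: 40 nodes over 4 classes; a 2-way 3-shot task with 5 query nodes per class
node_labels = {v: v % 4 for v in range(40)}
support, query = sample_episode(node_labels, [0, 1, 2, 3], n_way=2, k_shot=3, n_query=5)
```

During meta-training the classes would be drawn from C_train, and during meta-test from C_test, with the same sampling routine.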

Episodic Training
In practice, we adopt the prevalent episodic training framework for the meta-training process, which has proven effective in various fields, such as few-shot image classification and few-shot knowledge completion [7,9,28,37,50]. Particularly, the meta-training process is conducted over a certain number of episodes, each of which contains a meta-training task emulating the structure of the meta-test tasks. The only difference is that the meta-training tasks are sampled from C_train, while the meta-test tasks are sampled from C_test. In this regard, the model can keep consistency between meta-training and meta-test. Moreover, many works [7,9,15,50] have demonstrated the benefits of such an emulation-based learning strategy for better classification performance, especially under few-shot settings. Our proposed framework generally follows this design and inherits its merits in preserving classification performance with only scarce labeled nodes.

PROPOSED FRAMEWORK
In this section, we introduce the overall structure of our proposed framework COSMIC in detail. As illustrated in Fig. 1, COSMIC consists of two key designs: a contrastive meta-learning strategy and a similarity-sensitive mix-up strategy.

Contrastive Meta-learning.
In each episode, a meta-task T = {S, Q} is sampled, where the support set S = {(v_i, y_i)} and the query set Q = {(q_i, y'_i)}. Here v_i (or q_i) is a node in V, and y_i (or y'_i) is the corresponding label. Specifically, given the meta-task T and its support set S (|S| = N × K) on a graph G = (V, E, X), we aim to conduct contrastive learning while incorporating supervision information to enhance intra-class generalizability. We denote the i-th node in the j-th class in S as s_i^j and its corresponding encoded representation as h_i^j (i = 1, 2, ..., K and j = 1, 2, ..., N), which is learned by GNN_θ. To enhance the intra-class generalizability, we propose to leverage the concept of mutual information (MI) from Deep InfoMax [1]:

  max_θ Σ_{j=1}^{N} Σ_{i=1}^{K} ( MI(s_i^j, S_j) − Σ_{j'≠j} MI(s_i^j, S_{j'}) ),

where S_j is the node set for the j-th class in S, i.e., S_j = {s_i^j | i = 1, 2, ..., K}. Moreover, MI(s_i^j, S_j) denotes the mutual information between node s_i^j and S_j. That being said, in the above objective, we aim to maximize the MI between nodes in the same class while minimizing the MI between nodes in different classes. To be specific, the MI term is implemented based on the following InfoNCE-style formula:

  L_{i,j} = − (1/M) Σ_{m=1}^{M} Σ_{s ∈ S_j \ {s_i^j}} log ( exp(g_m(s_i^j)^T g_m(s)/τ) / Σ_{s' ∈ S \ {s_i^j}} exp(g_m(s_i^j)^T g_m(s')/τ) ),

where M is the number of views for each node, and g_m(·) is the m-th learning function (m = 1, 2, ..., M). Specifically, g_m(·) first transforms the input node into a view with a specific function and then learns a representation from the transformed view, where the output is a d-dimensional vector. A more detailed description of how to generate views by transforming nodes is given in the next subsection. Moreover, L_{i,j} is the loss for s_i^j in a meta-task T, and τ ∈ R^+ is a scalar temperature parameter. Here the negative samples include all other nodes in the support set, while the positive samples are the other nodes that share the same class as s_i^j. In each meta-task, the loss is computed over all nodes in the support set S, thus forming the proposed contrastive meta-learning loss L_cml = Σ_{j=1}^{N} Σ_{i=1}^{K} L_{i,j}.

We conduct contrastive meta-learning over E episodes, where each episode consists of a meta-training task. With the objective L_cml, we first perform one gradient descent step on GNN_θ to fast adapt it to a meta-training task T_t:

  θ'^(t) = θ^(t) − α ∇_θ L_cml(S_t; θ^(t)),

where S_t is the support set of the meta-training task T_t sampled in episode t, and t ∈ {1, 2, ..., E}. L_cml(S_t; θ^(t)) denotes the contrastive meta-learning loss calculated on S_t with the GNN parameters θ^(t), and α is the learning rate for L_cml. For the second step of our contrastive meta-learning, we incorporate the cross-entropy loss L_ce on the query set via another step of gradient descent:

  θ^(t+1) = θ'^(t) − β ∇_θ L_ce(Q_t; θ'^(t)),

where Q_t denotes the query set of the meta-task T_t sampled in episode t, L_ce denotes the cross-entropy loss calculated on the query set Q_t with the updated GNN parameters θ'^(t), and β is the corresponding learning rate. It is noteworthy that a fully-connected layer is used during meta-training for the cross-entropy loss. As a result, through this two-step optimization, we have conducted one episode of contrastive meta-learning to obtain the updated GNN parameters θ^(t+1). After training on a total of E episodes, we obtain the final trained GNN_θ with parameters θ^(E). It is noteworthy that, different from the supervised contrastive loss [19], our contrastive meta-learning loss restricts the classification range to the N classes in each meta-task, and the updated GNN is then used for classification on the query set. In such a manner, the model is trained to fast adapt to various tasks with different classes, thus gaining intra-class generalizability when it is enforced to classify unseen nodes in the query set. In addition, our design also differs from infoPatch [23], which contrasts support samples against query samples of different views, since our contrastive meta-learning loss is specified for the support set.
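To make the loss concrete, the following sketch implements a single-view (M = 1) version of the supervised InfoNCE objective described above in plain Python, where same-class support nodes are positives and all other support nodes are negatives. The function names are ours, and `tau` is the temperature:

```python
import math

def contrastive_meta_loss(embs, labels, tau=0.5):
    """Single-view supervised InfoNCE loss over support embeddings:
    same-class nodes are positives, all other support nodes negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    total, n_pairs = 0.0, 0
    for i, h_i in enumerate(embs):
        # Denominator: all other support nodes act as candidates/negatives
        denom = sum(math.exp(dot(h_i, embs[k]) / tau)
                    for k in range(len(embs)) if k != i)
        for j, h_j in enumerate(embs):
            if j != i and labels[j] == labels[i]:  # positive pair
                total -= math.log(math.exp(dot(h_i, h_j) / tau) / denom)
                n_pairs += 1
    return total / max(n_pairs, 1)

# Embeddings aligned within classes incur a lower loss than misaligned ones
labels = [0, 0, 1, 1]
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

Minimizing this loss pulls intra-class embeddings together and pushes inter-class embeddings apart, which is exactly the intra-class alignment the two-step optimization encourages.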

Subgraph Construction.
In this part, we introduce the functions used to generate different views (i.e., g_m(·)) for our contrastive meta-learning framework. To incorporate more structural context into each meta-task, we propose to represent each node by a subgraph extracted around it. This is because nodes are often more correlated with their regional neighborhoods than with long-distance nodes [48,58]. Nevertheless, directly sampling the entire neighborhood can introduce redundant information [15]. Therefore, we propose to selectively sample neighborhoods as subgraphs based on the Personalized PageRank (PPR) algorithm [16]. Such a subgraph sampling scheme has proven effective for many different graph learning tasks [17,56], and we validate its effectiveness for few-shot node classification in this paper. As a result, the sampled subgraph will include the most important nodes with respect to the central node, which provides context information for each meta-task while substantially reducing irrelevant information in neighborhoods.
Concretely, given the adjacency matrix A of graph G, we compute the PPR matrix

  S_PPR = c (I − (1 − c) A D^{-1})^{-1},

where I is the identity matrix, D is the diagonal degree matrix, and c ∈ (0, 1] is the teleport (restart) probability. In this way, the entry S_{i,j} of S_PPR can represent the importance score between nodes v_i and v_j. To select the most correlated nodes for an arbitrary node v_i, we extract the nodes that bear the largest importance scores to v_i as follows:

  Γ(v_i) = {v_j | S_{i,j} ≥ th_i},

where Γ(v_i) is the extracted node set for the centroid node v_i, and th_i is the importance score threshold for other nodes to be selected. Specifically, to ensure a consistent size of extracted subgraphs, we define th_i as the K_s-th largest entry of S_{i,:} (with v_i itself excluded), where K_s is a hyperparameter. In other words, Γ(v_i) consists of the top-K_s most important nodes for v_i on graph G. In this manner, the extracted subgraph node set is V_i = Γ(v_i) ∪ {v_i}, and the original edges among these nodes are kept, i.e., the edge set of the subgraph is E_i = {(v_p, v_q) ∈ E | v_p, v_q ∈ V_i}. Note that this strategy can extract both the neighboring nodes of v_i and other distant yet important nodes, which incorporates more contextual knowledge.
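The PPR-based selection can be sketched with power iteration on a small graph. This is a simplified illustration of the idea (iterative approximation instead of the matrix inverse); the function names are ours:

```python
def ppr_scores(adj, src, c=0.15, iters=60):
    """Personalized PageRank scores from node `src` by power iteration:
    p <- c * e_src + (1 - c) * A D^{-1} p."""
    n = len(adj)
    deg = [max(sum(row), 1) for row in adj]
    p = [0.0] * n
    p[src] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        for u in range(n):
            for v in range(n):
                if adj[u][v]:
                    nxt[v] += (1 - c) * p[u] / deg[u]
        nxt[src] += c  # teleport mass back to the source node
        p = nxt
    return p

def extract_subgraph_nodes(adj, src, k_s):
    """Node set of the subgraph: src plus its top-k_s most important nodes."""
    scores = ppr_scores(adj, src)
    others = sorted((v for v in range(len(adj)) if v != src),
                    key=lambda v: scores[v], reverse=True)
    return sorted([src] + others[:k_s])

# Path graph 0-1-2-3-4: from node 0, importance decays with distance
adj = [[0, 1, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [0, 1, 0, 1, 0],
       [0, 0, 1, 0, 1],
       [0, 0, 0, 1, 0]]
nodes = extract_subgraph_nodes(adj, 0, k_s=2)
```

The induced edges among the selected nodes would then form the subgraph fed to the GNN encoder.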
Based on the proposed subgraph sampling strategy, for the nodes in the support set S (|S| = N × K) of meta-task T, we can accordingly extract N × K subgraphs, denoted as G_S = {G_i^j | i = 1, 2, ..., K, j = 1, 2, ..., N}, where G_i^j = (V_i^j, E_i^j, X_i^j) is the subgraph extracted for the i-th node in the j-th class.

Similarity-sensitive Mix-up
Although we have enhanced the intra-class generalizability of GNN models via our contrastive meta-learning framework, the lack of inter-class generalizability can still lead to suboptimal performance on meta-test classes. In particular, we propose a similarity-sensitive mix-up strategy to generate additional classes in each meta-task to compensate for the potential lack of sufficient or difficult meta-training classes. The generated classes are based on mixed subgraphs and are incorporated into our contrastive meta-learning loss. In this way, the model is forced to distinguish between both the N classes in each meta-task and the generated (unseen) classes, thus promoting inter-class generalizability. It is worth mentioning that in our framework, the extracted subgraphs all maintain the same size of K_s, which provides further convenience for performing subgraph-level mix-up.
Specifically, for each of the N × K nodes in the support set S, we propose to generate a corresponding mixed subgraph. Here we denote the set of mixed subgraphs as G̃ = {G̃_i^j | i = 1, 2, ..., K, j = 1, 2, ..., N}, where G̃_i^j is the mixed subgraph generated for the i-th node in the j-th class. For each node s_i^j in S, we first randomly sample a node v_u from the input graph G and generate its subgraph G_u. The mixed subgraph is then obtained by linearly interpolating the node features of the two subgraphs:

  X̃_i^j = λ X_i^j + (1 − λ) X_u,

where the mixing ratio λ ∈ (0, 1) is sampled from a Beta distribution. To adaptively control the mix-up procedure to generate harder instances, we further design a similarity-sensitive strategy to decide the value of the parameter α_i^j of the Beta distribution. Generally, if the sampled node v_u is dissimilar to node s_i^j, we should increase the value of α_i^j so that a smaller λ is more likely to be sampled. Then, based on Eq. (13), the mixed subgraph will absorb more information from a different structure, i.e., the subgraph generated from v_u. Particularly, we propose to adaptively adjust the α value of the Beta distribution based on the Bhattacharyya distance [2] between node s_i^j and node v_u regarding their importance scores obtained from the Personalized PageRank algorithm:

  D_B(p_i^j, p_u) = − log Σ_k √( p_i^j[k] · p_u[k] ),

where p_i^j and p_u denote the normalized PPR importance score distributions of s_i^j and v_u, respectively. Intuitively, the Bhattacharyya distance measures the contextual relation between two nodes in the space formed by the importance scores. In this way, the model is forced to distinguish these additional mixed classes, which are potentially more difficult than the original N classes.
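The similarity-sensitive mixing can be sketched as follows. The `Beta(1, 1 + distance)` parameterization is our own illustrative choice (the paper only requires that a larger distance make a small λ more likely), and the function names are ours:

```python
import math
import random

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete importance-score
    distributions (e.g., normalized PPR rows)."""
    coeff = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return -math.log(max(coeff, 1e-12))

def similarity_sensitive_mix(x_a, x_b, dist, rng=random):
    """Mix two flattened subgraph feature vectors. A larger distance skews
    the Beta distribution toward small lambda, so the mixed instance
    absorbs more of the dissimilar subgraph x_b (a harder class)."""
    lam = rng.betavariate(1.0, 1.0 + dist)  # illustrative parameterization
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]

p_a = [0.7, 0.2, 0.1]   # importance profile around the support node
p_b = [0.1, 0.2, 0.7]   # importance profile around the randomly sampled node
d = bhattacharyya_distance(p_a, p_b)
mixed = similarity_sensitive_mix([1.0, 0.0], [0.0, 1.0], d)
```

Identical profiles give distance 0 (a near-uniform mix), while dissimilar profiles give a positive distance that biases λ downward.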

Meta-test
During meta-test, we leverage the trained encoder GNN_θ to learn a representation for each node based on its extracted subgraph. Specifically, for a given meta-task T = {S, Q}, a new simple classifier f_φ (implemented as a fully-connected layer parameterized by φ) is trained on the support set S based on the cross-entropy loss:

  L_ce(φ) = − Σ_{i=1}^{|S|} Σ_{j=1}^{N} Y_{i,j} log p_{i,j},

where p_i ∈ R^N is the probability that the i-th support node v_i in S belongs to each of the N classes in meta-task T. Here |S| = N × K under the N-way K-shot setting, and θ^(E) denotes the parameters of GNN_θ after E episodes of meta-training. Moreover, Y_{i,j} = 1 if the i-th node belongs to the j-th class in T, and Y_{i,j} = 0 otherwise; p_{i,j} denotes the j-th element of p_i. The probability is computed from the representation z_i of the subgraph G_i = (V_i, E_i, X_i) extracted from v_i based on Eq. (8), which is learned by the GNN encoder:

  z_i = GNN_θ(G_i),  p_i = softmax(f_φ(z_i)).

It is notable that while training the classifier, the parameters of the encoder GNN_θ are fixed to ensure faster convergence. Moreover, we additionally introduce a weight-decay regularization term R(φ) = ∥φ∥²/2. In consequence, we can achieve a classifier that is specified for the meta-task T based on the following objective:

  φ* = argmin_φ L_ce(φ) + R(φ),

where φ* denotes the optimal parameters for the classifier (in practice, we choose logistic regression as the classifier). Then we conduct classification for the nodes in the query set Q with the learned subgraph representations of these nodes. The label of the i-th query node q_i is obtained by ŷ_i = argmax_j p̂_{i,j}, where p̂_{i,j} is the j-th element of p̂_i, which is learned in a similar way as Eq. (20) and Eq. (19) based on the extracted subgraph of q_i. The process of our framework is described in Algorithm 1 and illustrated in Fig. 1.

Algorithm 1 Learning Process of the Proposed Framework.
Input: A graph G = (V, E, X), a meta-test task T_test = {S, Q}, meta-training classes C_train, the number of meta-training episodes E, the number of classes in each meta-task N, and the number of labeled nodes per class K.
Output: Predicted labels for the query nodes in Q of T_test.
// Meta-training phase
1: t ← 0;
2: while t < E do
3:   Sample a meta-training task T_t = {S_t, Q_t} with N classes from C_train;
4:   Extract a subgraph for each sampled node and generate mixed subgraphs via similarity-sensitive mix-up;
5:   Compute the representations for all subgraphs and nodes with the encoder GNN_θ;
6:   Compute the contrastive meta-learning loss for each node based on the learned representations according to Eq. (2);
7:   Update the parameters of GNN_θ with the contrastive meta-learning loss on nodes in S_t by one gradient descent step based on Eq. (5);
8:   Update the parameters of GNN_θ with the cross-entropy loss on nodes in Q_t by one gradient descent step based on Eq. (6);
9:   t ← t + 1;
10: end while
// Meta-test phase
11: Train the classifier f_φ on the support set S of T_test with GNN_θ fixed;
12: Predict labels for the query nodes in Q.
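The meta-test classifier can be sketched as a small softmax head trained with weight decay on frozen embeddings. This is a minimal stand-in for the cross-entropy objective with R(φ) = ∥φ∥²/2; the names and hyperparameters are our own:

```python
import math

def train_classifier(embs, ys, n_class, lr=0.5, wd=0.01, steps=200):
    """Fit a softmax classifier on frozen support embeddings by gradient
    descent on cross-entropy plus weight decay (||W||^2 / 2)."""
    d = len(embs[0])
    W = [[0.0] * d for _ in range(n_class)]
    for _ in range(steps):
        grads = [[0.0] * d for _ in range(n_class)]
        for x, y in zip(embs, ys):
            logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in W]
            m = max(logits)
            exps = [math.exp(l - m) for l in logits]
            z = sum(exps)
            for c in range(n_class):
                err = exps[c] / z - (1.0 if c == y else 0.0)
                for i in range(d):
                    grads[c][i] += err * x[i]
        for c in range(n_class):
            for i in range(d):
                W[c][i] -= lr * (grads[c][i] / len(embs) + wd * W[c][i])
    return W

def predict(W, x):
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in W]
    return max(range(len(W)), key=lambda c: logits[c])

# Frozen 2-D embeddings for a toy 2-way support set
support = [([1.0, 0.1], 0), ([0.9, 0.0], 0), ([0.1, 1.0], 1), ([0.0, 0.9], 1)]
W = train_classifier([x for x, _ in support], [y for _, y in support], n_class=2)
```

Query nodes would then be labeled by `predict` applied to their frozen subgraph representations, mirroring the argmax rule above.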

EXPERIMENTAL EVALUATIONS
To empirically evaluate our framework COSMIC, we conduct experiments on four prevalent real-world node classification datasets under different few-shot settings: CoraFull [3], ogbn-arxiv [14], Coauthor-CS [27], and DBLP [33]. Their statistics and class splitting policies are provided in Table 5, and more detailed descriptions are included in Appendix B.2.

Overall Evaluation Results
In this section, we compare the overall results of our framework with all baseline methods on few-shot node classification. The results are presented in Table 1. Specifically, to evaluate our framework under different few-shot settings, we conduct experiments with different values of N and K under the N-way K-shot setting. For the evaluation metrics, following the common practice [32], we report the averaged classification accuracy and the 95% confidence interval over ten repetitions for a fair comparison. From the results, we obtain the following observations: (1) COSMIC consistently outperforms the other baselines on all datasets with different values of N and K. The results strongly validate the superiority of our contrastive meta-learning framework for few-shot node classification.
(2) The performance of all methods significantly degrades when a larger value of N is presented (i.e., more classes in each meta-task), since a larger class set increases the variety of classes and hence the difficulty of classification. Nevertheless, our framework encounters a less significant performance drop compared with the other baselines. This is because our contrastive meta-learning strategy is capable of handling various classes by leveraging both supervision and structural information in each episode. (3) With a larger value of K (i.e., more support nodes in each class), all methods exhibit decent performance improvements. Moreover, on Coauthor-CS, compared with the second-best baseline results (underlined), our framework achieves more significant improvements due to the enhancement of inter-class generalizability, which is crucial on Coauthor-CS with its noticeably fewer meta-training classes than the other datasets. (4) The confidence interval is generally larger in the setting with K = 1, i.e., the 1-shot setting. This is because each meta-task only consists of one support node for each of the N classes, making the decision boundary easy to overfit, so the results inevitably vary more across repetitions.
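For reference, the reported numbers can be computed as the mean accuracy with a normal-approximation 95% confidence interval over the repeated runs. This is the standard recipe, not code from the paper, and the accuracies below are hypothetical:

```python
import math

def mean_and_ci95(accuracies):
    """Mean accuracy and the half-width of a 95% confidence interval
    (normal approximation with the sample standard deviation)."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    var = sum((a - mean) ** 2 for a in accuracies) / (n - 1)
    return mean, 1.96 * math.sqrt(var / n)

runs = [0.71, 0.69, 0.73, 0.70, 0.72]   # hypothetical per-run accuracies
mean, ci = mean_and_ci95(runs)          # report as: mean ± ci
```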

Ablation Study
In this section, we conduct an ablation study to evaluate the effectiveness of the three modules in our framework COSMIC. In particular, we compare COSMIC with its three degenerate variants: (1) COSMIC without contrastive meta-learning (referred to as COSMIC w/o C). In this variant, we remove the contrastive learning loss so that only the cross-entropy loss is used for model training.
(2) COSMIC without using subgraphs (referred to as COSMIC w/o S). In this variant, we only leverage node representations learned from the original graph, so the model cannot incorporate the contextual information of each node into meta-tasks. (3) COSMIC without similarity-sensitive mix-up (referred to as COSMIC w/o M). In this variant, the additional mixed subgraphs are not incorporated, so the model cannot effectively achieve inter-class generalizability.

Fig. 2 presents the results of our ablation study on the ogbn-arxiv and DBLP datasets (similar results are observed on the other datasets). Based on the results, we reach the following findings: (1) In general, our proposed framework COSMIC outperforms all three variants, which validates the effectiveness and necessity of the three key components. Moreover, the advantage of the proposed framework becomes more significant on harder few-shot node classification tasks (i.e., a larger value of N or a smaller value of K). This demonstrates the robustness of our framework regarding different N-way K-shot settings. (2) The variant without the contrastive meta-learning loss (i.e., COSMIC w/o C) generally exhibits inferior performance. This validates that our contrastive meta-learning strategy, which integrates a contrastive learning objective into each episode, succeeds in enhancing the generalizability of GNN models to meta-test classes. (3) The other two modules are also crucial for our framework. More specifically, when the value of K decreases (i.e., fewer labeled nodes for each class in meta-tasks), the performance improvement brought by the subgraphs is more significant. This verifies that incorporating more contextual information into each meta-task can further compensate for the scarcity of labeled nodes. We further validate these points through experiments in the following section.

Embedding Analysis
To explicitly illustrate the advantage of the proposed framework and the effectiveness of each designed component, in this subsection, we analyze the quality of the representations learned for nodes in meta-test classes under different training strategies. Specifically, Fig. 3 visualizes the node embeddings learned by COSMIC compared to its ablated counterparts and other baselines. Based on the results, we obtain the following discoveries: (1) Comparing Fig. 3 (a) to (d)-(f), we observe that our proposed framework COSMIC generates the most discriminative node embeddings on meta-test classes, compared to the competitive baselines Meta-GNN, GPN, and TENT. This signifies that our framework extracts more generalizable knowledge and effectively transfers it to meta-test classes.
(2) Comparing Fig. 3 (a) to (b), it can be observed that the node embeddings learned by COSMIC without the proposed contrastive meta-learning method exhibit less intra-class discrimination. Comparing Fig. 3 (a) to (c), we observe that the learned node embeddings from different classes have more overlaps, which means the inter-class generalizability is limited. In other words, this visualization further validates the effectiveness of the proposed components for improving the GNN model's intra-class and inter-class generalizability.

Node Embedding Clustering Evaluation. For a more quantitative comparison, in this subsection, we present detailed node embedding evaluations on CoraFull with NMI and ARI scores in Table 2. Similar to the previous experiments, we discover that the proposed framework COSMIC learns the most discriminative node representations on meta-test classes, and the performance of the ablated variants of COSMIC degrades due to the limitations in both intra-class and inter-class generalizability.

Effect of Subgraph Size K_s
In this subsection, we conduct experiments to study the impact of the subgraph size K_s in COSMIC. Specifically, with a larger value of K_s, our framework incorporates more contextual information into each meta-task, which can further enhance the learning of transferable knowledge. Fig. 4 reports the results of our framework with varying values of K_s under four different few-shot settings. From the results, we observe that increasing the size of subgraphs first leads to better performance and then brings a slight performance drop. This is because larger subgraphs can involve more contextual information in each meta-task and thus contribute to the learning of transferable knowledge, while an excessively large subgraph can involve irrelevant information that harms the performance. Moreover, the performance advancement with larger subgraphs is more significant in 2-way settings, which means the incorporation of contextual information is more crucial in meta-tasks with fewer labeled nodes (i.e., a smaller support set).

Choice of Encoder
In this subsection, we conduct experiments on our framework with different choices of the encoder GNN_θ. Notably, our framework does not require a specific implementation of GNNs and is thus compatible with any kind of GNN architecture. In particular, we change the GNN encoder to GAT [35], GIN [51], GraphSAGE [11] (denoted as SAGE in Table 3), and SGC [47] to evaluate the effects of different GNNs. The results are provided in Table 3. From the results, we observe that GAT and SGC generally maintain relatively better performance across different few-shot settings. This is probably because these GNN encoders can more effectively exploit the structural information, which benefits from our contrastive meta-learning loss and mix-up strategy. Moreover, the results demonstrate that our proposed framework COSMIC maintains decent performance with various choices of GNN encoders, which validates the capability of COSMIC under different application scenarios. In the other experiments, for the sake of simplicity and generality, we deploy GCN as the encoder for all the baselines and our proposed framework.

RELATED WORK

Few-shot Node Classification
While GNNs have achieved great success in node classification [11,35,44], more recently, many studies [7,45,57] have shown that the performance of GNNs severely degrades when the number of labeled nodes is limited, i.e., the few-shot node classification problem. Inspired by how humans transfer previously learned knowledge to new tasks, researchers propose to adopt the meta-learning paradigm [9] to deal with this label shortage issue [43]. Particularly, the GNN models are trained by explicitly emulating the test environment for few-shot learning, so that the GNNs are expected to gain the adaptability to generalize to new domains. For example, Meta-GNN [57] applies MAML [9] to learn directions for optimization with limited labels. GPN [7] adopts Prototypical Networks [28] to make classifications based on the distance between the node feature and the class prototypes. MetaTNE [22] and RALE [24] also use episodic meta-learning to enhance the adaptability of the learned GNN encoder and achieve similar results. However, these existing works usually directly apply meta-learning to graphs [41], ignoring the crucial distinction from images that nodes in a graph are not i.i.d. data, which leads to several drawbacks as discussed in this paper. Our work bridges the gap by developing a novel contrastive meta-learning framework for few-shot node classification.

Graph Contrastive Learning
Contrastive learning has become an effective representation learning paradigm in the image [4], text [38], and graph [8,13,60] domains. Specifically, in a typical graph contrastive learning method, a GNN encoder is trained to maximize the consistency between differently augmented views of the original graph data. The augmentation is achieved by specific heuristic transformations, such as randomly dropping edges and nodes [13,53] and randomly perturbing the attributes of nodes and edges [34]. Pretraining with such general pretexts helps the GNN model learn transferable graph patterns [13]. The pretrained model parameters have been proven to be a superior initialization for GNNs when fine-tuned on various downstream tasks, including node classification [13,53]. However, all existing works fine-tune the GNNs on sufficiently labeled datasets, making them unsuitable for scenarios where only a few labeled nodes are available for fine-tuning [30].
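As a small illustration of these heuristic augmentations, the following pure-Python sketch (function names and probabilities are our own, not from the paper's code) creates two perturbed views of a toy graph by edge dropping and attribute masking:

```python
import random

def drop_edges(edges, drop_prob, rng=random):
    """Randomly remove a fraction of edges (a common heuristic augmentation)."""
    return [e for e in edges if rng.random() >= drop_prob]

def mask_features(features, mask_prob, rng=random):
    """Randomly zero out node attributes to create a perturbed view."""
    return [[0.0 if rng.random() < mask_prob else x for x in row]
            for row in features]

# Two stochastic "views" of the same toy graph for a contrastive objective.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
features = [[1.0, 0.5], [0.2, 0.9], [0.7, 0.1], [0.3, 0.4]]
view1 = (drop_edges(edges, 0.2), mask_features(features, 0.3))
view2 = (drop_edges(edges, 0.2), mask_features(features, 0.3))
```

A contrastive objective would then pull the encodings of `view1` and `view2` together while pushing apart views of different graphs.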

Mix-up on Graphs
Mix-up [55] has become a popular data augmentation technique for training deep models to enhance their generalizability and robustness. As a common practice [10,36,55], both the attributes and labels of a pair of original instances are linearly interpolated, with weights sampled from a Beta distribution, and the interpolated instances are integrated into the original dataset to train the model. To extend this technique to graphs, one work [46] modifies graph convolution to mix the graph parts within the receptive field. Another work [12] proposes to learn a graph generator to align a pair of graphs and interpolate the generated counterparts. However, these methods require extra deep modules to learn, making the generated graphs hard to interpret.
In our work, we design a novel heuristic method that calculates subgraph-level similarities, based on which the mix-up strategy can adaptively generate subgraphs of additional classes, and we incorporate them into our novel contrastive meta-learning framework.
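For reference, classic mix-up on a pair of instances can be sketched as follows (a minimal stdlib-only illustration; the function and toy inputs are hypothetical, not the subgraph-level variant proposed here):

```python
import random

def mixup(x1, y1, x2, y2, alpha=1.0, rng=random):
    """Classic mix-up: linearly interpolate feature vectors and one-hot
    labels with a single weight drawn from Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)  # lam lies in [0, 1]
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x_mix, y_mix, lam

# Mix two toy instances from classes 0 and 1 (one-hot labels).
x_mix, y_mix, lam = mixup([1.0, 0.0], [1, 0], [0.0, 1.0], [0, 1])
```

The resulting soft label `y_mix` interpolates the two classes with the same weight `lam` used for the features.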

CONCLUSION
In this paper, we investigate the few-shot node classification problem, which aims to assign labels for nodes with only limited labeled nodes as references. We improve the meta-learning strategy from the perspective of enhancing the intra-class and inter-class generalizability of the learned node embeddings.

A NOTATIONS
In this section, we provide the notations used in this paper along with their descriptions for comprehensive understanding.

B REPRODUCIBILITY
B.1 Baselines
We conduct experiments with the following baseline methods to compare performance:
• Prototypical Networks (ProtoNet) [28]: ProtoNet learns prototypes for classes within each meta-task and classifies query instances via their similarities to the prototypes.
• MAML [9]: MAML proposes to optimize model parameters according to gradients on the support instances and meta-update parameters based on the query instances.
• Meta-GNN [57]: Meta-GNN combines MAML and Graph Neural Networks (GNNs) to perform meta-learning on graph data for few-shot node classification.
• GPN [7]: GPN learns node importance and combines it with Prototypical Networks to improve performance.
• AMM-GNN [40]: AMM-GNN proposes to extend MAML with an attribute matching mechanism.
• G-Meta [15]: G-Meta utilizes subgraphs to learn node representations, based on which the classification is conducted.
• TENT [42]: TENT proposes to reduce the task variance among various meta-tasks and conducts task-adaptive few-shot node classification at different levels.
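To make the prototype-based baselines concrete, the core of ProtoNet-style classification can be sketched in a few lines (toy embeddings and names are hypothetical, not from any baseline's code):

```python
import math

def prototype(embeddings):
    """Class prototype = mean of the class's support embeddings."""
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]

def classify(query, prototypes):
    """Assign the query to the class with the nearest (Euclidean) prototype."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(prototypes, key=lambda c: dist(query, prototypes[c]))

# Toy 2-way 2-shot support set of embeddings.
support = {"A": [[0.0, 0.0], [0.2, 0.0]], "B": [[1.0, 1.0], [0.8, 1.0]]}
protos = {c: prototype(e) for c, e in support.items()}
print(classify([0.1, 0.1], protos))  # → A
```

In practice the embeddings would come from the GNN encoder, and a softmax over negative distances replaces the hard `min`.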
B.2 Datasets
• CoraFull [3] is an extension of the prevalent Cora dataset [52] to the entire citation network. On this graph, papers and citation relations are represented as nodes and edges, respectively. The classes of nodes are obtained according to the paper topics. For this dataset, we use 40/15/15 node classes for meta-training/meta-validation/meta-test.

B.3 Implementation Details
In this section, we provide more details on the implementation settings of our experiments. Specifically, we implement COSMIC with PyTorch [25] and train our framework on a single 48GB Nvidia A6000 GPU. We utilize a one-layer GCN [21] as our base GNN model with the hidden size set as 1024. For the Beta distribution, we set the two constant shape parameters as 10 and 5, respectively. Moreover, we set the number of meta-training tasks as 1000. During the meta-test phase, we randomly sample 100 meta-test tasks, where the query set size (i.e., |Q|) is set as 10. We adopt the Adam [20] optimization method, where the learning rates for the contrastive meta-learning loss and the cross-entropy loss are both set as 0.001.

C TRAINING TIME
In this subsection, we conduct additional experiments to evaluate the computational cost of our framework compared to other baselines. Specifically, in Table 6, we report the overall training time of the proposed framework, COSMIC, and the baselines on two typical datasets: Coauthor-CS, a simpler dataset with a smaller graph, and ogbn-arxiv, a harder dataset with a significantly larger graph. For all methods, we consider the training time until convergence (preprocessing time excluded) over 10 runs and report the average. For consistency, we run all the experiments on a single 48 GB Nvidia A6000 GPU. As shown in the results, for the smaller and simpler graph dataset Coauthor-CS, the proposed COSMIC requires more training time to achieve higher accuracy. However, for the larger and harder graph dataset ogbn-arxiv, COSMIC can achieve the best performance while demanding similar or even less training time. This is because COSMIC learns the encoder at the subgraph level, while all the existing meta-learning-based baselines resort to training a global graph encoder, thus leading to more training episodes. This experiment demonstrates that the proposed COSMIC can scale well to large and complex graphs.

D ADDITIONAL RESULTS
D.1 Meta-learning Frameworks
In this subsection, we provide the results of other meta-learning frameworks with our contrastive learning loss, as shown in Table 7.
From the results, we observe that using MAML is generally better than other meta-learning frameworks. We chose MAML as our meta-learning framework due to its ability to perform both inner- and outer-loop optimizations, which aligns with our strategy of first learning node representations with intra-class and inter-class generalizability and then conducting classification. Although our contrastive loss can also be incorporated into other meta-learning frameworks, they do not contain such a two-stage optimization, which makes them less effective for our purpose.
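The inner-/outer-loop structure of MAML can be illustrated with a toy first-order (FOMAML) sketch on scalar quadratic losses; the tasks, learning rates, and numerical gradient below are hypothetical simplifications, not the paper's actual optimization:

```python
def grad(loss_fn, w, eps=1e-6):
    """Numerical gradient of a scalar loss at w (for illustration only)."""
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

def fomaml_step(w, tasks, inner_lr=0.1, outer_lr=0.05):
    """One first-order MAML meta-update over a batch of tasks.
    Each task provides a support loss (inner loop) and a query loss (outer loop)."""
    meta_grad = 0.0
    for support_loss, query_loss in tasks:
        w_adapted = w - inner_lr * grad(support_loss, w)  # inner-loop adaptation
        meta_grad += grad(query_loss, w_adapted)          # outer gradient at adapted params
    return w - outer_lr * meta_grad / len(tasks)

# Toy tasks: quadratic losses with per-task optima (hypothetical).
tasks = [(lambda w: (w - 1.0) ** 2, lambda w: (w - 1.2) ** 2),
         (lambda w: (w + 1.0) ** 2, lambda w: (w - 0.8) ** 2)]
w = 0.0
for _ in range(100):
    w = fomaml_step(w, tasks)
```

The meta-parameter converges to an initialization from which one inner step does well on each task's query loss, mirroring the two-stage optimization described above.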

D.2 Contrastive Learning Frameworks
In this subsection, we conduct additional experiments and provide the results of different contrastive learning methods with MAML as the framework, as shown in Table 8.
From the results, we observe that our proposed contrastive meta-learning loss achieves superior performance. This is because, compared to other contrastive learning methods, our strategy can enhance the intra-class and inter-class generalizability of the model, which is more suitable for few-shot node classification problems.

Figure 1: The illustration of the proposed framework COSMIC under the 2-way 1-shot setting: (a) episodic meta-learning framework; (b) contrastive meta-learning strategy during each episode; (c) similarity-sensitive subgraph mix-up strategy. Specifically, CE loss denotes the cross-entropy loss, and nodes in different colors indicate different classes.
$\mathcal{G} = \{g_{i,j} \mid i = 1, 2, \dots, N,\ j = 1, 2, \dots, K\}$. Note that $g_{i,j} = (\mathcal{V}_{i,j}, \mathcal{E}_{i,j}, \mathbf{X}_{i,j})$, where $\mathcal{V}_{i,j}$, $\mathcal{E}_{i,j}$, and $\mathbf{X}_{i,j}$ are the node set, edge set, and feature matrix of subgraph $g_{i,j}$, respectively. With the GNN encoder, we can obtain the representation of each node. Then, with the extracted subgraph for node $v_i$, i.e., $g_i = (\mathbf{A}_i, \mathbf{X}_i)$, we can perform mix-up as follows:
$$\widetilde{\mathbf{A}}_{i,j} = \Lambda_A \circ \mathbf{A}_i + (\mathbf{1} - \Lambda_A) \circ \mathbf{A}_j, \qquad \widetilde{\mathbf{X}}_{i,j} = \Lambda_X \circ \mathbf{X}_i + (\mathbf{1} - \Lambda_X) \circ \mathbf{X}_j,$$
where $\widetilde{\mathbf{A}}_{i,j}$ and $\widetilde{\mathbf{X}}_{i,j}$ are the mixed adjacency matrix and feature matrix for the subgraph generated from node $v_i$, respectively. Moreover, $\circ$ denotes the element-wise multiplication operation. $\Lambda_A \in \mathbb{R}^{n_i \times n_i}$ and $\Lambda_X \in \mathbb{R}^{n_i \times d}$ are mixing ratio matrices based on the similarity between the two subgraphs $g_i$ and $g_j$. To provide more variance for the mix-up strategy, we sample each element $\lambda \in [0, 1]$ in $\Lambda_A$ and $\Lambda_X$ independently from the commonly used $\mathrm{Beta}(\alpha_{i,j}, \beta)$ distribution [54]:
$$(\Lambda_A)_{p,q} \sim \mathrm{Beta}(\alpha_{i,j}, \beta),\ \forall p, q \in \{1, 2, \dots, n_i\}, \qquad (\Lambda_X)_{p,k} \sim \mathrm{Beta}(\alpha_{i,j}, \beta),\ \forall p \in \{1, 2, \dots, n_i\},\ \forall k \in \{1, 2, \dots, d\}. \quad (14)$$
Here $\mathbf{s}_i$ (or $\mathbf{s}_j$) denotes the importance score of node $v_i$ (or $v_j$), i.e., the corresponding row in $\mathbf{S}$. Moreover, $\mathbf{s}_i(k)$ denotes the $k$-th entry in $\mathbf{s}_i$, $\sigma(\cdot)$ is the sigmoid function, and $\beta$ is a constant value to control the magnitude of $\alpha_{i,j}$. In this way, we obtain an adaptive value for the parameter of the Beta distribution, from which $\Lambda_A$ and $\Lambda_X$ are sampled. Denoting the mixed subgraph as $\widetilde{g}_{i,j} = (\widetilde{\mathbf{A}}_{i,j}, \widetilde{\mathbf{X}}_{i,j})$, we can obtain the set of mixed subgraphs $\widetilde{\mathcal{G}} = \{\widetilde{g}_{i,j} \mid i = 1, 2, \dots, N,\ j = 1, 2, \dots, K\}$ after performing mix-up for each node in $\mathcal{S}$. In this way, we obtain $N$ mixed classes in addition to the original $N$ classes in each meta-task. Denoting the central node of each mixed subgraph as $\widetilde{v}_{i,j}$, the generated mixed classes can be represented as $\widetilde{\mathcal{C}} = \{\widetilde{c}_i \mid i = 1, 2, \dots, N\}$. Then these mixed classes are used as additional classes in Eq. (2), which is improved as follows:
$$\mathcal{L}_{i,j} = -\,\mathrm{MI}(\widetilde{g}_{i,j}, c_i) + \sum_{k=1, k \neq i}^{2N} \mathrm{MI}(\widetilde{g}_{i,j}, c_k).$$
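A simplified sketch of this element-wise mix-up with Beta-sampled ratio matrices follows (fixed-size toy subgraphs; the similarity-adaptive Beta parameter is replaced by the constant shape values from the implementation details, so this is an illustration, not the full method):

```python
import random

def mix_matrices(m1, m2, alpha, beta, rng=random):
    """Element-wise mix-up of two equally sized matrices, with each entry's
    mixing ratio drawn independently from Beta(alpha, beta)."""
    mixed = []
    for row1, row2 in zip(m1, m2):
        mixed.append([(lam := rng.betavariate(alpha, beta)) * a + (1 - lam) * b
                      for a, b in zip(row1, row2)])
    return mixed

# Two toy 3-node subgraphs (adjacency + 2-d features); values hypothetical.
A1 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A2 = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
X1 = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
X2 = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
A_mix = mix_matrices(A1, A2, alpha=10, beta=5)
X_mix = mix_matrices(X1, X2, alpha=10, beta=5)
```

Each mixed entry lands between the two corresponding original entries, so the mixed subgraph interpolates both structure and attributes.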

3: Sample a meta-training task T = {S, Q} from the meta-training classes;
4: Construct a subgraph for each node in the support set S;
5: Mix up the constructed subgraphs based on similarities;
6:
10: t ← t + 1;
11: end while
// Meta-test phase
12: Construct a subgraph for each node in the support set S and the query set Q;
13: Compute the subgraph representations for nodes in S and Q with the trained GNN encoder;
14: Fine-tune a simple classifier based on the subgraph representations from the support set S according to Eq. (20);
15: Predict labels for query nodes based on the subgraph representations from the query set Q;
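The episode construction underlying both the meta-training and meta-test phases can be sketched as follows (a stdlib-only illustration with hypothetical names; subgraph extraction and encoding are omitted):

```python
import random

def sample_episode(nodes_by_class, n_way, k_shot, n_query, rng=random):
    """Sample an N-way K-shot episode: pick N classes, then K support and
    n_query query nodes per class, without overlap between the two sets."""
    classes = rng.sample(sorted(nodes_by_class), n_way)
    support, query = [], []
    for label, c in enumerate(classes):
        picked = rng.sample(nodes_by_class[c], k_shot + n_query)
        support += [(node, label) for node in picked[:k_shot]]
        query += [(node, label) for node in picked[k_shot:]]
    return support, query

# Toy graph with 5 classes of 20 nodes each (node ids are hypothetical).
nodes_by_class = {c: list(range(20 * i, 20 * (i + 1)))
                  for i, c in enumerate("ABCDE")}
support, query = sample_episode(nodes_by_class, n_way=2, k_shot=1, n_query=3)
```

A subgraph would then be constructed around each sampled node, as in steps 4 and 12 above.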

Figure 4: The results of COSMIC with varying values of the subgraph size on ogbn-arxiv and DBLP.

Table 2: The overall NMI (↑) and ARI (↑) results of COSMIC and baselines on two datasets under the 5-way 5-shot setting.

Table 3: Comparisons of different GNNs used in our framework on ogbn-arxiv.

Table 4: Notations used in this paper.

Table 5: Statistics of four node classification datasets.

• Coauthor-CS [27] is a co-authorship graph obtained from the Microsoft Academic Graph in the KDD Cup 2016 challenge. Specifically, nodes represent the authors, and edges denote the relations that they co-authored a paper. Moreover, node features represent the paper keywords in each author's papers. The node classes are assigned based on the most active fields of the authors. We use 5/5/5 node classes for meta-training/meta-validation/meta-test.
• DBLP [33] is a citation network, where the nodes represent papers, and edges denote the citation relations between papers. Specifically, the node features are obtained based on the paper abstracts, and node classes are assigned according to the paper venues. For this dataset, we use 77/30/30 node classes for meta-training/meta-validation/meta-test.

Table 6: The overall training time results of COSMIC and baselines on two datasets under the 5-way 1-shot setting.

Table 7: The overall results of using other meta-learning frameworks with our contrastive loss on ogbn-arxiv.

ProtoNet [28]: 67.36±3.73, 76.99±2.87, 39.10±1.64, 56.57±2.14
Matching [37]: 68.86±3.30, 77.01±2.79, 44.79±1.85, 58.72±1.94
Relation [29]: 71.10±3.00, 79.56±3.12, 47.88±1.66, 60.96±1.54
COSMIC: 75.71±3.17, 85.19±2.35, 53.28±2.19, 65.42±1.69