GraphPrompt: Unifying Pre-Training and Downstream Tasks for Graph Neural Networks

Graphs can model complex relationships between objects, enabling a myriad of Web applications such as online page/article classification and social recommendation. While graph neural networks (GNNs) have emerged as a powerful tool for graph representation learning, in an end-to-end supervised setting their performance relies heavily on a large amount of task-specific supervision. To reduce the labeling requirement, the "pre-train, fine-tune" and "pre-train, prompt" paradigms have become increasingly common. In particular, prompting is a popular alternative to fine-tuning in natural language processing, designed to narrow the gap between pre-training and downstream objectives in a task-specific manner. However, existing studies of prompting on graphs are still limited, lacking a universal treatment that appeals to different downstream tasks. In this paper, we propose GraphPrompt, a novel pre-training and prompting framework on graphs. GraphPrompt not only unifies pre-training and downstream tasks into a common task template, but also employs a learnable prompt to assist a downstream task in locating the most relevant knowledge from the pre-trained model in a task-specific manner. Finally, we conduct extensive experiments on five public datasets to evaluate and analyze GraphPrompt.


INTRODUCTION
The ubiquitous Web is becoming the ultimate data repository, capable of linking a broad spectrum of objects to form gigantic and complex graphs. The prevalence of graph data enables a series of downstream tasks for Web applications, ranging from online page/article classification to friend recommendation in social networks. Modern approaches for graph analysis generally resort to graph representation learning, including graph embedding and graph neural networks (GNNs). Earlier graph embedding approaches [12,33,41] usually embed nodes on the graph into a low-dimensional space, in which structural information such as the proximity between nodes can be captured [5]. More recently, GNNs [13,20,43,50] have emerged as the state of the art for graph representation learning. Their key idea boils down to a message-passing framework, in which each node derives its representation by receiving and aggregating messages from its neighboring nodes recursively [48].
Graph pre-training. Typically, GNNs work in an end-to-end manner, and their performance depends heavily on the availability of large-scale, task-specific labeled data as supervision. This supervised paradigm presents two problems. First, task-specific supervision is often difficult or costly to obtain. Second, to deal with a new task, the weights of GNN models need to be retrained from scratch, even if the task is on the same graph. To address these issues, pre-training GNNs [15,16,30,34] has become increasingly popular, inspired by pre-training techniques in language and vision applications [1,7]. The pre-training of GNNs leverages self-supervised learning on more readily available label-free graphs (i.e., graphs without task-specific labels), and learns intrinsic graph properties that are intended to be general across tasks and graphs in a domain. In other words, the pre-training extracts a task-agnostic prior, which can be used to initialize model weights for a new task. Subsequently, the initial weights can be quickly updated through a lightweight fine-tuning step on a smaller number of task-specific labels.
However, the "pre-train, fine-tune" paradigm suffers from the problem of inconsistent objectives between pre-training and downstream tasks, resulting in suboptimal performance [23]. On one hand, the pre-training step aims to preserve various intrinsic graph properties such as node/edge features [15,16], node connectivity/links [13,16,30], and local/global patterns [15,30,34]. On the other hand, the fine-tuning step aims to reduce the task loss, i.e., to fit the ground truth of the downstream task. The discrepancy between the two steps can be quite large. For example, pre-training may focus on learning the connectivity pattern between two nodes (i.e., related to link prediction), whereas fine-tuning could be dealing with a node or graph property (i.e., a node classification or graph classification task).
Prior work. To narrow the gap between pre-training and downstream tasks, prompting [4] was first proposed for language models. A prompt is a natural language instruction designed for a specific downstream task to "prompt out" the semantic relevance between the task and the language model. Meanwhile, the parameters of the pre-trained language model are frozen without any fine-tuning, as the prompt can "pull" the task toward the pre-trained model. Thus, prompting is also more efficient than fine-tuning, especially when the pre-trained model is huge. Recently, prompting has also been introduced to graph pre-training in the GPPT approach [39]. While this pioneering work proposes a sophisticated design of pre-training and prompting, it can only be employed for the node classification task, lacking a universal treatment that appeals to different downstream tasks such as both node classification and graph classification.
Research problem and challenges. To address the divergence between graph pre-training and various downstream tasks, in this paper we investigate the design of pre-training and prompting for graph neural networks. In particular, we aim for a unified design that can suit different downstream tasks flexibly. This problem is non-trivial due to the following two challenges. Firstly, to enable effective knowledge transfer from pre-training to a downstream task, it is desirable that the pre-training step preserves graph properties that are compatible with the given task. However, since different downstream tasks often have different objectives, how do we unify pre-training with various downstream tasks on graphs, so that a single pre-trained model can universally support different tasks? That is, we try to convert the pre-training task and downstream tasks to follow the same "template". Using pre-trained language models as an analogy, both their pre-training and downstream tasks can be formulated as masked language modeling.
Secondly, under the unification framework, it is still important to identify the distinction between different downstream tasks, in order to attain task-specific optima. For pre-trained language models, prompts in the form of natural language tokens or learnable word vectors have been designed to give different hints to different tasks, but it is less apparent what form prompts on graphs should take. Hence, how do we design prompts on graphs, so that they can guide different downstream tasks to effectively make use of the pre-trained model?
Present work. To address these challenges, we propose a novel graph pre-training and prompting framework, called GraphPrompt, aiming to unify the pre-training and downstream tasks for GNNs. Drawing inspiration from the prompting strategy for pre-trained language models, GraphPrompt capitalizes on a unified template to define the objectives for both pre-training and downstream tasks, thus bridging their gap. We further equip GraphPrompt with task-specific learnable prompts, which guide the downstream task to exploit relevant knowledge from the pre-trained GNN model. The unified approach endows GraphPrompt with the ability to work with limited supervision, such as in few-shot learning tasks.
More specifically, to address the first challenge of unification, we focus on graph topology, which is a key enabler of graph models. In particular, a subgraph is a universal structure that can be leveraged for both node- and graph-level tasks. At the node level, the information of a node can be enriched and represented by its contextual subgraph, i.e., a subgraph in which the node resides [17,55]; at the graph level, the information of a graph is naturally represented by the maximum subgraph (i.e., the graph itself). Consequently, we unify both the node- and graph-level tasks, whether in pre-training or downstream, into the same template: the similarity calculation of (sub)graph representations. In this work, we adopt link prediction as the self-supervised pre-training task, given that links are readily available in any graph without additional annotation cost. Meanwhile, we focus on the popular node classification and graph classification as downstream tasks, which are node- and graph-level tasks, respectively. All these tasks can be cast as instances of learning subgraph similarity. On one hand, the link prediction task in pre-training boils down to the similarity between the contextual subgraphs of two nodes, as shown in Fig. 1(a). On the other hand, the downstream node or graph classification task boils down to the similarity between the target instance (a node's contextual subgraph or the whole graph, respectively) and the class prototypical subgraphs constructed from labeled data, as illustrated in Figs. 1(b) and (c). The unified template bridges the gap between the pre-training and different downstream tasks.
Toward the second challenge, we distinguish different downstream tasks by way of the ReadOut operation on subgraphs. The ReadOut operation is essentially an aggregation function that fuses node representations in the subgraph into a single subgraph representation. For instance, sum pooling, which sums the representations of all nodes in the subgraph, is a practical and popular scheme for ReadOut. However, different downstream tasks can benefit from different aggregation schemes for their ReadOut. In particular, node classification tends to focus on features that can contribute to the representation of the target node, while graph classification tends to focus on features associated with the graph class. Motivated by such differences, we propose a novel task-specific learnable prompt to guide the ReadOut operation of each downstream task with an appropriate aggregation scheme. As shown in Fig. 1, the learnable prompt serves as the parameters of the ReadOut operation of downstream tasks, and thus enables different aggregation functions on the subgraphs of different tasks. Hence, GraphPrompt not only unifies the pre-training and downstream tasks into the same template based on subgraph similarity, but also recognizes the differences between various downstream tasks to guide task-specific objectives.
Contributions. To summarize, our contributions are three-fold. (1) We recognize the gap between graph pre-training and downstream tasks, and propose a unification framework GraphPrompt based on subgraph similarity for both pre-training and downstream tasks, including both node and graph classification tasks. (2) We propose a novel prompting strategy for GraphPrompt, hinging on a learnable prompt to actively guide downstream tasks using task-specific aggregation in ReadOut, in order to drive the downstream tasks to exploit the pre-trained model in a task-specific manner. (3) We conduct extensive experiments on five public datasets, and the results demonstrate the superior performance of GraphPrompt in comparison to the state-of-the-art approaches.

RELATED WORK
Graph representation learning. The rise of graph representation learning, including earlier graph embedding [12,33,41] and recent GNNs [13,20,43,50], opens up great opportunities for various downstream tasks at node and graph levels. Note that learning graph-level representations requires an additional ReadOut operation, which summarizes the global information of a graph by aggregating node representations through a flat [8,11,50,56] or hierarchical [10,21,31,51] pooling algorithm. We refer the readers to two comprehensive surveys [5,48] for more details.
Graph pre-training. Inspired by the application of pre-training models in language [2,7] and vision [1,29] domains, graph pre-training [49] emerges as a powerful paradigm that leverages self-supervision on label-free graphs to learn intrinsic graph properties. While the pre-training learns a task-agnostic prior, a relatively light-weight fine-tuning step is further employed to update the pre-trained weights to fit a given downstream task. Different pre-training approaches design different self-supervised tasks based on various graph properties such as node features [15,16], links [13,16,19,30], local or global patterns [15,30,34], local-global consistency [14,32,37,44], and their combinations [40,52,53].
However, the above approaches do not consider the gap between pre-training and downstream objectives, which limits their generalization ability to handle different tasks. Some recent studies recognize the importance of narrowing this gap. L2P-GNN [30] capitalizes on meta-learning [9] to simulate the fine-tuning step during pre-training. However, since the downstream tasks can still differ from the simulation task, the problem is not fundamentally addressed. In other fields, as an alternative to fine-tuning, researchers turn to prompting [4], in which a task-specific prompt is used to cue the downstream tasks. Prompts can be either handcrafted [4] or learnable [22,24]. On graph data, the study of prompting is still limited. One recent work called GPPT [39] capitalizes on a sophisticated design of learnable prompts on graphs, but it only works with node classification, lacking a unification effort to accommodate other downstream tasks like graph classification. Besides, there is a model also named GraphPrompt [54], but it considers an NLP task (biomedical entity normalization) on text data, where the graph is only auxiliary. It employs the standard text prompt unified by masked language modeling, assisted by a relational graph to generate text templates, which is distinct from our work.
Comparison to other settings. Our few-shot setting is different from other paradigms that also deal with label scarcity, including semi-supervised learning [20] and meta-learning [9]. In particular, semi-supervised learning cannot cope with novel classes not seen in training, while meta-learning requires a large volume of labeled data in its base classes for a meta-training phase, before it can handle few-shot tasks in testing.

PRELIMINARIES
In this section, we give the problem definition and introduce the background of GNNs.

Problem Definition
Graph. A graph can be defined as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. We also assume an input feature matrix of the nodes, $\mathbf{X} \in \mathbb{R}^{|V| \times d}$, is available. Let $\mathbf{x}_i \in \mathbb{R}^{d}$ denote the feature vector of node $v_i \in V$. In addition, we denote a set of graphs as $\mathcal{G} = \{G_1, G_2, \ldots, G_N\}$.
Problem. In this paper, we investigate the problem of graph pre-training and prompting. For the downstream tasks, we consider the popular node classification and graph classification tasks. For node classification on a graph $G = (V, E)$, let $C$ be the set of node classes with $\ell_i \in C$ denoting the class label of node $v_i \in V$. For graph classification on a set of graphs $\mathcal{G}$, let $\mathcal{C}$ be the set of graph labels with $L_i \in \mathcal{C}$ denoting the class label of graph $G_i \in \mathcal{G}$.
In particular, the downstream tasks are given limited supervision in a few-shot setting: for each class in the two tasks, only $k$ labeled samples (i.e., nodes or graphs) are provided, known as $k$-shot classification.

Graph Neural Networks
The success of GNNs boils down to the message-passing mechanism [48], in which each node receives and aggregates messages (i.e., features or embeddings) from its neighboring nodes to generate its own representation. This operation of neighborhood aggregation can be stacked in multiple layers to enable recursive message passing. Formally, in the $l$-th GNN layer, the embedding of node $v$, denoted by $\mathbf{h}_v^{l}$, is calculated based on the embeddings in the previous layer, as follows.
$$\mathbf{h}_v^{l} = \text{Aggr}\big(\mathbf{h}_v^{l-1}, \{\mathbf{h}_u^{l-1} : u \in \mathcal{N}_v\}; \theta^{l}\big), \tag{1}$$
where $\mathcal{N}_v$ is the set of neighboring nodes of $v$, and $\theta^{l}$ denotes the learnable GNN parameters in layer $l$. $\text{Aggr}(\cdot)$ is the neighborhood aggregation function and can take various forms, ranging from the simple mean pooling [13,20] to advanced neural networks such as neural attention [43] or multi-layer perceptrons [50]. Note that in the first layer, the input node embedding $\mathbf{h}_v^{0}$ can be initialized as the node features in $\mathbf{X}$. The total learnable GNN parameters can be denoted as $\Theta = \{\theta^{1}, \theta^{2}, \ldots\}$. For brevity, we simply denote the output node representations of the last layer as $\mathbf{h}_v$.
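To make the recursion concrete, the following is a minimal sketch of one message-passing layer with mean aggregation, written in plain Python. It is only an illustration under stated assumptions: the function name `mean_aggregate_layer`, the choice to include the node's own embedding in the mean, the single weight matrix, and the ReLU nonlinearity are all hypothetical choices, not the specific aggregation used by any of the cited GNNs.

```python
def mean_aggregate_layer(h, neighbors, weight):
    """One message-passing layer (illustrative assumption: mean over the node
    itself and its neighbors, then a linear map, then ReLU).

    h:         dict node -> embedding (list of floats) from the previous layer
    neighbors: dict node -> list of neighboring nodes
    weight:    list of rows, the layer's learnable parameters
    """
    out = {}
    for v, h_v in h.items():
        messages = [h_v] + [h[u] for u in neighbors.get(v, [])]
        dim = len(h_v)
        mean = [sum(m[i] for m in messages) / len(messages) for i in range(dim)]
        z = [sum(w * x for w, x in zip(row, mean)) for row in weight]
        out[v] = [max(0.0, z_i) for z_i in z]
    return out


# Toy path graph 0 - 1 - 2 with 2-dimensional input features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
h0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]

h1 = mean_aggregate_layer(h0, neighbors, identity)  # first layer
h2 = mean_aggregate_layer(h1, neighbors, identity)  # stacking layers = recursive message passing
```

After the second call, each node's embedding depends on nodes up to two hops away, which is the sense in which stacked layers pass messages recursively.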

PROPOSED APPROACH
In this section, we present our proposed approach GraphPrompt.

Unification Framework
We first introduce the overall framework of GraphPrompt in Fig. 2. Our framework is deployed on a set of label-free graphs shown in Fig. 2(a), for pre-training in Fig. 2(b). The pre-training adopts a link prediction task, which is self-supervised without requiring extra annotation. Afterward, in Fig. 2(c), we capitalize on a learnable prompt to guide each downstream task, namely, node classification or graph classification, for task-specific exploitation of the pre-trained model. In the following, we explain how the framework supports a unified view of pre-training and downstream tasks.

Instances as subgraphs.
The key to the unification of pre-training and downstream tasks lies in finding a common template for the tasks. The task-specific prompt can then be further fused with the template of each downstream task, to distinguish the varying characteristics of different tasks.
In comparison to other fields such as visual and language processing, graph learning is uniquely characterized by the exploitation of graph topology. In particular, a subgraph is a universal structure capable of expressing both node- and graph-level instances. On one hand, at the node level, every node resides in a local neighborhood, which in turn contextualizes the node [25,27,28]. The local neighborhood of a node $v$ on a graph $G = (V, E)$ is usually defined by a contextual subgraph $S_v = (V(S_v), E(S_v))$, where its sets of nodes and edges are respectively given by
$$V(S_v) = \{u \in V \mid d(u, v) \le \delta\}, \tag{2}$$
$$E(S_v) = \{(u, u') \in E \mid u \in V(S_v),\ u' \in V(S_v)\}, \tag{3}$$
where $d(u, v)$ gives the shortest distance between nodes $u$ and $v$ on the graph $G$, and $\delta$ is a predetermined threshold. That is, $S_v$ consists of nodes within $\delta$ hops from the node $v$, and the edges between those nodes. Thus, the contextual subgraph $S_v$ embodies not only the self-information of the node $v$, but also rich contextual information to complement the self-information [17,55]. On the other hand, at the graph level, the maximum subgraph of a graph $G$, denoted $S_G$, is the graph itself, i.e., $S_G = G$. The maximum subgraph $S_G$ naturally embodies all information of $G$. In summary, subgraphs can be used to represent both node- and graph-level instances: given an instance $x$ which can either be a node or a graph (e.g., $x = v$ or $x = G$), the subgraph $S_x$ offers unified access to the information associated with $x$.
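The contextual subgraph of a node can be extracted with a breadth-first search that stops after a fixed number of hops. The sketch below is a minimal illustration, assuming an undirected graph stored as a dictionary of neighbor sets; the function name `contextual_subgraph` and the adjacency representation are hypothetical and not taken from the paper.

```python
from collections import deque


def contextual_subgraph(adj, v, delta):
    """Return (nodes, edges) of the subgraph induced by all nodes within
    `delta` hops of `v`. `adj` maps each node to the set of its neighbors."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == delta:
            continue  # nodes at the hop limit are kept but not expanded
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    nodes = set(dist)
    # Keep every original edge whose two endpoints both lie in the subgraph.
    edges = {frozenset((u, w)) for u in nodes for w in adj[u] if w in nodes}
    return nodes, edges


adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}  # path graph 0 - 1 - 2 - 3
nodes, edges = contextual_subgraph(adj, 0, 2)
```

On this path graph, the 2-hop contextual subgraph of node 0 contains nodes 0, 1 and 2 and the two edges between them. Applying the same access pattern to a whole graph simply returns the graph itself, which is the maximum subgraph described above.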
Unified task template. Based on the above subgraph definitions for both node- and graph-level instances, we are ready to unify different tasks to follow a common template. Specifically, the link prediction task in pre-training and the downstream node and graph classification tasks can all be redefined as subgraph similarity learning. Let $\mathbf{s}_x$ be the vector representation of the subgraph $S_x$, and $\text{sim}(\cdot, \cdot)$ be the cosine similarity function. As illustrated in Figs. 2(b) and (c), the three tasks can be mapped to the computation of subgraph similarity, which is formalized below.
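• Link prediction: This is a node-level task. Given a graph $G = (V, E)$ and a triplet of nodes $(v, a, b)$ such that $(v, a) \in E$ and $(v, b) \notin E$, we shall have
$$\text{sim}(\mathbf{s}_v, \mathbf{s}_a) > \text{sim}(\mathbf{s}_v, \mathbf{s}_b). \tag{4}$$
Intuitively, the contextual subgraph of $v$ shall be more similar to that of a node linked to $v$ than to that of another unlinked node.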
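• Node classification: Given a graph $G = (V, E)$ with a set of node classes $C$ and a set of labeled nodes $D = \{(v_1, \ell_1), (v_2, \ell_2), \ldots\}$, where $v_i \in V$ and $\ell_i \in C$ is the corresponding label, the $k$-shot setting provides $k$ labeled nodes for every class $c \in C$. For each class $c$, we define a node class prototypical subgraph represented by the vector
$$\tilde{\mathbf{s}}_c = \frac{1}{k} \sum_{(v_i, \ell_i) \in D,\ \ell_i = c} \mathbf{s}_{v_i}. \tag{5}$$
Note that the class prototypical subgraph is a "virtual" subgraph with a latent representation in the same embedding space as the node contextual subgraphs. Basically, it is constructed as the mean representation of the contextual subgraphs of labeled nodes in a given class. Then, given a node $v_j$ not in the labeled set $D$, its class label $\ell_j$ shall be
$$\ell_j = \arg\max_{c \in C} \text{sim}(\mathbf{s}_{v_j}, \tilde{\mathbf{s}}_c). \tag{6}$$
Intuitively, a node shall belong to the class whose prototypical subgraph is the most similar to the node's contextual subgraph.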
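• Graph classification: Given a set of graphs $\mathcal{G}$ with a set of graph classes $\mathcal{C}$ and a set of labeled graphs $\mathcal{D} = \{(G_1, L_1), (G_2, L_2), \ldots\}$, the class prototypical subgraph of each class $c \in \mathcal{C}$ is likewise represented by the mean embedding of the labeled graphs in that class:
$$\tilde{\mathbf{s}}_c = \frac{1}{k} \sum_{(G_i, L_i) \in \mathcal{D},\ L_i = c} \mathbf{s}_{G_i}. \tag{7}$$
Then, given a graph $G_j$ not in the labeled set $\mathcal{D}$, its class label $L_j$ shall be
$$L_j = \arg\max_{c \in \mathcal{C}} \text{sim}(\mathbf{s}_{G_j}, \tilde{\mathbf{s}}_c). \tag{8}$$
Intuitively, a graph shall belong to the class whose prototypical subgraph is the most similar to itself.
It is worth noting that node and graph classification can be further condensed into a single set of notations. Let $(x, y)$ be an annotated instance, where $x$ is either a node or a graph and $y \in Y$ is the class label of $x$ among the set of classes $Y$. Then,
$$y = \arg\max_{c \in Y} \text{sim}(\mathbf{s}_x, \tilde{\mathbf{s}}_c). \tag{9}$$
Finally, to materialize the common task template, we discuss how to learn the subgraph embedding vector $\mathbf{s}_x$ for the subgraph $S_x$. Given node representations $\mathbf{h}_v$ generated by a GNN (see Sect. 3.2), a standard approach to computing $\mathbf{s}_x$ is to employ a ReadOut operation that aggregates the representations of nodes in the subgraph $S_x$. That is,
$$\mathbf{s}_x = \text{ReadOut}(\{\mathbf{h}_v : v \in V(S_x)\}). \tag{10}$$
The choice of the aggregation scheme for ReadOut is flexible, including sum pooling and more advanced techniques [50,51]. In our implementation, we simply use sum pooling. In summary, the unification framework is enabled by the common task template of subgraph similarity learning, which lays the foundation of our pre-training and prompting strategies as we will introduce in the following parts.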
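The classification side of the template can be illustrated with a short sketch: sum-pool node embeddings into a subgraph vector, average the labeled subgraph vectors of each class into a prototype, and assign a query to the class with the most similar prototype under cosine similarity. This is a plain-Python illustration under assumptions, not the authors' implementation; the helper names `readout_sum`, `class_prototypes` and `classify`, and the toy embeddings, are hypothetical.

```python
import math


def readout_sum(node_embeddings):
    """Sum pooling: add up the embeddings of all nodes in a subgraph."""
    dim = len(node_embeddings[0])
    return [sum(h[i] for h in node_embeddings) for i in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def class_prototypes(labeled):
    """labeled: list of (subgraph_vector, label) pairs.
    Returns label -> mean subgraph vector of that class."""
    groups = {}
    for s, y in labeled:
        groups.setdefault(y, []).append(s)
    return {y: [sum(col) / len(vs) for col in zip(*vs)] for y, vs in groups.items()}


def classify(s_x, prototypes):
    """Pick the class whose prototype is most similar to the instance."""
    return max(prototypes, key=lambda c: cosine(s_x, prototypes[c]))


# 1-shot toy example: one labeled subgraph per class, two nodes per subgraph.
labeled = [
    (readout_sum([[1.0, 0.0], [1.0, 0.2]]), "A"),
    (readout_sum([[0.0, 1.0], [0.1, 1.0]]), "B"),
]
prototypes = class_prototypes(labeled)
query = readout_sum([[0.9, 0.1], [1.0, 0.0]])
predicted = classify(query, prototypes)
```

The same three functions serve node classification (the inputs are embeddings of a node's contextual subgraph) and graph classification (the inputs are embeddings of all nodes in the graph), which is the point of the shared template.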

Pre-Training Phase
As discussed earlier, our pre-training phase employs the link prediction task. Using link prediction/generation is a popular and natural choice [13,16,18,30], as a vast number of links are readily available on large-scale graph data without extra annotation. In other words, the link prediction objective can be optimized on label-free graphs, such as those shown in Fig. 2(a), in a self-supervised manner.
Based on the common template defined in Sect. 4.1, the link prediction task is anchored on the similarity of the contextual subgraphs of two candidate nodes. Generally, the subgraphs of two positive (i.e., linked) candidates shall be more similar than those of negative (i.e., non-linked) candidates, as illustrated in Fig. 2(b). Subsequently, the pre-trained prior on subgraph similarity can be naturally transferred to node classification downstream, which shares a similar intuition: the subgraphs of nodes in the same class shall be more similar than those of nodes from different classes. On the other hand, the prior can also support graph classification downstream, as graph similarity is consistent with subgraph similarity not only in letter (as a graph is technically always a subgraph of itself), but also in spirit. The "spirit" here refers to the tendency that graphs sharing similar subgraphs are likely to be similar themselves, which means graph similarity can be translated into the similarity of the subgraphs they contain [36,42,56].
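Formally, given a node $v$ on graph $G$, we randomly sample one positive node $a$ from $v$'s neighbors, and a negative node $b$ from the graph that does not link to $v$, forming a triplet $(v, a, b)$. Our objective is to increase the similarity between the contextual subgraphs $S_v$ and $S_a$, while decreasing that between $S_v$ and $S_b$. More generally, on a set of label-free graphs $\mathcal{G}$, we sample a number of triplets from each graph to construct an overall training set $\mathcal{T}_{\text{pre}}$. Then, we define the following pre-training loss.
$$\mathcal{L}_{\text{pre}}(\Theta) = -\sum_{(v, a, b) \in \mathcal{T}_{\text{pre}}} \ln \frac{\exp(\text{sim}(\mathbf{s}_v, \mathbf{s}_a)/\tau)}{\sum_{u \in \{a, b\}} \exp(\text{sim}(\mathbf{s}_v, \mathbf{s}_u)/\tau)}, \tag{11}$$
where $\tau$ is a temperature hyperparameter to control the shape of the output distribution. Note that the loss is parameterized by $\Theta$, which represents the GNN model weights.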
The output of the pre-training phase is the optimal model parameters $\Theta_0 = \arg\min_{\Theta} \mathcal{L}_{\text{pre}}(\Theta)$. $\Theta_0$ can be used to initialize the GNN weights for downstream tasks, thus enabling the transfer of prior knowledge downstream.
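The triplet sampling and the link prediction objective can be sketched as follows. This is a hedged illustration that assumes a softmax-style contrastive loss over one positive and one negative candidate with temperature `tau`, summed over triplets; the function names `sample_triplet` and `pretrain_loss`, the adjacency dictionary, and the toy subgraph embeddings are hypothetical. In actual pre-training the subgraph embeddings would come from the GNN and the loss would be minimized with respect to its weights.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sample_triplet(adj, v, rng):
    """Sample (v, a, b): a is a neighbor of v, b is a node not linked to v."""
    positive = rng.choice(sorted(adj[v]))
    candidates = [u for u in sorted(adj) if u != v and u not in adj[v]]
    negative = rng.choice(candidates)
    return v, positive, negative


def pretrain_loss(triplets, sub_emb, tau=1.0):
    """Contrastive loss over triplets; sub_emb maps node -> subgraph vector."""
    loss = 0.0
    for v, a, b in triplets:
        pos = math.exp(cosine(sub_emb[v], sub_emb[a]) / tau)
        neg = math.exp(cosine(sub_emb[v], sub_emb[b]) / tau)
        loss -= math.log(pos / (pos + neg))
    return loss


adj = {0: {1}, 1: {0}, 2: {3}, 3: {2}}  # two disconnected edges
rng = random.Random(0)
triplets = [sample_triplet(adj, v, rng) for v in adj]

# Embeddings that agree with the links give a lower loss than ones that do not.
aligned = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.0, 1.0]}
misaligned = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0], 3: [0.0, 1.0]}
loss_aligned = pretrain_loss(triplets, aligned)
loss_misaligned = pretrain_loss(triplets, misaligned)
```

A smaller `tau` sharpens the softmax, so a fixed similarity gap between the positive and negative candidates is penalized or rewarded more strongly.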

Prompting for Downstream Tasks
The unification of pre-training and downstream tasks enables more effective knowledge transfer, as the tasks in the two phases are made more compatible by following a common template. However, it is still important to distinguish different downstream tasks, in order to capture task individuality and achieve task-specific optima.
To cope with this challenge, we propose a novel task-specific learnable prompt on graphs, inspired by prompting in natural language processing [4]. In language contexts, a prompt is initially a handcrafted instruction to guide the downstream task, which provides task-specific cues to extract relevant prior knowledge through a unified task template (typically, pre-training and downstream tasks are all mapped to masked language modeling). More recently, learnable prompts [22,24] have been proposed as an alternative to handcrafted prompts, to alleviate the high engineering cost of the latter.
Prompt design. Nevertheless, our proposal is distinct from language-based prompting for two reasons. Firstly, we have a different task template from masked language modeling. Secondly, since our prompts are designed for graph structures, they are more abstract and cannot take the form of language-based instructions. Thus, they are virtually impossible to handcraft. Instead, they should be topology-related to align with the core of graph learning. In particular, under the same task template of subgraph similarity learning, the ReadOut operation (used to generate the subgraph representation) can be "prompted" differently for different downstream tasks. Intuitively, different tasks can benefit from different aggregation schemes for their ReadOut. For instance, node classification pays more attention to features that are topically more relevant to the target node. In contrast, graph classification tends to focus on features that are correlated to the graph class. Moreover, the important features may also vary given different sets of instances or classes in a task.
Formally, let $\mathbf{p}_t$ denote a learnable prompt vector for a downstream task $t$, as shown in Fig. 2(c). The prompt-assisted ReadOut operation on a subgraph $S_x$ for task $t$ is
$$\mathbf{s}_{t,x} = \text{ReadOut}(\{\mathbf{p}_t \odot \mathbf{h}_v : v \in V(S_x)\}), \tag{12}$$
where $\mathbf{s}_{t,x}$ is the task $t$-specific subgraph representation, and $\odot$ denotes element-wise multiplication. That is, we perform a feature-weighted summation of the node representations from the subgraph, where the prompt vector $\mathbf{p}_t$ is a dimension-wise reweighting in order to extract the most relevant prior knowledge for the task $t$.
Note that other prompt designs are also possible. For example, we could consider a learnable prompt matrix $\mathbf{P}_t$, which applies a linear transformation to the node representations:
$$\mathbf{s}_{t,x} = \text{ReadOut}(\{\mathbf{P}_t \mathbf{h}_v : v \in V(S_x)\}). \tag{13}$$
More complex prompts such as an attention layer are another alternative. However, one of the main motivations for prompting instead of fine-tuning is to reduce reliance on labeled data. In few-shot settings, given very limited supervision, prompts with fewer parameters are preferred to mitigate the risk of overfitting. Hence, the feature weighting scheme in Eq. (12) is adopted for our prompting, as the prompt is a single vector of the same length as the node representation, which is typically a small number (e.g., 128).
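The feature-weighting prompt can be sketched in a few lines: each node embedding is multiplied element-wise by the prompt vector before sum pooling. The sketch below is illustrative only; the function name `prompted_readout` and the toy numbers are hypothetical, and sum pooling is assumed as the ReadOut scheme. The last lines compare the parameter counts of a vector prompt and a matrix prompt for a 128-dimensional representation.

```python
def prompted_readout(node_embeddings, prompt):
    """Sum pooling after reweighting every embedding dimension by the prompt."""
    return [sum(p * h[i] for h in node_embeddings) for i, p in enumerate(prompt)]


subgraph_nodes = [[1.0, 2.0], [3.0, 4.0]]

# An all-ones prompt reduces to plain sum pooling.
plain = prompted_readout(subgraph_nodes, [1.0, 1.0])

# A task-specific prompt can emphasize or suppress individual dimensions.
reweighted = prompted_readout(subgraph_nodes, [0.5, 0.0])

# Parameter counts for an embedding dimension of 128.
d = 128
vector_prompt_params = d       # one weight per dimension
matrix_prompt_params = d * d   # a full linear transformation
```

With 128 dimensions the vector prompt has 128 learnable values while a matrix prompt would have 16,384, which illustrates why the lighter design is attractive when only a handful of labels are available.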
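Prompt tuning. To optimize the learnable prompt, also known as prompt tuning, we formulate the loss based on the common template of subgraph similarity, using the prompt-assisted task-specific subgraph representations. Formally, consider a task $t$ with a labeled training set $\mathcal{T}_t = \{(x_1, y_1), (x_2, y_2), \ldots\}$, where $x_i$ is an instance (i.e., a node or a graph), and $y_i \in Y$ is the class label of $x_i$ among the set of classes $Y$. The loss for prompt tuning is defined as
$$\mathcal{L}_{\text{prompt}}(\mathbf{p}_t) = -\sum_{(x_i, y_i) \in \mathcal{T}_t} \ln \frac{\exp(\text{sim}(\mathbf{s}_{t,x_i}, \tilde{\mathbf{s}}_{t,y_i})/\tau)}{\sum_{c \in Y} \exp(\text{sim}(\mathbf{s}_{t,x_i}, \tilde{\mathbf{s}}_{t,c})/\tau)}, \tag{14}$$
where the class prototypical subgraph for class $c$ is represented by $\tilde{\mathbf{s}}_{t,c}$, which is also generated by the prompt-assisted, task-specific ReadOut.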
Note that the prompt tuning loss is only parameterized by the learnable prompt vector $\mathbf{p}_t$, without the GNN weights. Instead, the pre-trained GNN weights $\Theta_0$ are frozen for downstream tasks, as no fine-tuning is necessary. This significantly decreases the number of parameters to be updated downstream, thus not only improving the computational efficiency of task learning and inference, but also reducing the reliance on labeled data.
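The following sketch illustrates prompt tuning with frozen node embeddings: only the prompt vector changes, while the embeddings (standing in for the output of the frozen pre-trained GNN) stay fixed. It is a toy illustration under several assumptions: a softmax loss over cosine similarities to class prototypes, sum pooling as ReadOut, and a finite-difference gradient used purely to keep the example dependency-free. A real implementation would use automatic differentiation. All function names and the toy data are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prompted_readout(node_embeddings, prompt):
    return [sum(p * h[i] for h in node_embeddings) for i, p in enumerate(prompt)]


def prompt_loss(prompt, train_set, tau=1.0):
    """train_set: list of (frozen node embeddings of the instance's subgraph, label)."""
    reps = [(prompted_readout(nodes, prompt), y) for nodes, y in train_set]
    by_class = {}
    for s, y in reps:
        by_class.setdefault(y, []).append(s)
    # Class prototypes are also built from the prompt-assisted representations.
    protos = {c: [sum(col) / len(vs) for col in zip(*vs)] for c, vs in by_class.items()}
    loss = 0.0
    for s, y in reps:
        scores = {c: math.exp(cosine(s, proto) / tau) for c, proto in protos.items()}
        loss -= math.log(scores[y] / sum(scores.values()))
    return loss


def tune_prompt(prompt, train_set, steps=20, lr=0.1, eps=1e-5):
    """Gradient descent on the prompt only (central finite differences)."""
    prompt = list(prompt)
    for _ in range(steps):
        grad = []
        for i in range(len(prompt)):
            up = prompt[:i] + [prompt[i] + eps] + prompt[i + 1:]
            down = prompt[:i] + [prompt[i] - eps] + prompt[i + 1:]
            grad.append((prompt_loss(up, train_set) - prompt_loss(down, train_set)) / (2 * eps))
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt


# 1-shot toy task: dimension 0 is shared by both classes, dimension 1 separates them.
train_set = [([[1.0, 1.0]], "A"), ([[1.0, -1.0]], "B")]
initial_prompt = [1.0, 1.0]
tuned_prompt = tune_prompt(initial_prompt, train_set)
loss_before = prompt_loss(initial_prompt, train_set)
loss_after = prompt_loss(tuned_prompt, train_set)
```

In this toy task the tuned prompt shrinks the weight of the shared dimension and grows the weight of the discriminative one, so the two classes become easier to separate without touching the embeddings themselves.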

EXPERIMENTS
In this section, we conduct extensive experiments including node classification and graph classification as downstream tasks on five benchmark datasets to evaluate the proposed GraphPrompt.

Experimental Setup
Datasets. We employ five benchmark datasets for evaluation. (1) Flickr [47] is an image sharing network. (2) PROTEINS [3] is a collection of protein graphs which include the amino acid sequence, conformation, structure, and features such as active sites of the proteins. (3) COX2 [35] is a dataset of molecular structures including 467 cyclooxygenase-2 inhibitors. (4) ENZYMES [46] is a dataset of 600 enzymes collected from the BRENDA enzyme database. (5) BZR [35] is a collection of 405 ligands for the benzodiazepine receptor.
We summarize these datasets in Table 1, and present further details in Appendix B. Note that the "Task" column indicates the type of downstream task performed on each dataset: "N" for node classification and "G" for graph classification.
Baselines. We evaluate GraphPrompt against the state-of-the-art approaches from three main categories, as follows. (1) End-to-end graph neural networks: GCN [20], GraphSAGE [13], GAT [43] and GIN [50]. They capitalize on the key operation of neighborhood aggregation to recursively aggregate messages from the neighbors, and work in an end-to-end manner. (2) Graph pre-training models: DGI [44], InfoGraph [38], and GraphCL [53]. They work in the "pre-train, fine-tune" paradigm. In particular, they pre-train the GNN models to preserve the intrinsic graph properties, and fine-tune the pre-trained weights on downstream tasks to fit task labels. (3) Graph prompt models: GPPT [39]. GPPT utilizes a link prediction task for pre-training, and resorts to a learnable prompt for the node classification task, which is mapped to a link prediction task.
Note that other few-shot learning methods on graphs, such as Meta-GNN [57] and RALE [26], adopt a meta-learning paradigm [9]. Thus, they cannot be used in our setting, as they require labeled data in their base classes for the meta-training phase. In our approach, only label-free graphs are utilized for pre-training.
Settings and parameters. To evaluate the goal of our GraphPrompt in realizing a unified design that can suit different downstream tasks flexibly, we consider two typical types of downstream tasks, i.e., node classification and graph classification. In particular, for the datasets which are suitable for both of these tasks, i.e., PROTEINS and ENZYMES, we only pre-train the GNN model once on each dataset, and utilize the same pre-trained model for the two downstream tasks with their task-specific prompting.
The downstream tasks follow a $k$-shot classification setting. For each type of downstream task, we construct a series of $k$-shot classification tasks. The details of task construction will be elaborated later when reporting the results in Sect. 5.2. For task evaluation, as the $k$-shot tasks are balanced classification, we employ accuracy as the evaluation metric following earlier work [26,45].
For all the baselines, based on the authors' code and default settings, we further tune their hyper-parameters to optimize their performance. We present more implementation details of the baselines and our GraphPrompt in Appendix D.

Performance Evaluation
As discussed, we perform two types of downstream tasks different from the link prediction task in pre-training, namely, node classification and graph classification in few-shot settings. We first evaluate on a fixed-shot setting, and then vary the shot numbers to see the performance trend.
Few-shot node classification. We conduct this node-level task on three datasets, i.e., Flickr, PROTEINS, and ENZYMES. Following a typical $k$-shot setup [26,45,57], we generate a series of few-shot tasks for model training and validation. In particular, for PROTEINS and ENZYMES, on each graph we randomly generate ten 1-shot node classification tasks (i.e., in each task, we randomly sample 1 node per class) for training and validation, respectively. Each training task is paired with a validation task, and the remaining nodes not sampled by the pair of training and validation tasks will be used for testing. For Flickr, as it contains a large number of very sparse node features, selecting very few shots for training may result in inferior performance for all the methods. Therefore, we randomly generate ten 50-shot node classification tasks, for training and validation, respectively. On Flickr, 50 shots are still considered few, accounting for less than 0.06% of all nodes on the graph.
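The task construction described above can be sketched as follows: sample `k` nodes per class for a training task, another `k` per class for the paired validation task, and use all remaining nodes for testing. This is a hedged illustration; the function name `sample_k_shot_task`, the assumption that training and validation samples are disjoint, and the toy label assignment are all hypothetical rather than the authors' exact procedure.

```python
import random


def sample_k_shot_task(labels, k, rng):
    """labels: dict node -> class label.
    Returns (train, val, test) lists of nodes, with k nodes per class in
    train and in val (assumed disjoint here) and the rest in test."""
    by_class = {}
    for node, c in labels.items():
        by_class.setdefault(c, []).append(node)
    train, val = [], []
    for c in sorted(by_class):
        picked = rng.sample(by_class[c], 2 * k)  # needs at least 2k nodes per class
        train += picked[:k]
        val += picked[k:]
    used = set(train) | set(val)
    test = [n for n in labels if n not in used]
    return train, val, test


# Toy graph with 30 nodes evenly spread over 3 classes.
labels = {node: node % 3 for node in range(30)}
rng = random.Random(0)

# Ten paired 1-shot training/validation tasks, as in the setup above.
tasks = [sample_k_shot_task(labels, 1, rng) for _ in range(10)]
train, val, test = tasks[0]
```

Because every class contributes the same number of samples, each task is balanced, which is why plain accuracy is a reasonable metric in this setting.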
Table 2 reports the results of few-shot node classification. We make the following observations. First, our proposed GraphPrompt outperforms all the baselines across the three datasets, demonstrating the effectiveness of GraphPrompt in transferring knowledge from pre-training to downstream tasks. In particular, by virtue of the unification framework and the prompt-based task-specific aggregation in the ReadOut function, GraphPrompt is able to narrow the gap between pre-training and downstream tasks, and to guide the downstream tasks to exploit the pre-trained model in a task-specific manner. Second, compared to graph pre-training models, end-to-end GNN models can sometimes achieve comparable or even better performance. This implies that the discrepancy between the pre-training and downstream tasks in these pre-training approaches obstructs the knowledge transfer from the former to the latter. In such a case, even with sophisticated pre-training, they cannot effectively improve the performance of downstream tasks. Third, the graph prompt model GPPT is only comparable to or even worse than the other baselines, despite also using prompts. A potential reason is that GPPT requires many more learnable parameters in its prompts than ours, which may not work well given very few shots (e.g., 1-shot).
Few-shot graph classification. We further conduct few-shot graph classification on four datasets, i.e., PROTEINS, COX2, ENZYMES, and BZR. For each dataset, we randomly generate 100 5-shot classification tasks for training and validation, following a process similar to that for node classification tasks. We report the results of few-shot graph classification in Table 3, and make the following observations. First, our proposed GraphPrompt significantly outperforms the baselines on these four datasets. This again demonstrates the necessity of unifying pre-training and downstream tasks, and the effectiveness of prompt-assisted task-specific aggregation for ReadOut. Second, as both node and graph classification tasks share the same pre-trained model on PROTEINS and ENZYMES, the superior performance of GraphPrompt on both types of tasks further demonstrates that the gap between different tasks is well addressed by our unification framework. Third, the graph pre-training models generally achieve better performance than the end-to-end GNN models. This is because both InfoGraph and GraphCL capitalize on graph-level tasks for pre-training, which are naturally closer to the downstream graph classification.
Performance with different shots. We study the impact of the number of shots on the PROTEINS and ENZYMES datasets. For node classification, we vary the number of shots between 1 and 10, and compare with several competitive baselines (i.e., GIN, DGI, GraphCL, and GPPT) in Fig. 3. For few-shot graph classification, we vary the number of shots between 1 and 30, and compare with competitive baselines (i.e., GIN, InfoGraph, and GraphCL) in Fig. 4. The task settings are identical to those stated earlier.
In general, our proposed GraphPrompt consistently outperforms the baselines, especially with fewer shots. For node classification, as the number of nodes in each graph is relatively small, 10 shots per class might be sufficient for semi-supervised node classification. Nevertheless, GraphPrompt remains competitive even with 10 shots. For graph classification, GraphPrompt can be surpassed by some baselines when given more shots (e.g., 20 or more), especially on ENZYMES. On this dataset, 30 shots per class implies that 30% of the 600 graphs are used for training, which is not our target scenario.
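The 30% figure follows from simple arithmetic, assuming the standard six enzyme classes of ENZYMES (the class count is not stated in this section):

```latex
\[
\frac{30 \text{ shots per class} \times 6 \text{ classes}}{600 \text{ graphs}}
  = \frac{180}{600} = 30\%
\]
```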

Model Analysis
We further analyze several aspects of our model. Due to space constraints, we only report the ablation and parameter efficiency studies, and leave the rest to Appendix E.
Ablation study. To evaluate the contribution of each component, we conduct an ablation study by comparing GraphPrompt with different prompting strategies: (1) no prompt: for downstream tasks, we remove the prompt vector, and conduct classification by employing a classifier on the subgraph representations obtained by a direct sum-based ReadOut. (2) lin. prompt: we replace the prompt vector with a linear transformation matrix in Eq. (13).
We conduct the ablation study on three datasets for node classification (Flickr, PROTEINS, and ENZYMES) and three for graph classification (COX2, ENZYMES, and BZR), and illustrate the comparison in Fig. 5. We make the following observations. (1) Without the prompt vector, no prompt usually performs the worst among the variants, showing the necessity of prompting the ReadOut operation differently for different downstream tasks. (2) Converting the prompt vector into a linear transformation matrix also hurts the performance, as the matrix involves more parameters and thus increases the reliance on labeled data.
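To make the three variants concrete, the sketch below contrasts them on a toy subgraph. Eq. (13) is not reproduced in this section, so the element-wise form of the prompted ReadOut, as well as all function and variable names, are assumptions made for illustration only.

```python
import numpy as np


def readout_no_prompt(h):
    """Plain sum over node embeddings h of shape (num_nodes, d)."""
    return h.sum(axis=0)


def readout_prompt(h, p):
    """Sum after rescaling each feature dimension by a prompt vector p of shape (d,)."""
    return (h * p).sum(axis=0)


def readout_lin_prompt(h, W):
    """Sum after applying a linear map W of shape (d, d) to every node embedding."""
    return (h @ W).sum(axis=0)


d = 32                          # hidden dimension reported in Appendix D
rng = np.random.default_rng(0)
h = rng.normal(size=(5, d))     # toy subgraph with 5 nodes
p = np.ones(d)                  # prompt vector: d tunable parameters
W = np.eye(d)                   # linear prompt: d * d tunable parameters

print(p.size, W.size)  # 32 1024
```

With a hidden dimension of 32, the linear variant tunes 1024 values where the prompt vector tunes 32, which is consistent with the observation that the matrix relies more heavily on labeled data.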
Parameter efficiency. We also compare the number of parameters that need to be updated in a downstream node classification task for a few representative models, as well as their number of floating point operations (FLOPs), in Table 4.
In particular, as GIN works in an end-to-end manner, it naturally involves the largest number of parameters to update. GPPT requires a separate learnable vector for each class as its representation, and an attention module to weigh the neighbors for aggregation in the structure token generation. Therefore, GPPT needs to update more parameters than GraphPrompt, which is one factor that impairs its performance in downstream tasks. Our proposed GraphPrompt not only outperforms the baselines GIN and GPPT, as we have seen earlier, but also requires the fewest parameters and FLOPs for downstream tasks. For illustration, if we also fine-tune the pre-trained weights instead of freezing them, in addition to prompt tuning (denoted GraphPrompt+ft), there will be significantly more parameters to update.

CONCLUSIONS
In this paper, we studied the research problem of prompting on graphs and proposed GraphPrompt, in order to overcome the limitations of graph neural networks in the supervised or "pre-train, fine-tune" paradigms. In particular, to narrow the gap between pre-training and downstream objectives on graphs, we introduced a unification framework by mapping different tasks to a common task template. Moreover, to distinguish task individuality and achieve task-specific optima, we proposed a learnable task-specific prompt vector that guides each downstream task to make full use of the pre-trained model. Finally, we conducted extensive experiments on five public datasets, and showed that GraphPrompt significantly outperforms various state-of-the-art baselines.
• GPPT [39]. GPPT pre-trains a GNN model based on the link prediction task, and employs a learnable prompt to reformulate the downstream node classification task into the same format as link prediction.

D Further Implementation Details
For the baseline GCN [20], we employ a 3-layer architecture, and set the hidden dimension to 32. For GraphSAGE [13], we utilize the mean aggregator, and employ a 3-layer architecture. The hidden dimension is also set to 32. For GAT [43], we employ a 2-layer architecture and set the hidden dimension to 32. Besides, we apply 4 attention heads in the first GAT layer. Similarly, for GIN [50], we also employ a 3-layer architecture and set the hidden dimension to 32. For the pre-training and prompting approaches, we use the backbones in their original papers. Specifically, for DGI [44], we use a 1-layer GCN as the backbone, and set the hidden dimension to 512. Besides, we utilize PReLU as the activation function. For InfoGraph [38], we use a 3-layer GIN as the backbone, and set its hidden dimension to 32. For GraphCL [53], we also employ a 3-layer GIN as its backbone, and set the hidden dimension to 32. In particular, we choose the augmentations of node dropping and subgraph, with a default augmentation ratio of 0.2. For GPPT [39], we utilize a 2-layer GraphSAGE as the backbone, set its hidden dimension to 128, and utilize the mean aggregator. For our proposed GraphPrompt, we employ a 3-layer GIN as the backbone, and set the hidden dimension to 32. In addition, we set the number of hops to 1 to construct 1-hop subgraphs for the nodes.
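For quick reference, the settings listed above can be collected into a single configuration mapping. The values are transcribed from the paragraph; the dictionary name and key names are hypothetical and do not come from any released code.

```python
# Hyper-parameters transcribed from the paragraph above; key names are hypothetical.
MODEL_CONFIGS = {
    "GCN": {"layers": 3, "hidden_dim": 32},
    "GraphSAGE": {"layers": 3, "hidden_dim": 32, "aggregator": "mean"},
    "GAT": {"layers": 2, "hidden_dim": 32, "first_layer_heads": 4},
    "GIN": {"layers": 3, "hidden_dim": 32},
    "DGI": {"backbone": "GCN", "layers": 1, "hidden_dim": 512,
            "activation": "PReLU"},
    "InfoGraph": {"backbone": "GIN", "layers": 3, "hidden_dim": 32},
    "GraphCL": {"backbone": "GIN", "layers": 3, "hidden_dim": 32,
                "augmentations": ["node_dropping", "subgraph"],
                "aug_ratio": 0.2},
    "GPPT": {"backbone": "GraphSAGE", "layers": 2, "hidden_dim": 128,
             "aggregator": "mean"},
    "GraphPrompt": {"backbone": "GIN", "layers": 3, "hidden_dim": 32,
                    "subgraph_hops": 1},
}
```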

E Further Experimental Results
Scalability study. We investigate the scalability of GraphPrompt on the dataset PROTEINS for graph classification. We divide the graphs into six groups based on their size (i.e., number of nodes). The size of graphs in each group is approximately 50, 60, ..., 100 nodes. We sample 10 graphs from each group, and record the prompt tuning time on the 10 graphs in each epoch. The results are presented in Fig. 6. Note that we also report the tuning time for GraphPrompt-ft, a variant of GraphPrompt that fine-tunes all the parameters, including the pre-trained GNN weights. We first observe that the tuning time of our GraphPrompt increases linearly as the graph size increases, demonstrating the scalability of GraphPrompt on larger graphs. In addition, compared to GraphPrompt, GraphPrompt-ft needs more tuning time, showing the inefficiency of the fine-tuning paradigm.
Parameter sensitivity. We evaluate the sensitivity of two important hyperparameters in GraphPrompt, and show the impact in Figs. 7 and 8 for node classification and graph classification, respectively.
For the number of hops in subgraph construction, the performance on node classification gradually decreases as the number of hops increases. This is because a larger subgraph tends to bring in irrelevant information for the target node, and may suffer from the over-smoothing issue [6]. On the other hand, for graph classification, the number of hops only affects the pre-training stage, as the whole graph is used in downstream classification. In this case, the number of hops does not show a clear trend, implying less impact on graph classification, since both small and large subgraphs are helpful in capturing substructure information at different scales.
For the hidden dimension, a smaller dimension, such as 32 or 64, is better for node classification. For graph classification, a slightly larger dimension, such as 64 or 128, might be better. Overall, 32 or 64 appears to be robust for both node and graph classification.
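As an illustration of how the number of hops controls the subgraph around a target node, the sketch below collects all nodes within a given hop distance using breadth-first search. The function name and the adjacency-dictionary format are assumptions made here, not the authors' implementation.

```python
from collections import deque


def hop_subgraph_nodes(adj, target, hops):
    """Return the set of nodes within `hops` edges of `target` via breadth-first search.

    adj: dict mapping node -> iterable of neighbours (hypothetical adjacency format).
    """
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if dist[node] == hops:
            continue  # do not expand beyond the hop limit
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return set(dist)


# Path graph 0 - 1 - 2 - 3 - 4
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(sorted(hop_subgraph_nodes(path, 2, 1)))  # [1, 2, 3]
print(sorted(hop_subgraph_nodes(path, 2, 2)))  # [0, 1, 2, 3, 4]
```

Even on this tiny path graph, moving from one hop to two hops pulls in every node, which hints at why larger subgraphs can dilute the information specific to the target node.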

F Data Ethics Statement
To evaluate the efficacy of this work, we conducted experiments that only use publicly available datasets, namely, Flickr, PROTEINS, COX2, ENZYMES and BZR, in accordance with their usage terms and conditions, if any. We further declare that no personally identifiable information was used, and no human or animal subject was involved in this research.

Figure 3: Impact of shots on few-shot node classification.

Figure 4: Impact of shots on few-shot graph classification.

• Graph classification: This is a graph-level task. Consider a set of graphs $\mathcal{G}$ with a set of graph classes $\mathcal{C}$, and a set of labeled graphs $\mathcal{D} = \{(G_1, L_1), (G_2, L_2), \ldots\}$, where $G_i \in \mathcal{G}$ and $L_i$ is the corresponding label of $G_i$. In the $k$-shot setting, there are exactly $k$ pairs of $(G_i, L_i = c) \in \mathcal{D}$ for every class $c \in \mathcal{C}$. Similar to node classification, for each class $c \in \mathcal{C}$, we define a graph class prototypical subgraph, also represented by the mean embedding vector of the (sub)graphs in $c$:
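The equation itself is not preserved in this excerpt. A minimal sketch of such a mean, writing $(G_i, L_i)$ for a labeled graph in $\mathcal{D}$, $\mathbf{s}_{G_i}$ for the embedding of $G_i$, and $\tilde{\mathbf{s}}_c$ for the prototype of class $c$ (all symbols assumed here for illustration), would read:

```latex
\[
\tilde{\mathbf{s}}_c = \frac{1}{k} \sum_{(G_i, L_i) \in \mathcal{D},\; L_i = c} \mathbf{s}_{G_i}
\]
```

Since the $k$-shot setting provides exactly $k$ labeled graphs per class, the sum runs over $k$ terms and the factor $1/k$ makes it an average.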

Table 1: Summary of datasets.

Table 2: Accuracy evaluation on node classification. Results are in percent, with the best bolded and the runner-up underlined.

Table 4: Study of parameter efficiency on node classification.