Text-Attributed Graph Representation Learning: Methods, Applications, and Challenges

Text documents are often connected in a graph structure, resulting in an important class of data called the text-attributed graph, e.g., paper citation graphs and Web page hyperlink graphs. On the one hand, Graph Neural Networks (GNNs) treat the text in each document as a generic vertex attribute and do not specifically model text data. On the other hand, Pre-trained Language Models (PLMs) and Topic Models (TMs) learn effective document embeddings, but most of them focus on the text content of each individual document and ignore the link adjacency across documents. These two challenges motivate the development of text-attributed graph representation learning, which combines GNNs with PLMs and TMs into a unified model and learns document embeddings that preserve both modalities, fulfilling applications such as text classification, citation recommendation, and question answering. In this lecture-style tutorial, we provide a systematic review of the text-attributed graph, including its formal definition, recent methods, diverse applications, and open challenges. Specifically, i) we formally define the text-attributed graph and briefly review GNNs, PLMs, and TMs, which are the fundamentals of many existing methods; ii) we then revisit the technical details of text-attributed graph models, which generally split into two categories, PLM-based and TM-based; iii) we further show diverse applications built on the text-attributed graph; iv) finally, we discuss challenges of existing models and propose directions for future research.


OVERVIEW
Description of the tutorial topic. Text documents are usually connected in a graph structure. For example, academic papers cite each other in a citation graph, and Web pages link to other pages in a hyperlink graph. We call such a class of data the text-attributed graph [2], which has two modalities: text content within documents and graph connections across documents. On the one hand, Graph Neural Networks (GNNs) [5] derive effective document embeddings by unifying both vertex attributes and graph connectivity. However, most models treat the text in documents as a generic attribute and do not specifically model text data. Consequently, they cannot capture language representations or linguistic semantics in text corpora.
On the other hand, Pre-trained Language Models (PLMs) [16] and Topic Models (TMs) [15, 31] learn contextualized language representations and document embeddings. However, existing methods mainly deal with the text within each individual document and ignore the graph adjacency across documents, e.g., citations and hyperlinks. Graph connectivity reveals topic similarity, and modeling it allows semantics to propagate across connected documents.
Motivated by these two challenges, existing works propose text-attributed graph representation learning, which combines GNNs with PLMs and TMs into a unified framework and infers document embeddings that preserve both contextualized textual semantics and graph connectivity. Such text-attributed graph methods fulfill different applications, such as text classification, citation recommendation, question answering, and document retrieval. Existing works generally fall into two categories, PLM-based and TM-based. This lecture-style tutorial covers recent developments in text-attributed graph representation learning, including both categories of methods, their applications, and future research directions.

Text-Attributed Graph and Preliminaries
2.1.1 Text-attributed graph. A text-attributed graph is denoted G = (D, E), where D is a set of documents (vertices) and E is a set of graph links, with (d_i, d_j) ∈ E if there is a link between documents d_i and d_j. We model an undirected graph, i.e., (d_i, d_j) = (d_j, d_i), though it is straightforward to extend to a directed graph. The neighbors of a document d are those directly linked to d, denoted N(d).
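To make the definition concrete, below is a minimal sketch, in plain Python, of how a toy text-attributed graph and its neighbor sets could be stored; the documents, edges, and variable names are illustrative assumptions, not part of any cited work.

```python
from collections import defaultdict

# Toy text-attributed graph G = (D, E): a document dictionary D
# plus an undirected edge list E.
documents = {
    0: "Graph neural networks aggregate neighbor features.",
    1: "Pre-trained language models learn contextual embeddings.",
    2: "Topic models infer latent topics from document corpora.",
}
edges = [(0, 1), (1, 2)]  # (d_i, d_j) in E; undirected, so (d_j, d_i) is implied

# Neighbor set N(d): documents directly linked to d.
neighbors = defaultdict(set)
for i, j in edges:
    neighbors[i].add(j)
    neighbors[j].add(i)

print(neighbors[1])  # {0, 2}
```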

2.1.2 Text-attributed graph representation learning. Given a text-attributed graph G as input, the goal of text-attributed graph representation learning is to design a method that outputs document embeddings Z_D = {z_d}_{d ∈ D}, preserving both the textual semantics in D and the graph connectivity E. Note that the method does not simply treat a document as a generic attribute; instead, it specifically models language representations in the rich text corpus D.

2.1.3 Extensions of text-attributed graph. There are several extensions of the text-attributed graph, such as its heterogeneous version, the textual-edge scenario, the temporal case, and the hierarchical graph.

2.1.4 Graph neural networks (GNNs). GNNs [5, 17] rely on neighbor aggregation, which incorporates both vertex attributes and graph connectivity. A text-attributed graph is a specific type of graph where each vertex is a document. In this tutorial, we will briefly introduce GNNs as a fundamental of the text-attributed graph.
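As a concrete illustration, the following is a minimal sketch of one neighbor-aggregation layer (GCN-style mean aggregation with a self-loop); the function name, shapes, and random inputs are our own illustrative assumptions, not a specific published model.

```python
import torch

def gnn_layer(X, A, W):
    """One neighbor-aggregation layer.
    X: (n, d) vertex features; A: (n, n) adjacency; W: (d, d_out) weights."""
    A_hat = A + torch.eye(A.size(0))       # add self-loops
    deg = A_hat.sum(dim=1, keepdim=True)   # vertex degrees
    H = (A_hat / deg) @ X                  # average each vertex with its neighbors
    return torch.relu(H @ W)               # linear transform + nonlinearity

n, d, d_out = 4, 8, 16
X = torch.randn(n, d)                      # toy vertex attributes
A = torch.tensor([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=torch.float)
W = torch.randn(d, d_out)
Z = gnn_layer(X, A, W)  # (4, 16) embeddings mixing attributes and connectivity
```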
2.1.5 Pre-trained Language Models (PLMs). PLMs [10, 14, 16] are designed for text data. Their multi-head attention captures language semantics and outputs contextualized document embeddings. We will explain representative PLMs as preliminaries.
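For illustration, here is a minimal sketch of deriving contextualized document embeddings from an off-the-shelf PLM via Hugging Face transformers; the choice of bert-base-uncased and mean pooling over tokens are assumptions for the example, not prescriptions from the tutorial.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

docs = ["Papers cite each other in a citation graph.",
        "Web pages link to other pages."]
batch = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch).last_hidden_state    # (batch, seq_len, hidden)

mask = batch["attention_mask"].unsqueeze(-1)  # ignore padding tokens
doc_emb = (out * mask).sum(1) / mask.sum(1)   # mean-pooled document embeddings
```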
2.1.6 Topic models (TMs). Topic models [4, 15, 28] are another class of models for text documents. They assume that documents are generated by a small number of latent topics, which summarize broad and distinct concepts. This tutorial will introduce TMs.
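As a toy illustration of the latent-topic assumption, the sketch below fits a classic LDA model with gensim on a three-document corpus; the corpus and the choice of two topics are illustrative assumptions, not settings from any method covered later.

```python
from gensim import corpora
from gensim.models import LdaModel

# Tiny tokenized corpus; real TMs in this tutorial extend this
# generative view to documents connected in a graph.
texts = [["graph", "neural", "network", "embedding"],
         ["language", "model", "attention", "embedding"],
         ["topic", "model", "latent", "document"]]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())               # topics as weighted word lists (interpretable)
print(lda.get_document_topics(bow[0]))  # a document's topic proportions
```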
Existing works on text-attributed graph representation learning mainly fall into two categories, PLM-based and TM-based.

PLM-based Text-Attributed Graph Models
We first outline the selected PLM-based models to be included in this tutorial in Table 1 (2nd category). Note that there are more PLM-based models for the text-attributed graph, but to keep this tutorial concrete and focused, we present selected and representative works.

2.2.1 PLMs for static text-attributed graph. The most basic scenario is a static text-attributed graph where we observe the whole graph at once. We organize static models into two subsets.
• Cascaded architecture. Documents are first independently encoded by PLMs, whose outputs are then aggregated by GNNs to obtain the final document embeddings. We call such a method the cascaded architecture, which has been widely adopted, e.g., TextGNN [37], AdsGNN [11], and GEAR [36] (see the sketch after this list).
• Nested architecture. A drawback of the cascaded architecture is that document encoding and graph aggregation are performed separately, which cannot fully unify documents and the graph. The nested architecture iteratively performs document encoding and graph aggregation by nesting GNN layers alongside PLM layers, e.g., GraphFormers [23] and GLEM [35].
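To make the cascaded architecture concrete, below is a minimal sketch, not the implementation of TextGNN, AdsGNN, or GEAR: a frozen PLM first encodes each document independently, and a single mean-aggregation step then mixes the PLM outputs over the graph. The documents, edges, and use of the [CLS] embedding are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
plm = AutoModel.from_pretrained("bert-base-uncased")

docs = ["GNNs aggregate neighbors.", "PLMs encode text.", "TMs infer topics."]
edges = [(0, 1), (1, 2)]

# Step 1 (PLM): encode each document independently.
batch = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    X = plm(**batch).last_hidden_state[:, 0]  # [CLS] embedding per document

# Step 2 (GNN): mean-aggregate PLM outputs over the graph.
A = torch.zeros(len(docs), len(docs))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0                   # undirected adjacency
A_hat = A + torch.eye(len(docs))              # add self-loops
Z = (A_hat / A_hat.sum(1, keepdim=True)) @ X  # graph-aware document embeddings
```

A nested architecture would instead interleave such aggregation steps between the PLM's own layers, rather than applying them only after the full document encoding.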

2.2.2 PLMs for heterogeneous text-attributed graph. Documents are usually associated with auxiliary data, e.g., academic papers with authors and venues [29]. Even for the same pair of linked documents (e.g., products on an e-commerce platform), there may exist multiple types of edges (e.g., appearing in the same cart and sharing the same brand). These scenarios extend the text-attributed graph to its heterogeneous version. Heterformer [9] captures such heterogeneity.

2.2.3 Textual-edge text-attributed graph. In most graphs, documents appear on vertices. However, in many cases edges also carry documents. For example, edges on an email communication graph are coupled with textual email content. This observation extends the text-attributed graph to the textual-edge graph. Edgeformers [8] can infer both edge (document) embeddings and vertex embeddings.

TM-based Text-Attributed Graph Models
Another important class of text-attributed graph models is built on Topic Models (TMs). One key difference from PLMs is that TMs assume a small number of latent topics, which summarize distinct and broad concepts, to generate document content, while PLMs do not have latent topics. One advantage of the latent topic structure is that it offers semantic interpretability for document embeddings. We organize some recent topic models in Table 1 (3rd category).

2.3.1 TMs for static text-attributed graph. Besides PLMs, there are also TMs designed for the static text-attributed graph.

2.3.2 Temporal text-attributed graph. Documents are created over time, and a text-attributed graph is the accumulation of a temporal process. For example, academic papers published over the years cite existing papers. Modeling time can better preserve semantics. The dynamic topic model NetDTM [27] is the first designed for this temporal scenario.

2.3.3 TMs for heterogeneous text-attributed graph. As in Sec. 2.2.2, a text-attributed graph may have its heterogeneous version. Besides PLMs, there are also topic models for heterogeneity. VGATM [32] designs a multi-layered text graph for topic modeling.

2.3.4 Hierarchical text-attributed graph. Connectivity across documents often exhibits a hierarchical graph structure, e.g., an academic paper is extended by follow-up works, which are then further developed by other papers. Hierarchical topic models, e.g., HGTM [34], are developed to preserve such a scenario.

Applications and Challenges
We outline real-world applications built on text-attributed graph (Table 1 (4th category)), as well as its challenges for future research.

2.4.1 Application 1: Text classification. Predicting the categories of academic papers helps researchers navigate within a website. Most text classification models focus on textual content only. SemiVN [30], G2P2 [19], and HGTM [34] incorporate document connectivity for text classification.
2.4.2 Application 2: Citation recommendation. The academic citation graph is a specific example of the text-attributed graph. Some models, such as NRTM [1] and GNCTM [22], work on citation graphs to make personalized citation recommendations to researchers.
2.4.3 Application 3: Question answering. Answering a user-specified question requires a model to reason over textual content. Document dependencies provide auxiliary data and improve reasoning accuracy. LinkBERT [24] achieves this goal by pretraining on a text-attributed graph and finetuning on question answering.
2.4.4 Application 4: Document retrieval. Given a text query from online users, a search engine aims to retrieve a list of related documents. HARP [12] proposes to use Webpage hyperlinks as complementary information for more accurate retrieval.

2.4.5 Challenges and future research directions. Existing models still face some limitations; we list two of them. 1. Explainability. Besides achieving promising results, we are also curious about why a model makes certain predictions and how to explain its behavior. GNNExplainer [25] pioneered this research by providing explanations for GNNs on general graphs, but it lacks a PLM or TM component for text-attributed graph explanation. A future direction is to design a model that jointly incorporates GNNs and PLMs/TMs and provides explanations on the text-attributed graph (e.g., which latent topics are important for a prediction).
2. Hierarchical pretraining on text-attributed graph. Existing text-attributed graph pretraining treats all documents equally. However, in many cases documents present a hierarchical rather than a flat structure. For example, survey papers summarize a broad area, while regular papers deal with specific problems. Modeling such a document hierarchy can better preserve textual semantics.

TUTORIAL SCHEDULE
We will provide a comprehensive review of the text-attributed graph. The tutorial schedule is based on the outline in Sec. 2, with a total length of 180 min. Specifically, this tutorial is organized as in Table 1.
Intended audience. This tutorial targets both practitioners looking for an introductory lecture and researchers interested in recent and future research directions of the text-attributed graph, PLM-based and TM-based text-attributed graph models, and their applications.
Prerequisite knowledge.No specialized prerequisite knowledge is required, but it is helpful if the audience already has a basic background in data mining and graph data.
Potential learning outcomes.The audience is expected to understand the definition of text-attributed graph, the technical details of existing methods, and their real-world applications.
Materials.The tutorial will be mainly based on slides prepared by the presenters.We will also make a video recording of the whole tutorial presentation for the audience to replay and review.

Table 1: Tutorial outline, representative research works to be included in the tutorial, and schedule.
PRESENTERS
Delvin Ce Zhang is a Postdoctoral Researcher at Yale University. His research focuses on graph neural networks, pre-trained language models, and topic modeling. His works are published in KDD, NeurIPS, ICML, AAAI, and TKDE. His presentation won the Best Oral Talk Award Runner-up at the Singapore ACM SIGKDD Symposium 2023. He independently taught an undergraduate course throughout a whole term at Singapore Management University in 2023.

Menglin Yang is a Postdoctoral Researcher at Yale University. Prior to that, he obtained his Ph.D. from The Chinese University of Hong Kong. His research interests include graph representation learning, non-Euclidean geometric learning, recommender systems, and large language models. He has organized tutorials at top-tier conferences, including KDD 2023 and ECML-PKDD 2022.

Rex Ying is an Assistant Professor at Yale University. His research areas include algorithms for graph neural networks, geometric embeddings, and explainable models. He is the author of many widely used GNN algorithms, such as GraphSAGE, PinSAGE, and GNNExplainer. Rex has worked on a variety of applications. He has served as a committee member of AAAI, ICML, NeurIPS, KDD, and WWW for 5 years, and as an area chair for LoG 2022.

Hady W. Lauw is an Associate Professor at Singapore Management University and the current Chair of the Singapore Chapter of ACM SIGKDD (KDD.SG). He publishes actively on AI and text mining, earning a Distinguished Paper Award at IJCAI-20 and an Outstanding Paper Nomination at AAAI-14. He has conducted tutorials at major conferences, including RecSys-21, AAAI-19, IJCAI-11, and CIKM-10. He has more than 10 years of university teaching experience.