Robust Data-centric Graph Structure Learning for Text Classification

Over the past decades, text classification has undergone remarkable evolution across diverse domains. Despite these advancements, most existing model-centric methods in text classification cannot generalize well on class-imbalanced datasets that contain high-similarity textual information. Instead of developing new model architectures, data-centric approaches enhance performance by manipulating the data structure. In this study, we aim to investigate robust data-centric approaches that can help text classification on our collected dataset, the metadata of survey papers about Large Language Models (LLMs). In the experiments, we explore four paradigms and observe that leveraging arXiv's co-category information on graphs can help robustly classify the text data, outperforming the other three paradigms: conventional machine-learning algorithms, fine-tuning of pre-trained language models, and zero-shot/few-shot classification using LLMs.


INTRODUCTION
Text classification, as a fundamental task in natural language processing (NLP), has undergone significant evolution over the past few decades in many application fields, such as context understanding [20,21,42], content debiasing [61,62], spam detection [2], and taxonomy generation [25]. Conventional methods transform the text via sparse feature representations, e.g., the bag-of-words model [53]. Recently, deep-learning-based approaches, such as long short-term memory (LSTM) [15], have been widely applied to better learn text representations. Subsequent improvements [22,51] attempt to capture the long-range dependencies of words for textual understanding. Most of these methods improve performance from the angle of model architecture but cannot generalize well on specific types of text data, such as class-imbalanced data [33] or high-similarity data [31], which are commonly seen in daily life.

[Figure 1 overview: Data Collection (collect the metadata of survey papers); Data Labeling (assign papers to corresponding categories in the proposed taxonomy); Data Construction (construct various types of data, such as graphs and text); Data Evaluation (evaluate the data quality in different paradigms); Data Visualization (visualize the data); Data Storage (store the data for future retrieval).]
Recent studies reveal that data-centric approaches could be a potential solution to enhance text classification performance [3,16]. Compared to model-centric approaches, which aim to design a well-generalized model on the given datasets, data-centric approaches usually optimize the model's outputs by manipulating the dataset [52]. In this study, we aim to investigate robust data-centric approaches that can help improve text classification performance on class-imbalanced datasets that contain similar textual information. To illustrate our data-centric approaches, we present the overall process in Figure 1. Our process is mainly divided into two stages, data development and data assessment. In the data development stage, we initially collected the metadata of survey papers on Large Language Models (LLMs) until November 30, 2023, and then assigned each paper to the corresponding category in our newly proposed taxonomy. In our collected dataset, on the one hand, the distribution of each category is not uniform, which leads to a substantial class imbalance issue. On the other hand, authors usually use similar terminologies to describe LLMs in the titles and abstracts of these survey papers. Such textual similarity introduces significant difficulties for text classification. To address these two challenges, we conduct investigations into various types of data, such as attributed graphs and text data. In the data assessment stage, we first evaluate which types of data can yield superior classifications in four paradigms: conventional machine-learning algorithms, graph structure learning, fine-tuning of pre-trained language models, and zero-shot/few-shot classification using LLMs. Our evaluations reveal that leveraging the graph structure information of co-category graphs yields better classification than the other three paradigms. After evaluating the data, we visualize various graph structures to illustrate the effectiveness of graph structure learning on co-category graphs. Last, we store our datasets for future retrieval. Overall, our primary contributions can be summarized as follows:
• We are the first to investigate data-centric approaches that can help text classification on class-imbalanced datasets that contain similar textual information.
• We are the first to collect the metadata of 112 literature reviews about Large Language Models (LLMs) and propose a new taxonomy for these papers.
• Extensive experiments indicate that graph structure learning on co-category graphs can robustly classify the text data and substantially outperform the other three paradigms.

RELATED WORK

Data-centric Artificial Intelligence (AI)
The success of AI models is inseparable from a large amount of high-quality annotated data [32,54]. Compared to improving AI models, an increasing number of research works are dedicated to developing frameworks, commonly named data-centric AI approaches, that can iteratively improve the data quality for AI systems [52]. Most related papers can be divided into two categories, automatic approaches and collaborative approaches [52]. The automatic approaches aim to automate the process of data manipulation, whereas the collaborative approaches involve human collaboration. Within the former category, the majority of works are classified based on the types of approaches, such as programming-based methods [26,28], learning-based methods [18,43], and pipeline-based methods [12,38]. In the latter category, most works are assigned based on the extent of human involvement, such as full collaboration [27] or partial collaboration [4].

Graph Structure Learning
Graph Neural Networks (GNNs) have been widely used for graph structure learning [7,8,19,46,49,55,56,58,59]. Bruna et al. [6] first extended convolution operations to graphs using both spatial and spectral methods. To improve the efficiency of the eigendecomposition of the graph Laplacian matrix, Defferrard et al. [10] approximate spectral filters using K-order Chebyshev polynomials. Kipf et al. [23] simplify graph convolutions to a first-order polynomial while achieving state-of-the-art performance for semi-supervised learning. Hamilton et al. [13] propose an inductive-learning approach that aggregates node features from corresponding fixed-size local neighborhoods. These GNNs have been proven to achieve extraordinary performance in graph structure learning.

Text Classification
Text classification has been widely studied in recent years [24,47,48,60]. In the late 20th century, machine-learning models were initially developed to classify text data [39]. Since 2017, the Transformer has kicked off the era of large language models and achieved a huge breakthrough in text understanding [45]. On the one hand, by harnessing the power of the Transformer [45], BERT [22] can better learn bidirectional representations, significantly enhancing performance across a wide range of contextual understanding tasks [35-37]. Subsequent improvements, such as RoBERTa [34], DistilBERT [41], and Albert [29], made substantial contributions in this direction. On the other hand, inspired by the Transformer [45], researchers at OpenAI introduced a series of Generative Pre-Training (GPT) models, such as GPT-1 [40], that integrate unsupervised pre-training with supervised fine-tuning. With iterative enhancements, GPT-3 achieved human-level classification performance on several NLP benchmarks [5]. GPT-4 extended these capabilities to multi-modal learning and obtained remarkable advancements, leading the development of large language models [1]. Besides employing language models, Yao et al. [51] first explored leveraging graph neural networks for text classification, which sparked enthusiasm for better understanding textual information via graph structure learning [17].

(Dataset and source codes: https://github.com/junzhuang-code/DCGSL)

METHODOLOGY
In this section, we introduce our data-centric approaches in two stages, data development and data assessment. In the former stage, we introduce the process of data collection, data labeling, and data construction. For the latter stage, we mainly explain the evaluation of graph structure learning.

Data Development
In the data development stage, we divide the process into data collection, data labeling, and data construction.We introduce each step in detail in this section.

Data Collection.
In recent years, large language models have attracted more and more attention. Related survey papers have also been continuously emerging in 2023. As shown in Figure 2, the trend has been increasing, with significant growth in March, July, and November of 2023. We scraped the metadata of survey papers about large language models from the arXiv website and further manually supplemented the dataset from Google Scholar. We updated the dataset weekly until November 30, 2023, and collected 112 survey papers for this study.
To better understand the collected papers, we present in Figure 3 the frequency of words used in the summary (abstract). These distributions suggest that the abstracts of these papers contain many similar terms, which increases the difficulty of text classification. Thus, the above observation motivates us to explore other methods, such as leveraging graph structure information, to classify the papers.

LLMs
Figure 4: The mind map of survey papers about large language models. Besides "Comprehensive" and "Others", which are not included in the mind map, we highlight thirteen categories in our proposed taxonomy. The total number of classes in the labels is fifteen.
Data Labeling.
After collecting the data, we further designed a new taxonomy and assigned each paper to the corresponding class. One benefit of providing the taxonomy is that it can help newcomers understand the hierarchy of concepts. The mind map of the proposed taxonomy is presented in Figure 4. We highlight thirteen classes in the mind map. The total number of classes in the labels is fifteen, including "Comprehensive" and "Others" (not presented in the mind map). To better understand the distribution of the classes, we present the class distribution in Figure 5. The distribution indicates that the classes are extremely imbalanced, introducing a challenge to this classification task.
Note that we prefer to propose a new taxonomy instead of using the arXiv categories since the arXiv categories cannot reflect the concept hierarchy for LLMs. To illustrate this point, we present the distribution of survey papers across different arXiv categories in Figure 6. The top-2 most frequent categories are "cs.CL" (Computation and Language) and "cs.AI" (Artificial Intelligence), which means that most papers fall into these two broad categories and thus cannot be further distinguished by the arXiv categories alone. Overall, we present the data description in Table 1.

Data Construction.
In the previous section, we explained the motivation for exploring graph structure learning. The goal of constructing attributed graphs is to utilize the graph structure information to classify survey papers into corresponding categories in the proposed taxonomy. Before constructing the graphs, we first define an attributed graph as follows.
Definition 1. An attributed graph G denotes a graph structure that represents topological connections E among a set of vertices V associated with attributes. The topological relationship among vertices in G(V, E) can be represented by a symmetric adjacency matrix A ∈ R^(N×N), where N is the number of vertices. Each vertex contains an attribute (feature) vector. All feature vectors constitute a feature matrix X ∈ R^(N×F), where F is the number of features for each vertex. Therefore, the matrix representation of G(V, E) can be defined as G(A, X).
Based on Definition 1, we start by creating the term frequency-inverse document frequency (TF-IDF) feature matrices for both the title and summary columns, where the term frequency denotes the word frequency in the document, and the inverse document frequency denotes the log-scaled inverse fraction of the number of documents containing the word. The TF-IDF matrix is commonly used for text classification tasks because it helps capture the distinctive words that can indicate specific classes. After establishing the TF-IDF matrices, we apply one-hot encoding to the arXiv categories and then combine the three matrices along the feature dimension to build the feature matrix X.
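The feature-matrix construction above can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline; the toy data and variable names are assumptions:

```python
# Sketch: build X by concatenating title TF-IDF, summary TF-IDF,
# and multi-hot arXiv-category encodings along the feature dimension.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative metadata for two papers (hypothetical values).
titles = ["A survey of large language models", "Prompting methods: a survey"]
summaries = ["We review large language models.", "We survey prompting techniques."]
categories = [["cs.CL", "cs.AI"], ["cs.CL"]]  # arXiv categories per paper

tfidf_title = TfidfVectorizer().fit_transform(titles).toarray()
tfidf_summary = TfidfVectorizer().fit_transform(summaries).toarray()
onehot_cats = MultiLabelBinarizer().fit_transform(categories)  # one column per category

# Concatenate the three matrices along the feature dimension to form X.
X = np.hstack([tfidf_title, tfidf_summary, onehot_cats])
```

Each row of `X` then serves as the attribute vector of one paper vertex in Definition 1.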
To leverage the topological information among vertices, we proceed to construct the graph structures that connect the attribute vectors. In this study, we are interested in three types of graphs: the text graph, the co-author graph, and the co-category graph.
To enhance text classification, Yao et al. [51] initially verified that long-distance lexical relationships can be effectively represented in a text graph. Thus, in this work, we follow the same settings as TextGCN [51] to build text graphs. Note that in the text graph, the aforementioned feature matrix remains unutilized, as only paper vertices contain attribute vectors. To retain consistency, all entries in the feature matrix are uniformly set to one. Correspondingly, only paper vertices are endowed with labels, while all word vertices are uniformly assigned a new class, which remains untouched throughout both the training and testing phases.
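A rough sketch of the text-graph construction, following the TextGCN setup described above but simplified: document-word edges are weighted by TF-IDF, and the word-word PMI edges that TextGCN also uses are omitted here for brevity. The data is illustrative:

```python
# Sketch: a TextGCN-style heterogeneous graph over document and word
# vertices. Vertex order: documents first, then words.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["large language models survey", "survey of prompting methods"]
tfidf = TfidfVectorizer().fit_transform(docs).toarray()  # (n_docs, n_words)
n_docs, n_words = tfidf.shape
n = n_docs + n_words

A = np.zeros((n, n))
A[:n_docs, n_docs:] = tfidf      # doc -> word edges weighted by TF-IDF
A[n_docs:, :n_docs] = tfidf.T    # word -> doc edges (undirected graph)
np.fill_diagonal(A, 1.0)         # self-loops

# As described above, node features are uninformative: all entries are
# set to one, and only document vertices would carry taxonomy labels.
X = np.ones((n, 1))
```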
Exploring the co-relationships among vertices is a common practice in graph structure learning [13]. In our dataset, two attributes, "Authors" and "Categories", can be utilized to explore such co-relationships, as these attributes exhibit inherent connections among survey papers. Thus, we build co-author graphs and co-category graphs using these two attributes. In the co-author graph, we introduce an edge connecting two vertices (papers) if they share at least one common author. In the co-category graph, an edge is added between two vertices that share at least one common category. In these two types of graphs, each vertex is assigned one class (from the taxonomy) as the label. Note that in this study all edges are undirected.
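The co-category edge rule above can be sketched directly; the category lists below are illustrative, not from the actual dataset:

```python
# Sketch: undirected edge between two papers iff they share at least
# one arXiv category (the co-author graph is analogous, over authors).
import numpy as np

categories = [["cs.CL", "cs.AI"], ["cs.CL"], ["cs.RO"]]
n = len(categories)
A = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(i + 1, n):
        if set(categories[i]) & set(categories[j]):  # at least one shared category
            A[i, j] = A[j, i] = 1                    # undirected edge

print(A)  # edge only between papers 0 and 1, which share "cs.CL"
```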
Besides constructing graphs, we compare the performances on text data, which includes both the title and abstract of survey papers.

Data Assessment
In this section, we mainly introduce how we evaluate the classification performance on our constructed attributed graphs. Moreover, we provide additional evaluations of the other three paradigms in the experiment section. After evaluating the data, we further visualize the graphs and store the datasets during the process.

Graph Structure Learning in Text Classification.
Given the well-built attributed graphs G(A, X), we aim to investigate whether data-centric graph structure learning using graph neural networks (GNNs) can help text classification. Before feeding the matrix representation, A and X, of the attributed graphs G into GNNs, we first preprocess the adjacency matrix A as follows:

Â = D̃^(−1/2) (A + I) D̃^(−1/2),

where I is an identity matrix and D̃_{i,i} = Σ_j (A + I)_{i,j} is a diagonal degree matrix.
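The preprocessing step above is the standard symmetric GCN normalization (self-loops added, then degree-normalized); a minimal sketch on a toy adjacency matrix:

```python
# Sketch: compute A_hat = D~^{-1/2} (A + I) D~^{-1/2} for a 3-node path graph.
import numpy as np

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_tilde = A + np.eye(A.shape[0])            # add self-loops
deg = A_tilde.sum(axis=1)                   # D~_ii = sum_j (A + I)_ij
D_inv_sqrt = np.diag(deg ** -0.5)
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
```

Because A is symmetric, Â stays symmetric, which keeps message passing direction-agnostic for the undirected graphs used here.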
After preprocessing, we utilize GNNs to learn node representations. Note that in this study, a node in the graphs could represent a word or a document. By doing so, we transform the text classification task into a node classification task. This transformation underscores the versatility of GNNs in handling diverse tasks. Within these tasks, the layer-wise message-passing mechanism of GNNs serves as a foundation to capture intricate relationships in graph-structured data. For a general expression, we formulate the layer-wise message-passing mechanism of GNNs as follows:

H^(l+1) = σ(Â H^(l) W^(l)),

where H^(l) is the node hidden representation in the l-th layer. The dimension of H^(l) in the input layer, middle layers, and output layer is the number of features F, the number of hidden units h, and the number of classes C, respectively. H^(0) = X. W^(l) is the weight matrix in the l-th layer. σ denotes a non-linear activation function, such as ReLU.
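A minimal NumPy sketch of the layer-wise rule above, with illustrative dimensions and random (untrained) weights; the identity stand-in for Â is an assumption to keep the sketch self-contained:

```python
# Sketch: two applications of H^(l+1) = relu(A_hat @ H^(l) @ W^(l)).
import numpy as np

rng = np.random.default_rng(0)
n, f, h, c = 4, 8, 16, 3                 # nodes, features F, hidden units h, classes C
A_hat = np.eye(n)                        # stand-in for the normalized adjacency
H0 = rng.normal(size=(n, f))             # H^(0) = X
W0 = rng.normal(size=(f, h))             # input-layer weights
W1 = rng.normal(size=(h, c))             # output-layer weights

relu = lambda x: np.maximum(x, 0)
H1 = relu(A_hat @ H0 @ W0)               # hidden representation, shape (n, h)
logits = A_hat @ H1 @ W1                 # output layer: one score per class
```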
In general node classification tasks, a GNN is trained with ground-truth labels Y ∈ R^(N×1). In this study, we build the ground-truth labels based on our proposed taxonomy. To simplify the problem, each paper is assigned one primary category as the label, even if the paper may sometimes belong to more than one category. During training, we optimize GNNs with the cross-entropy loss as follows:

L = − Σ_{i=1}^{N_train} Σ_{c=1}^{C} Y_{i,c} log ŷ_{i,c},

where Y_{i,c} denotes the ground-truth label of the i-th node in the c-th class; ŷ_{i,c} denotes the predicted probability of the i-th node in the c-th class; N_train denotes the number of training nodes; C denotes the number of classes. The i-th predicted label ŷ_i is computed by choosing the class with the maximum probability of the corresponding categorical distribution.
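The training objective above can be sketched numerically; the logits and labels here are toy values, not model outputs:

```python
# Sketch: cross-entropy over training nodes, L = -sum_i sum_c Y_ic * log(p_ic),
# with probabilities p obtained from a softmax over the class logits.
import numpy as np

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5,  0.3]])          # two train nodes, three classes
Y = np.array([[1, 0, 0],
              [0, 1, 0]])                      # one-hot ground-truth labels

p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
loss = -(Y * np.log(p)).sum()
y_pred = p.argmax(axis=1)                      # predicted label = argmax probability
print(y_pred)                                  # [0 1]
```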
In brief, we formalize the problem that we aim to solve via data-centric graph structure learning in this study as follows.
Problem 1. After constructing an attributed graph G(Â, X) and ground-truth labels Y, we train a graph neural network (GNN) on the training data and evaluate the classification performance on the test data. Our goal is to design a data-centric method that can help robustly classify the text data.

EXPERIMENTS
In this section, we verify the effectiveness and robustness of graph structure learning for text classification on our dataset. We further examine its superior performance over the other three paradigms. We investigate three types of attributed graphs, text graphs, co-author graphs, and co-category graphs, for graph structure learning and present their statistics in Table 3. Note that the text graph consists of paper vertices and word vertices, and thus contains 16 classes, because all word vertices are labeled as a new class that is not touched during the training or testing phase. In the comparative analysis, we examine classic machine-learning algorithms on the above feature matrix and evaluate the language models on the text data, which contains both the title and summary.

Experimental Settings
To validate our methods, we split the data into 60% train, 20% validation, and 20% test. After the split, we are aware that different splits can highly affect the performance on such a small dataset. Therefore, we ran the experiments five times using random seed IDs from 0 to 4 and reported the mean values with the corresponding standard deviations, mean (std).
We evaluate the classification performance by accuracy and weighted F1 score. Accuracy is a common metric for classification tasks, whereas the weighted F1 score provides a balanced measure on the class-imbalanced dataset.
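Both metrics are available in scikit-learn; a toy sketch on an imbalanced label set (the labels below are illustrative):

```python
# Sketch: accuracy and weighted F1, where the weighted variant averages
# per-class F1 scores weighted by each class's support (sample count).
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 1, 2]          # imbalanced toy labels
y_pred = [0, 0, 1, 1, 2]

acc = accuracy_score(y_true, y_pred)
wf1 = f1_score(y_true, y_pred, average="weighted")
print(round(acc, 2))               # 0.8
```

On imbalanced data, the weighted average keeps a rare class's poor F1 from being masked entirely while still reflecting the dominant classes.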

Data-centric Graph Structure Learning Can Help Text Classification.
We investigate whether leveraging the graph structure information can robustly help classify the text data in our dataset.In this experiment, we build graph structures based on the text data (including the title and summary) and the relationship of co-author and co-category.After building the graphs, we examine various graph structures on four classic graph neural networks, GCN [23], GraphSAGE [13], GIN [49], and TAGCN [11].
According to the results in Table 2, four GNNs fail to learn graph representation on both the text graph and the co-author graph.
For the text graph, we argue that the degradation of GNNs may be caused by excessively similar words in the summary of survey papers.When constructing the text graph, these word vertices connect with many paper vertices, resulting in the paper vertices being less distinguishable.For the co-author graph, we conjecture that it is challenging to categorize papers solely based on the sparse co-authorship in this dataset.
On the contrary, the four GNNs can achieve strong performance (evaluated by both accuracy and weighted F1 score) on most co-category graphs. We conducted an ablation study to examine various graph structures of co-category graphs. First, according to Figure 6, most papers are assigned to "cs.CL" and "cs.AI" in the arXiv categories. Thus, we study how the categories "cs.CL" and "cs.AI" affect the performance by muting these two categories in a combinatorial manner. In Table 2, we observe that GNNs can maintain comparable performance after removing either "cs.CL" or "cs.AI". However, the performance dramatically drops after removing both categories. This is plausible since most node connections are significantly sparsified after these two categories are removed. In other words, even though neither "cs.CL" nor "cs.AI" directly maps to the existing classes, either one can help connect the nodes and further strengthen the message-passing mechanism in GNNs, allowing GNNs to learn better node representations.
We visualize the co-category graphs in Figure 7. The visualization indicates that most nodes, such as the nodes colored red or pink, are clustered well even if we remove either the category "cs.CL" or "cs.AI". However, after removing these two categories simultaneously, we observe that the node clusters gradually become disordered and many nodes become isolated. This visualization helps us intuitively understand the effectiveness of graph structure learning.
We also visualize GCNs' hidden representations on the above co-category graphs in Figure 8, which shows that nodes are well-classified in the hidden space even if either the category "cs.CL" or "cs.AI" is removed. However, the distribution of nodes tends to become chaotic when these two categories are removed simultaneously. The visualization verifies the experimental results in Table 2.
To further assess the robustness of graph structure learning, we conducted another ablation study to examine how the categories "cs.IR", "cs.SE", and "cs.RO" affect the classification performance, as their names are similar to those of some classes in our proposed taxonomy. Note that our proposed taxonomy is not based on the arXiv categories. According to Table 2, the performance is well-maintained no matter which category is removed. We argue that these results are reasonable since the removals only drop a small number of edges and do not break the topological connections in the graph.
Besides examining various graph structures, we compare the performance of graph structure learning under different noise ratios (nr) in the train labels. Even though it is expected that the classification accuracy decreases as the noise ratio increases, the results in Figure 9 indicate that learning through co-category graphs can achieve robust performance across different noise ratios and stably outperform the other two graph structures. Overall, the above experiments verify the robustness of graph structure learning on co-category graphs.
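The text does not specify how the noisy train labels are generated; a common scheme, shown here purely as an assumption, is symmetric noise in which a fraction nr of labels is flipped to a random different class:

```python
# Sketch (assumed noise model): flip nr * N randomly chosen train labels
# to a uniformly chosen *different* class.
import numpy as np

rng = np.random.default_rng(0)

def add_label_noise(y, nr, num_classes, rng):
    y = y.copy()
    n_noisy = int(round(nr * len(y)))
    idx = rng.choice(len(y), size=n_noisy, replace=False)  # nodes to corrupt
    for i in idx:
        choices = [c for c in range(num_classes) if c != y[i]]
        y[i] = rng.choice(choices)                         # flip to a different class
    return y

y_train = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
y_noisy = add_label_noise(y_train, nr=0.3, num_classes=3, rng=rng)
print((y_noisy != y_train).sum())           # 3 labels flipped
```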

Comparative Analysis
After verifying the effectiveness of graph structure learning, we further investigate the performance of several classic models from three other paradigms on the classification tasks. Specifically, we first employ classic machine-learning algorithms on the feature matrix (without leveraging the topological relationships). Second, we fine-tune pre-trained language models on the text data for the downstream classification tasks. Third, we evaluate the zero-shot/few-shot classification capabilities of large language models. We first examine four classic machine-learning algorithms, Naïve Bayes Classifiers (NB), Support Vector Machines (SVM), Random Forest (RF), and Gradient Boosting (GB), and present the results of the first paradigm in Table 4. The results indicate that these machine-learning algorithms cannot perform well on this task.
Second, we examine whether fine-tuning pre-trained language models on the text data can help achieve better classification. The results in Table 5 indicate that medium-sized language models, such as DistilBERT [41], can achieve better performance on smaller text data. However, the performance may dramatically drop when the model size is too small, as with Albert [29]. We argue that fine-tuning larger pre-trained language models, such as Llama2 [44], on smaller text data may cause overfitting issues, which leads to worse performance for larger models.
We further investigate whether leveraging noisy labels can help fine-tuning. Our previous experiments confirmed that graph structure learning can outperform fine-tuning in this classification task. Thus, we first generate noisy labels with GCN and then fine-tune the pre-trained language models on the noisy labels. The results in Figure 10 indicate that for some models, the performance achieved through training with noisy labels can surpass that of training with ground-truth labels. One possible reason is that training the model using noisy labels with a low noise ratio can act as a kind of regularization, improving the classification performance [57]. Third, we evaluate the zero-shot and few-shot classification capabilities of three large language models: Claude, ChatGPT 3.5, and ChatGPT 4. We ran the experiments five times and present the mean values with the corresponding standard deviations in Table 6. Among the large language models, ChatGPT 3.5 outperforms the other two models given that all models have not seen the data before (zero-shot). We further provide some hints to the models before classification (few-shot). For example, we release the keywords of the class "Trustworthy" to the models before classification. In this setting, both ChatGPT 3.5 and ChatGPT 4 achieve higher accuracy and weighted F1 scores after obtaining the hints. Overall, all three LLMs cannot outperform graph structure learning, which reveals that LLMs still have room to improve in this task.

Limitation
The experimental results in this study have demonstrated the effectiveness of leveraging graph structure information to classify the survey papers. However, constructing a graph structure may encounter certain constraints. For instance, we build co-category graphs based on the arXiv categories. When papers come from distinct fields, such as biology, physics, and computer science, the graph structure may be very sparse, weakening the effectiveness of graph structure learning.

Future Directions
In the future, one primary direction extending from this study is to tailor GPT-based applications to assist readers in understanding survey papers more effectively. In addition, our collected datasets can potentially contribute to node alignment tasks, which involve the alignment of nodes across one or more graphs, such as the co-category graphs and co-author graphs in this study.

CONCLUSION
In this study, we investigated data-centric approaches that can help text classification on class-imbalanced datasets that contain similar textual information. To build such a dataset, we collected the metadata of 112 LLM survey papers. In the experiments, we conduct a comparative analysis across four paradigms and demonstrate that graph structure learning outperforms conventional machine-learning algorithms, fine-tuning of pre-trained language models, and zero-shot/few-shot classification using LLMs. Within graph structure learning, we explore three types of attributed graphs, the text graph, the co-author graph, and the co-category graph, and observe that leveraging arXiv's co-category information can help robustly classify the text data in our dataset.

Figure 1 :
Figure 1: The overall process of our data-centric approaches. The arrow points to the next step in the workflow.

Figure 2 :
Figure 2: Trends of survey papers on large language models. We focus on the trends by the first released date.

Figure 3 :
Figure 3: The 30 most frequently occurring noun (left) and verb (right) keywords in the abstract.
[Figure 5 axis labels: Trustworthy, Comprehensive, Prompting, Bioinformatics, Multi-modal & Pre-training, RecSys & IR, Adaptation Tuning, Robotics, Graphs, Software Engineering, Others, Evaluation, Education, Law, Finance]

Figure 5 :
Figure 5: Distribution of classes in the proposed taxonomy.

Figure 6 :
Figure 6: Distribution of arXiv categories in our dataset.

Figure 7 :
Figure 7: Visualization of co-category graphs.We visualize the graphs by muting the categories.

Figure 8 :
Figure 8: Visualization of GCNs' hidden representations on co-category graphs in two dimensions via t-SNE. Each dot represents one node and is labeled with one color.

Figure 9 :
Figure 9: Comparison of three types of graph structures under different noise ratios (nr).

Figure 10 :
Figure 10: Comparison of fine-tuning the language models using ground-truth labels and noisy labels.

Table 1 :
Descriptions of data attributes in our dataset.

Table 2 :
Evaluation of data-centric graph structure learning on three types of attributed graphs. We also conduct an ablation study on the graph structure of co-category graphs. Rm denotes "Removed".

Table 4 :
Evaluation of classic machine learning algorithms on the feature matrix.We denote NB, SVM, RF, and GB as Naïve Bayes Classifiers, Support Vector Machines, Random Forest, and Gradient Boosting, respectively.

Table 5 :
Evaluation of fine-tuning the pre-trained language models on the text data.

Table 6 :
Evaluation of zero-shot and few-shot classification capabilities of three large language models, Claude, ChatGPT 3.5, and ChatGPT 4.