CARL-G: Clustering-Accelerated Representation Learning on Graphs

Self-supervised learning on graphs has made large strides in achieving great performance in various downstream tasks. However, many state-of-the-art methods suffer from a number of impediments that prevent them from realizing their full potential. For instance, contrastive methods typically require negative sampling, which is often computationally costly. While non-contrastive methods avoid this expensive step, most existing methods either rely on overly complex architectures or dataset-specific augmentations. In this paper, we ask: Can we borrow from classical unsupervised machine learning literature in order to overcome these obstacles? We are guided by our key insight that the goal of distance-based clustering closely resembles that of contrastive learning: both attempt to pull representations of similar items together and dissimilar items apart. As a result, we propose CARL-G, a novel clustering-based framework for graph representation learning that uses a loss inspired by Cluster Validation Indices (CVIs), i.e., internal measures of cluster quality (no ground truth required). CARL-G is adaptable to different clustering methods and CVIs, and we show that with the right choice of clustering method and CVI, CARL-G outperforms node classification baselines on 4/5 datasets with up to a 79x training speedup compared to the best-performing baseline. CARL-G also performs at par or better than baselines in node clustering and similarity search tasks, training up to 1,500x faster than the best-performing baseline. Finally, we provide theoretical foundations for the use of CVI-inspired losses in graph representation learning.

Most of these existing graph SSL methods can be grouped into either contrastive or non-contrastive SSL. Contrastive learning pulls the representations of similar ("positive") samples together and pushes the representations of dissimilar ("negative") samples apart. In the case of graphs, this often means either pulling the representations of a node and its neighbors together [22] or pulling the representations of the same node across two different augmentations together [71]. Graph contrastive learning methods typically use non-neighbors as negative samples [68, 79], which can be costly. Non-contrastive learning [4, 34, 54, 60] avoids this step by only pulling positive samples together while employing strategies to avoid collapse.
However, all of these methods have some key limitations. Contrastive methods rely on a negative sampling step, which has an expensive quadratic runtime [60] and requires careful tuning [66]. Non-contrastive methods often have complex architectures (e.g., an extra encoder with exponentially updated weights [31, 34, 60]) and/or rely heavily on augmentations [4, 60, 72, 75]. Lee et al. [34] show that augmentations can change the semantics of underlying graphs, especially in the case of molecular graphs (where perturbing a single edge can create an invalid molecule).
Upon further inspection, we observe that the behavior of contrastive and non-contrastive methods is somewhat similar to that of distance-based clustering [64]: both attempt to pull together similar nodes/samples and push apart dissimilar ones. The primary advantage of using clustering over negative sampling is that we can work directly in the smaller embedding space, preventing expensive negative sampling over the graph. Furthermore, there have been decades of research exploring the theoretical foundations of clustering methods, and many different metrics have been proposed to evaluate the quality of clusters [7, 15, 32, 47]. These metrics have been dubbed Cluster Validation Indices (CVIs) [2]. In this work, we ask the following question: Can we leverage well-established clustering methods and CVIs to create a flexible, fast, and effective GNN framework to learn node representations without any labels?
It is worth emphasizing that our goal is not node clustering directly; it is self-supervised graph representation learning. The goal is to develop a general framework that is capable of learning node embeddings for various tasks, including node classification, node clustering, and node similarity search. There exists some similar work. DeepCluster [8] trains a Convolutional Neural Network (CNN) with a supervised loss on pseudo-labels generated by a clustering method to learn image embeddings.
AFGRL [34] uses clustering to select positive samples in lieu of augmentations for graph representation learning and applies the general BGRL [60] framework to push those representations together. SCD [57] searches over different hyperparameters to obtain a clustering where the silhouette score is maximized. However, to the best of our knowledge, there is no existing work in self-supervised representation learning that directly optimizes for CVIs, which, as we elaborate below, presents tremendous potential for advancing and accelerating the state of the art.
We fill this gap by proposing the novel idea of using Cluster Validation Indices (CVIs) directly as the loss function for a neural network. In conjunction with advances in clustering methods [35, 43, 51, 52], CVIs have been refined over the years as measures of cluster quality after performing clustering and have been shown, over almost five decades, to be effective for this purpose [2, 7, 47, 50].
Our proposed method, CARL-G, has several advantages over existing graph SSL methods by virtue of its CVI-inspired losses. First, CARL-G generally outperforms other graph SSL methods in node classification, node clustering, and node similarity search tasks (see Tables 2, 5, and 6). Second, CARL-G does not require graph augmentations, which are required by many existing graph contrastive and non-contrastive methods [4, 31, 34, 60, 72, 79] and can inadvertently alter graph semantics [34]. Third, CARL-G has a relatively simple architecture compared to the dual-encoder architecture of leading non-contrastive methods [31, 34, 60], drastically reducing the memory cost of our framework. Fourth, we provide theoretical analysis that shows the equivalence of some CVI-based losses and a well-established (albeit expensive) contrastive loss. Finally, CARL-G is sub-quadratic with respect to the size of the graph and much faster than the baselines, with up to a 79× speedup on Coauthor-CS over BGRL [60] (the best-performing baseline).
Our contributions can be summarized as follows:
• We propose CARL-G, the first (to the best of our knowledge) framework to use a Cluster Validation Index (CVI) as a neural network loss function.
• We propose 3 variants of CARL-G based on different CVIs, each with its own advantages and drawbacks.
• We extensively evaluate CARL-G_sim, the best all-around performer, across 5 datasets and 3 downstream tasks, where it generally outperforms baselines.
• We provide theoretical insight into CARL-G_sim's success.
• We benchmark CARL-G_sim against 4 state-of-the-art models and show that it is up to 79× faster with half the memory consumption (with the same encoder size) compared to the best-performing node classification baseline.

PRELIMINARIES
Notation. We denote a graph as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of $n$ nodes (i.e., $n = |\mathcal{V}|$) and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges. We denote the node-wise feature matrix as $X \in \mathbb{R}^{n \times f}$, where $f$ is the dimension of the raw node features and the $i$-th row $x_i$ is the feature vector of the $i$-th node. We denote the binary adjacency matrix as $A \in \{0, 1\}^{n \times n}$ and the learned node representations as $H \in \mathbb{R}^{n \times d}$, where $d$ is the size of the latent dimension and $h_i$ is the representation of the $i$-th node. Let $\mathcal{N}(v)$ be a function that returns the set of neighbors of a given node $v$ (i.e., $\mathcal{N}(v) = \{u \mid (u, v) \in \mathcal{E}\}$).

CVI-based loss
Let $\mathcal{C}$ be the set of clusters, $\mathcal{C}_j \subseteq \mathcal{V}$ be the set of nodes in the $j$-th cluster, and $k = |\mathcal{C}|$ be the number of clusters. For ease of notation, let $U \in [1, k]^n$ be the set of cluster assignments, where $U_i$ is the cluster assignment of node $i$. Let $\mu = \frac{1}{n} \sum_{i \in \mathcal{V}} z_i$ be the global centroid and $\mu_j = \frac{1}{|\mathcal{C}_j|} \sum_{i \in \mathcal{C}_j} z_i$ be the centroid of the $j$-th cluster, where $z_i$ denotes the representation of node $i$ being clustered.

Graph Neural Networks
A Graph Neural Network (GNN) [22, 33, 74] typically performs message-passing along the edges of the graph. Each iteration of the GNN can be described as follows [49]:
$$h_v^{(l+1)} = \text{Update}\left(h_v^{(l)}, \text{Aggregate}\left(\{h_u^{(l)} : u \in \mathcal{N}(v)\}\right)\right),$$
where Update and Aggregate are differentiable functions and $h_v^{(0)} = x_v$. In this work, we opt for simplicity and use Graph Convolutional Networks (GCNs) [33] as the default GNN. These are GNNs where Update consists of a single MLP layer and Aggregate is the mean of a node's representation with those of its neighbors. Formally, each iteration of the GCN can be written as:
$$H^{(l+1)} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right),$$
where $\hat{A} = A + I$ is the adjacency matrix with self-loops, $\hat{D}$ is its diagonal degree matrix, $W^{(l)}$ is a learnable weight matrix, and $\sigma$ is a non-linear activation.
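As a concrete illustration, the sketch below builds a two-layer GCN encoder of the kind described above using PyTorch Geometric (which the implementation in Section 4.4 is based on). The layer sizes and the PReLU activation are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Minimal two-layer GCN encoder sketch (PyTorch Geometric).
# Layer sizes and the PReLU activation are illustrative assumptions.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class GCNEncoder(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int = 256, out_dim: int = 256):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)  # Update: linear layer; Aggregate: normalized neighborhood mean
        self.conv2 = GCNConv(hid_dim, out_dim)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        h = self.act(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)       # node embeddings H
```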

Cluster Validation Indices
Clustering is a class of unsupervised methods that aims to partition the input space into multiple groups, known as clusters. The goal of clustering is generally to maximize the similarity of points within each cluster while minimizing the similarity of points between clusters [64]. In this work, we focus on centroid-based clustering algorithms like $k$-means [37] and $k$-medoids [40].
Cluster Validation Indices (CVIs) [2] estimate the quality of a partition (i.e., a clustering) by measuring the compactness and separation of the clusters without knowledge of the ground-truth clustering. Note that these are different from metrics like Normalized Mutual Information (NMI) [12] or the Rand Index [46], which require ground-truth cluster labels. Many different CVIs have been proposed over the years and extensively evaluated [2, 50].
Arbelaitz et al. [2] extensively evaluated 30 different CVIs over a wide variety of datasets and found that the Silhouette [47], Davies-Bouldin* [32], and Calinski-Harabasz (also known as the VRC: Variance Ratio Criterion) [7] indices perform best across 720 different synthetic datasets. The VRC has also been shown to be effective in determining the number of clusters for clustering methods [7, 16, 38, 50]. As such, we focus on the silhouette index (the best-performing CVI) and the VRC (an effective CVI, especially for choosing the number of clusters) in this work.
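For intuition, both of these CVIs are available off the shelf in scikit-learn; the toy snippet below (with random data and an arbitrary choice of $k$) shows how a candidate clustering can be scored with each index.

```python
# Scoring a clustering with two CVIs from scikit-learn.
# The random data and k=10 are purely illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

Z = np.random.randn(1000, 64)                 # stand-in for node embeddings
labels = KMeans(n_clusters=10, n_init=10).fit_predict(Z)

sil = silhouette_score(Z, labels)             # in [-1, 1], higher is better
vrc = calinski_harabasz_score(Z, labels)      # unbounded, higher is better
print(f"Silhouette: {sil:.3f}, VRC: {vrc:.1f}")
```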

Silhouette.
The silhouette index compares each point's intra-cluster distance with its inter-cluster distance to the nearest neighboring cluster. It returns a value in $[-1, 1]$, where a value closer to 1 signifies a more desirable and better-distinguishable clustering. The silhouette index [47] is defined as $\text{Sil}(\mathcal{C}) = \frac{1}{n} \sum_{i \in \mathcal{V}} s(i)$, where
$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}, \qquad a(i) = \frac{1}{|\mathcal{C}_{U_i}| - 1} \sum_{j \in \mathcal{C}_{U_i},\, j \neq i} \text{Dist}(z_i, z_j), \qquad b(i) = \min_{c \neq U_i} \frac{1}{|\mathcal{C}_c|} \sum_{j \in \mathcal{C}_c} \text{Dist}(z_i, z_j).$$
The runtime of computing the silhouette index for a given node is $O(n)$, which can be expensive if calculated over all nodes. We discuss this issue and a modified solution later in Section 3.1.

Variance Ratio Criterion.
The VRC [7] computes the ratio of the inter-cluster (between-cluster) variance to the intra-cluster (within-cluster) variance. The intra-cluster variance is based on the distances of each point to its centroid, and the inter-cluster variance is based on the distance from each cluster centroid to the global centroid. Formally,
$$\text{VRC}(\mathcal{C}) = \frac{n - k}{k - 1} \cdot \frac{\sum_{j=1}^{k} |\mathcal{C}_j| \, \text{Dist}(\mu_j, \mu)^2}{\sum_{j=1}^{k} \sum_{i \in \mathcal{C}_j} \text{Dist}(z_i, \mu_j)^2}.$$
For the purposes of this paper, we use Euclidean distance, i.e., $\text{Dist}(a, b) = \|a - b\|_2$.
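To make the notation concrete, a minimal NumPy sketch of the VRC, assuming Euclidean distance and the centroid definitions from Section 2, might look like the following; it mirrors the formula above rather than any particular library implementation.

```python
# Direct NumPy computation of the VRC.
# Z: (n, d) embeddings; labels: (n,) cluster assignments in {0, ..., k-1}.
import numpy as np

def vrc(Z: np.ndarray, labels: np.ndarray) -> float:
    n, k = Z.shape[0], int(labels.max()) + 1
    mu = Z.mean(axis=0)                              # global centroid
    between, within = 0.0, 0.0
    for j in range(k):
        Cj = Z[labels == j]
        mu_j = Cj.mean(axis=0)                       # cluster centroid
        between += len(Cj) * np.sum((mu_j - mu) ** 2)
        within += np.sum((Cj - mu_j) ** 2)
    return (between / (k - 1)) / (within / (n - k))
```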
3 PROPOSED METHOD

We propose CARL-G, which consists of three main steps (Figure 2). First, a GNN encoder Enc(·) takes the graph as input and produces node embeddings $H = \text{Enc}(X, A)$. Next, a multilayer perceptron (MLP) predictor network Pred(·) takes the GNN embeddings and produces a second set of node embeddings $Z = \text{Pred}(H)$. We then run a clustering algorithm (e.g., $k$-means) on $Z$ to produce a set of clusters $\mathcal{C}$. It is worth noting that the clustering algorithm does not have to be differentiable. Finally, we compute a cluster validation index (CVI) on the cluster assignments and backpropagate to update the encoder's and predictor's parameters. After training, only the GNN encoder Enc(·) and its produced embeddings $H$ are used to perform downstream tasks, and the predictor network Pred(·) is discarded (similar to the prediction heads in non-contrastive learning work [31, 34, 60]).
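The following sketch summarizes one training step of this pipeline. The names `encoder`, `predictor`, and `cvi_loss` are placeholders, and details such as the optimizer and the single-initialization $k$-means call are assumptions for illustration.

```python
# One training step of the CARL-G pipeline (sketch).
# `encoder`, `predictor`, and `cvi_loss` are placeholders; the clustering
# step itself is not differentiated through.
import torch
from sklearn.cluster import KMeans

def train_step(encoder, predictor, optimizer, data, k: int, cvi_loss):
    encoder.train(); predictor.train()
    H = encoder(data.x, data.edge_index)          # GNN embeddings used downstream
    Z = predictor(H)                              # MLP-predicted embeddings to cluster
    # Cluster a detached copy: k-means only provides assignments, no gradients.
    assignments = KMeans(n_clusters=k, n_init=1).fit_predict(
        Z.detach().cpu().numpy())
    assignments = torch.as_tensor(assignments, device=Z.device, dtype=torch.long)
    loss = cvi_loss(Z, assignments)
    optimizer.zero_grad()
    loss.backward()                               # gradients flow through Z and H
    optimizer.step()
    return loss.item()
```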

Training CARL-G
As aforementioned in Section 2.2, we evaluate the silhouette index [47] and the VRC [7] as learning objectives. In order to use them effectively, we must make slight modifications to the loss functions. First, we must negate the functions, since a higher score is better for both CVIs and we typically want to minimize a loss function. Second, while $\text{Sil}(\mathcal{C}) = 1$ and $\text{VRC}(\mathcal{C}) \to \infty$ are theoretically ideal, we find this is generally not true in practice. This is because the clustering method may miscluster some nodes, and fully maximizing the CVIs would push the misclustered representations too close together, negatively impacting a classifier's ability to distinguish them. To bound the maximum values of our loss, we add $t_{\text{Sil}}$ and $t_{\text{VRC}}$, the target silhouette and VRC indices, respectively; the silhouette-based loss $\mathcal{L}_{\text{Sil}}$ and the VRC-based loss $\mathcal{L}_{\text{VRC}}$ then penalize the deviation of the corresponding CVI from its target. Upon careful inspection of the silhouette definition in Section 2.2.1, we can observe that its computational complexity is $O(n^2)$, while the complexity of the VRC is only $O(nk)$, where $k \ll n$. This is a critical weakness of the silhouette, especially when the goal is to avoid a quadratic runtime (the typical drawback of contrastive methods). Backpropagating on this loss function would also result in quadratic memory usage, because we have to store the gradients for each operation. To resolve this issue, we leverage the simplified silhouette [26], which instead uses centroid distances. The simplified silhouette has been shown to have competitive performance with the original silhouette [62] while being much faster, running in $O(nk)$ time. As such, we also try the simplified silhouette, which can be written as
$$s'(i) = \frac{b'(i) - a'(i)}{\max(a'(i), b'(i))},$$
where $c_i = U_i$ is the cluster assignment of node $i$, $a'(i) = \text{Dist}(z_i, \mu_{c_i})$ is the distance from node $i$ to its own cluster centroid, and $b'(i) = \min_{j \neq c_i} \text{Dist}(z_i, \mu_j)$ is its distance to the nearest other centroid. We use the same loss function as $\mathcal{L}_{\text{Sil}}$, simply substituting $s'(i)$ for $s(i)$ (see Section 2.2.1), and name it $\mathcal{L}_{\text{sim}}$.
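A minimal PyTorch sketch of a simplified-silhouette-based loss is shown below. The squared deviation from the target $t_{\text{Sil}}$ is an assumed form of the bounded loss, and the scatter-based centroid computation is one possible implementation; only the embeddings receive gradients, not the cluster assignments.

```python
# Simplified-silhouette loss sketch (differentiable w.r.t. the embeddings Z).
# The squared deviation from the target t_sil is an assumed form of the loss.
import torch

def simplified_silhouette_loss(Z: torch.Tensor, assignments: torch.Tensor,
                               k: int, t_sil: float = 0.9) -> torch.Tensor:
    d = Z.shape[1]
    # Cluster centroids via index_add (differentiable in Z).
    counts = torch.zeros(k, device=Z.device).index_add_(
        0, assignments, torch.ones_like(assignments, dtype=Z.dtype))
    centroids = torch.zeros(k, d, device=Z.device).index_add_(0, assignments, Z)
    centroids = centroids / counts.clamp(min=1).unsqueeze(1)

    dists = torch.cdist(Z, centroids)                            # (n, k) node-to-centroid distances
    a = dists.gather(1, assignments.unsqueeze(1)).squeeze(1)     # distance to own centroid
    masked = dists.scatter(1, assignments.unsqueeze(1), float("inf"))
    b = masked.min(dim=1).values                                 # distance to nearest other centroid
    s = (b - a) / torch.max(a, b).clamp(min=1e-8)                # simplified silhouette per node
    return (t_sil - s.mean()) ** 2
```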

Clustering Method
$k$-means. We primarily focus on $k$-means clustering for this framework due to its fast linear runtime (although we also briefly explore using $k$-medoids in Section 4.3.2 below). The goal of $k$-means is to minimize the sum of squared errors, also known as the inertia or within-cluster sum of squares. Formally, this can be written as
$$\min_{\mathcal{C}} \sum_{j=1}^{k} \sum_{i \in \mathcal{C}_j} \|z_i - \mu_j\|_2^2.$$
Finding the optimal solution to this problem is NP-hard [13], but efficient approximation algorithms [36, 52] have been developed that return an approximate solution in linear time (see Section 5).
While $k$-means is fast, it is known to be heavily dependent on its initial centroid locations [3, 6], which can be partially addressed via repeated re-initialization, picking the clustering that minimizes the inertia.

Poor initialization is typically not a large issue in most $k$-means use cases, since the end goal is usually to compute a single clustering, so we can simply repeat and re-initialize until we are satisfied. However, since we generate a new clustering once per epoch in CARL-G, poor initialization can result in a large amount of variance between epochs.

To minimize the chance of poor centroid initialization during training, we carry the cluster centroids over between epochs. The centroids still update naturally after running $k$-means, since the embeddings $Z$ change each epoch (after backpropagation with the CVI-based loss).
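A small sketch of this carry-over using scikit-learn's `KMeans` (whose `init=` argument accepts an array of starting centroids) is shown below; the `n_init` values are illustrative.

```python
# Carrying k-means centroids across epochs (sketch).
# First epoch: k-means++ initialization; afterwards: warm-start from the
# previous epoch's centroids with a single initialization run.
import numpy as np
from sklearn.cluster import KMeans

def cluster_with_carryover(Z: np.ndarray, k: int, prev_centroids=None):
    if prev_centroids is None:
        km = KMeans(n_clusters=k, init="k-means++", n_init=10)
    else:
        km = KMeans(n_clusters=k, init=prev_centroids, n_init=1)
    labels = km.fit_predict(Z)
    return labels, km.cluster_centers_
```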

Theoretical Analysis
To gain a theoretical understanding of why our framework works, we compare it to the margin loss, a fundamental contrastive loss function that has been shown to work well for self-supervised representation learning [22, 68]. We show that the CVI-based loss (especially the silhouette loss) is similar to the margin-based loss, which intuitively explains its success. In addition, we show that the CVI-based loss has the advantages of (a) lower sensitivity to graph structure and (b) no negative sampling required.

Similarity analysis of CVI-based loss and margin loss. Both the margin loss and the CVI-based loss fundamentally consist of two terms: one measuring the distance between neighbors/intra-cluster points and one measuring the distance between non-neighbors/inter-cluster points. This similarity allows us to analyze basic versions of our proposed silhouette loss and the margin loss in the context of node classification and show that they are identical in their ideal cases. We further analyze the sensitivity of these losses with respect to various parameters of the graph to examine the advantages and disadvantages of our proposed method. To do this, we first define the mean silhouette and margin loss functions.

Definition 3.1 (Mean Silhouette). We define the mean silhouette loss $\ell_{\text{ms}}(i)$ by removing the hyperparameter $t_{\text{Sil}}$, focusing on the numerator of the silhouette (un-normalizing the index), and replacing the min over other clusters with a mean, so that $\ell_{\text{ms}}(i) = a(i) - \bar{b}(i)$, where $a(i)$ is the intra-cluster distance of node $i$ (as in Section 2.2.1) and $\bar{b}(i)$ is its mean distance to the other clusters.

Definition 3.2 (Margin Loss). We define the margin loss $\text{ml}(i)$ as the expected distance from node $i$ to its neighbors minus the expected distance from $i$ to its non-neighbors. It is worth noting that this margin loss differs from the max-margin loss traditionally used in graph SSL [68]; we simplify it in Definition 3.2 by removing the max function for ease of analysis.
Let $\mathcal{L}$ be the set of true class labels and $\mathcal{L}_i$ be the class label of a node $i \in \mathcal{V}$. We define the expected inter-class and intra-class distances as $\mathbb{E}[\text{Dist}(z_i, z_j) \mid \mathcal{L}_i \neq \mathcal{L}_j] = \alpha$ and $\mathbb{E}[\text{Dist}(z_i, z_j) \mid \mathcal{L}_i = \mathcal{L}_j] = \beta$, where $\alpha, \beta \in \mathbb{R}^+$. Next, we assume $\mathcal{G}$ follows a stochastic block model with a probability matrix $P \in [0, 1]^{|\mathcal{L}| \times |\mathcal{L}|}$ of the form
$$P_{ab} = \begin{cases} p & \text{if } a = b, \\ q & \text{otherwise}, \end{cases}$$
i.e., an edge exists between two nodes of the same class with probability $p$ and between two nodes of different classes with probability $q$. Note that $q$ does not necessarily equal $1 - p$. Finally, we define the inter-class clustering error rate $\epsilon_e$ (the probability that two nodes of different classes are placed in the same cluster) and the intra-class clustering error rate $\epsilon_a$ (the probability that two nodes of the same class are placed in different clusters).

Claim 1. Given the above assumptions, the expected value of the simplified silhouette loss approaches that of the margin loss as $p \to 1$, $q \to 0$, and $\epsilon_a, \epsilon_e \to 0$.
Proof (sketch). To find $\mathbb{E}[\ell_{\text{ms}}(i)]$, we first compute $\mathbb{E}[a(i)]$ and $\mathbb{E}[\bar{b}(i)]$ in terms of $\alpha$, $\beta$, $\epsilon_a$, and $\epsilon_e$, and then take the limit as $\epsilon_a, \epsilon_e \to 0$: in this limit, every cluster contains exactly one class, so the intra-cluster term tends to the expected intra-class distance $\beta$ and the inter-cluster term tends to the expected inter-class distance $\alpha$. To find $\mathbb{E}[\text{ml}(i)]$, we compute the expected value of its neighbor and non-neighbor terms under the stochastic block model and take the limit as $p \to 1$, $q \to 0$: in this limit, neighbors are exactly the same-class nodes and non-neighbors are exactly the different-class nodes, so the two terms again tend to $\beta$ and $\alpha$, respectively. Hence both expected losses approach $\beta - \alpha$. The full derivation is given in Appendix A.1.

□
Since the two loss functions are identical in their ideal cases, one may wonder: why not use the margin loss instead? The silhouette-based loss has two key advantages.

3.3.2 Lower sensitivity to graph structure. The margin loss is minimized as $p \to 1$ and $q \to 0$. However, $p$ and $q$ are attributes of the graph itself, making it difficult for a user to directly improve the performance of a model using that loss function. On the other hand, the mean silhouette depends on $\epsilon_e$ and $\epsilon_a$, the inter-/intra-class clustering error rates, instead. Even on the same graph, a silhouette-based loss can likely be improved by choosing a more suitable clustering method or distance metric. This greatly increases the flexibility of this loss function.

No negative sampling.
Negative sampling is required for most graph contrastive methods and often requires either many samples [24] or carefully chosen samples [67, 69]. This is costly, often requiring quadratic time [60]. The primary advantage of non-contrastive methods is that they avoid this step [4, 60]. The simplified silhouette avoids this issue by only working in the $n \times d$ embedding space instead of the $n \times n$ graph. It also contrasts node representations against centroid representations instead of against other nodes directly.

EXPERIMENTAL EVALUATION
We evaluate 3 variants of CARL-G: (a) CARL-G_Sil, based on the silhouette loss; (b) CARL-G_VRC, based on the VRC loss; and (c) CARL-G_sim, based on the simplified silhouette loss. We evaluate these variants on 5 datasets for node classification and thoroughly benchmark their memory usage and runtime. We then select the best-performing variant, CARL-G_sim, and evaluate its performance on 2 additional tasks: node clustering and embedding similarity search.

Node Classification.
A common task for GNNs is to classify each node into one of several classes. In the supervised setting, this is explicitly optimized for during the training process, since the GNN is typically trained with a cross-entropy loss over the labels [22, 33]. This is not possible for graph SSL methods, where labels are unavailable during the training of the GNN. As such, the convention [4, 34, 60, 61, 79] is to train a logistic regression classifier on the frozen embeddings produced by the GNN.
Following previous work, we train a logistic regression model with $\ell_2$ regularization on the frozen embeddings produced by our encoder. We compare against a variety of self-supervised baselines, both GNN-based and non-GNN-based: DeepWalk [45], RandomInit [61], DGI [61], GMI [44], MVGRL [24], GRACE [79], G-BT [4], AFGRL [34], and BGRL [60]. We also evaluate our method against two supervised models: GCA [80] and a GCN [33]. We follow [34, 60] and use an 80/10/10 train/validation/test split, with early stopping on the validation accuracy. We re-run AFGRL and BGRL using their published code and weights (where possible) on that split. Finally, we use the node2vec results from [34] and the reported results of the other baseline methods from their respective papers. See Section 4.4 for implementation details.
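A typical linear-probe evaluation on frozen embeddings can be sketched as follows; the regularization strength `C` is illustrative and would normally be tuned on the validation split.

```python
# Linear evaluation protocol sketch: logistic regression on frozen embeddings.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(H, y, train_idx, test_idx, C: float = 1.0):
    clf = LogisticRegression(penalty="l2", C=C, max_iter=2000)
    clf.fit(H[train_idx], y[train_idx])
    return accuracy_score(y[test_idx], clf.predict(H[test_idx]))
```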
4.0.2 Node Clustering. Following previous graph representation learning work [24, 34, 41], we also evaluate CARL-G on the task of node clustering. We fit a $k$-means model on the generated embeddings and use the evaluation criteria from [34]: NMI and cluster homogeneity. Following [34], we re-run our model with different hyperparameters (the embeddings are not the same as those for node classification) and report the highest clustering scores. Due to computational resource constraints, we choose to only evaluate CARL-G_sim, the overall best-performing model. We report the scores of the baseline models from [34].
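A minimal sketch of this clustering evaluation is shown below; setting the number of clusters to the number of classes is an assumption commonly made in this protocol rather than a detail stated above.

```python
# Node clustering evaluation sketch: k-means on frozen embeddings,
# scored with NMI and homogeneity against the ground-truth labels.
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, homogeneity_score

def eval_clustering(H, y, n_classes: int):
    pred = KMeans(n_clusters=n_classes, n_init=10).fit_predict(H)
    return (normalized_mutual_info_score(y, pred),
            homogeneity_score(y, pred))
```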
4.0.3 Similarity Search. Following [34], we evaluate our model on node similarity search. The goal of similarity search is, given a query node $v$, to return its $k$ nearest neighbors. In our setting, the goal is to return nodes belonging to the same class as the query node. We evaluate the performance of each method by computing Hits@$k$, the percentage of the top-$k$ neighbors that belong to the same class. Similar to [34], we evaluate our model every epoch and report the highest similarity search scores. We use $k \in \{5, 10\}$ and use the scores from [34] as baseline results.
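Hits@$k$ can be computed as in the sketch below: for each query node, we take its $k$ nearest neighbors in embedding space (cosine similarity is an illustrative choice of similarity) and measure the fraction that share the query's label.

```python
# Similarity search evaluation sketch: Hits@k over cosine similarity.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def hits_at_k(H: np.ndarray, y: np.ndarray, k: int = 5) -> float:
    sim = cosine_similarity(H)
    np.fill_diagonal(sim, -np.inf)                 # exclude the query node itself
    topk = np.argsort(-sim, axis=1)[:, :k]         # k nearest neighbors per node
    return float((y[topk] == y[:, None]).mean())   # fraction with matching labels
```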

Evaluation Results
4.1.1 Node Classification Performance. We show the node classification accuracy of our three proposed methods along with the baseline results in Table 5. CARL-G_Sil generally performs the best of all the evaluated methods, with the highest accuracy on 4/5 datasets (all except Wiki-CS). CARL-G_sim performs similarly to CARL-G_Sil on all datasets except Wiki-CS and Amazon-Photos, and still outperforms the baselines on 4/5 datasets. CARL-G_VRC is the weakest of the three variants; it only outperforms the baselines on 2/5 datasets. Since CARL-G_sim is much faster than CARL-G_Sil (see Section 3.1 and Figure 4a) without sacrificing much performance, we focus on CARL-G_sim for the remainder of the evaluation tasks.
4.1.2 Node Clustering Performance. We evaluate CARL-G_sim on node clustering and display the results in Table 2. It outperforms the baselines on 5/5 datasets in terms of NMI and 4/5 datasets in terms of homogeneity. CARL-G_sim and AFGRL [34] both encourage clusterable representations by utilizing $k$-means clustering as part of their respective training pipelines.

4.1.3 Similarity Search Performance. We evaluate CARL-G_sim on similarity search in Table 6, where it performs roughly on par with AFGRL, the best-performing baseline. This is surprising, as AFGRL specifically optimizes for the similarity search task by using $k$-NN as one of the criteria to sample neighbors.

Resource Benchmarking
We benchmark the 3 variants of our proposed method against BGRL [60] (the best-performing baseline), AFGRL [34] (the most recent baseline), G-BT [4] (the fastest baseline), and GRACE [79] (a strong contrastive baseline). We measure the time it takes to train each of the best-performing node classification models. We remove all evaluation code and purely measure training time, taking care to synchronize all asynchronous GPU operations. We use the default values from the respective papers for AFGRL and BGRL: 5,000 epochs for AFGRL and 10,000 epochs for BGRL. We use 50 epochs for CARL-G, although our method converges much faster in practice. We also measure the GPU memory usage of each method. We use the hyperparameters provided by the respective paper authors for each dataset, which is why the methods use different encoder sizes. Note that the encoder sizes greatly affect the runtime and memory usage of each of the models, so we report the layer sizes used in Table 3.
Our benchmarking results can be found in Figures 4a and 4b.

Table 3: GCN layer sizes used by the encoder for each method. The layer sizes greatly affect the amount of memory used by each model (shown in Figure 4b).

CARL-G is fast.
In Figure 4a, we show that CARL-G_sim is much faster than competing baselines, even in cases where its encoder is larger (see Table 3). BGRL is the best-performing node classification baseline, and CARL-G_sim is about 79× faster on Coauthor-CS and 57× faster on Coauthor-Physics. AFGRL is by far the slowest method, requiring much longer to train.

CARL-G works with a fixed encoder size.
We find that CARL-G works well with a fixed encoder size (see Table 3). Unlike AFGRL, BGRL, GRACE, and G-BT, we fix the encoder size for CARL-G across all datasets. This has practical advantages, allowing a user to fix the model size across datasets and thereby reducing the number of hyperparameters in the model. We observed that increasing the embedding size also increases the performance of our model across all datasets. This is not true for all of our baselines; for example, [34] found that BGRL, GRACE, and GCA will often decrease in performance as embedding sizes increase. We limited our model's embedding size to 256 for a fair comparison with the other models.

CARL-G_sim uses much less memory for the same encoder size.
When the encoder sizes are the same, CARL-G_sim uses much less memory than the baselines. The GPU memory usage of CARL-G_sim is about half that of a BGRL model of the same size. This is because BGRL stores two copies of the encoder with different weights: the second encoder's weights gradually approach those of the first encoder during training, but it still takes up twice the space compared to single-encoder models like CARL-G_sim or G-BT [4].
4.2.4 CARL-G_sim's runtime is linear with respect to the number of clusters. In Section 3.1, we mention that the runtime of CARL-G_Sil, the silhouette-based loss, is $O(n^2)$. This was the motivation for us to propose CARL-G_sim, the simplified-silhouette-based loss, which runs in $O(nk)$ time instead. In Figure 6, we verify that this is indeed the case on Coauthor-Physics, the largest of the 5 datasets.

Ablation Studies
We perform ablation and sensitivity analyses on several aspects of our model. First, we examine the sensitivity of our model with respect to $k$, the number of clusters. Second, we examine the effect of using $k$-medoids instead of $k$-means. Finally, we try to inject more graph structural information during the clustering stage to see if we are losing any information.

Effect of the number of clusters.
We perform a sensitivity analysis on $k$, the number of clusters (see Figures 5a and 5b). We find that, generally, the accuracy of our method goes up as the number of clusters increases, but as the number of clusters continues to increase, the accuracy begins to drop. This implies that, much like in traditional clustering [50], there is a "sweet spot" for $k$. However, it is worth noting that this number does not directly correspond to the number of classes in the data and is much higher than the number of classes for all of the datasets. DeepCluster [8] makes similar observations, finding that 10,000 clusters is ideal for ImageNet despite there being only 1,000 labeled classes.

Effect of the clustering method.
We study the effect of using $k$-medoids instead of $k$-means as our clustering algorithm. Both algorithms are partition-based clustering methods [64] and have seen optimizations in recent years [51, 52]. We find that the $k$-means-based CARL-G_sim generally performs better across all 4 of the evaluated datasets. The differences in node classification accuracy are shown in Table 4.

Does additional information help?
It may appear as if we are losing graph information by working only with the embeddings. If this is the case, we should be able to improve the performance of our method by injecting additional information into the clustering step. We can do this by modifying the distance function of our clustering algorithm to
$$\text{Dist}'(i, j) = \text{Dist}(z_i, z_j) + \gamma S_{ij},$$
where $S$ is the all-pairs shortest path (APSP) length matrix of $\mathcal{G}$ and $\gamma$ is a weighting coefficient. This allows us to inject node neighborhood information into the clustering algorithm on top of the aggregation performed by the GNN. However, we find that there is no significant change in performance for low $\gamma$ and that performance decreases for high $\gamma$. This helps confirm the hypothesis that the GNN encoder successfully embeds a sufficient amount of structural information.
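A sketch of this ablation follows: we compute the APSP matrix with SciPy and add it, scaled by the weight $\gamma$, to the Euclidean pairwise distances. The additive combination shown here is an assumption for illustration; such a precomputed node-to-node distance matrix would pair naturally with a medoid-style clustering rather than vanilla $k$-means.

```python
# Ablation sketch: injecting shortest-path distances into the clustering metric.
# Unreachable pairs receive a large finite penalty.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def structure_aware_distances(Z: np.ndarray, adj_sparse, gamma: float = 0.1):
    S = shortest_path(adj_sparse, method="D", unweighted=True)  # APSP lengths
    S[np.isinf(S)] = S[np.isfinite(S)].max() + 1                # disconnected pairs
    return cdist(Z, Z) + gamma * S                              # combined distance matrix
```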

Implementation Details
For fair evaluation against other baselines, we elect to use a standard GCN [33] encoder; our focus is on the overall framework rather than the architecture of the encoder, and all of our baselines also use GCN layers. Following [34, 60], we use two-layer GCNs for all datasets and a two-layer MLP for the predictor network. We implement our model with PyTorch [42] and PyTorch Geometric [19]. A copy of our code is publicly available at https://github.com/willshiao/carl-g. We adapt the code from [55], which contains implementations of BGRL [60], GRACE [79], and G-BT [4], to use the split and downstream tasks from [34]. We also use the official implementation of AFGRL [34]. We perform 50 runs of Bayesian hyperparameter optimization on each dataset and task for each of our 3 methods. The hyperparameters for our results are available at that link. All of our timing experiments were conducted on Google Cloud Platform using 16 GB NVIDIA V100 GPUs.

Limitations & Future Work
While our proposed framework has been shown to be highly effective in terms of both training speed and performance across the 3 tasks, there are also some limitations to our approach. One such limitation is that we use hard clustering assignments, i.e., each node is assigned to exactly one cluster. This can pose issues for multi-label datasets like the Protein-Protein Interaction (PPI) [81] graph dataset. One possible solution is to perform soft clustering and use a weighted average of CVIs over secondary/tertiary cluster assignments, but this would require major modifications to our method, and we reserve its exploration for future work.

ADDITIONAL RELATED WORK
Deep Clustering. A related, but distinct, area of work is deep clustering, which uses a neural network to directly aid in the clustering of data points [1]. However, the fundamental goal of deep clustering differs from graph representation learning: the aim is to produce a clustering of the graph nodes rather than just representations of them. An example is DEC [63], which uses a deep autoencoder with KL divergence to learn cluster centers, which are then used to cluster points with $k$-means.
Clustering for Representation Learning. There exists work that uses clustering to learn embeddings [8, 65, 78]. Notably, DeepCluster [8] trains a CNN with a standard cross-entropy loss on pseudo-labels produced by $k$-means. Similarly, [65] simultaneously performs clustering and dimensionality reduction with a deep neural network. The key difference between those models and our proposed framework is that we use graph data and CVI-based losses instead of traditional supervised losses.
Clustering for Efficient GNNs. There also exists work that uses clustering to speed up GNN training and inference. Cluster-GCN [11] samples node blocks produced by graph clustering algorithms and speeds up GCN layers by limiting convolutions to within each block for training and inference. However, unlike CARL-G, it computes a fixed clustering rather than updating the clustering jointly with the model. FastGCN [10] does not explicitly cluster nodes but uses Monte Carlo importance sampling to similarly reduce neighborhood sizes and improve the speed of GCNs.
Efficient $k$-means. Over the years, many variants of and improvements to $k$-means have been proposed. The original method proposed to solve the $k$-means assignment problem was Lloyd's algorithm [36]. Since then, several more efficient algorithms have been developed. Bottou and Bengio [5] propose using stochastic gradient descent to find a solution. Sculley [52] further builds on this work by proposing a $k$-means variant that uses mini-batching to dramatically speed up training. Finally, approximate nearest-neighbor search libraries like FAISS [29] allow for efficient querying of nearest neighbors, further speeding up training.
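For example, scikit-learn's `MiniBatchKMeans` implements the mini-batch variant of [52] and could serve as a drop-in replacement for the clustering step; the batch size and `n_init` below are illustrative.

```python
# Mini-batch k-means [52] as a faster drop-in for the clustering step (sketch).
from sklearn.cluster import MiniBatchKMeans

def fast_cluster(Z, k: int, batch_size: int = 1024):
    km = MiniBatchKMeans(n_clusters=k, batch_size=batch_size, n_init=3)
    return km.fit_predict(Z), km.cluster_centers_
```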

CONCLUSION
In this work, we are the first to introduce Cluster Validation Indices in the context of graph representation learning. We propose a novel CVI-based framework and investigate trade-offs between different CVI variants. We find that the loss function based on the simplified silhouette achieves the best overall performance-to-runtime ratio. It outperforms all baselines on 4/5 datasets in node classification and node clustering tasks, training up to 79× faster than the best-performing baseline. It also performs on par with the best-performing node similarity search baseline while training 1,500× faster. Moreover, to more comprehensively understand the effectiveness of CARL-G, we establish a theoretical connection between the silhouette and the well-established margin loss.

A APPENDIX

A.1 Full Proof of Equivalency to Margin Loss
Proof. For ease of analysis, we work with the simplified silhouette loss (Definition 3.1) and the non-max margin loss (Definition 3.2). Let $\mathcal{L}$ be the set of class labels and $\mathcal{L}_i$ be the class label of node $i$. Let $U_i$ be the cluster assignment of node $i$, and let $k = |\mathcal{L}|$ be the number of clusters/classes. As in Section 3.3, we define the expected inter-class and intra-class distances as $\mathbb{E}[\text{Dist}(z_i, z_j) \mid \mathcal{L}_i \neq \mathcal{L}_j] = \alpha$ and $\mathbb{E}[\text{Dist}(z_i, z_j) \mid \mathcal{L}_i = \mathcal{L}_j] = \beta$, where $\alpha, \beta \in \mathbb{R}^+$, and assume $\mathcal{G}$ follows a stochastic block model with probability matrix $P \in [0, 1]^{|\mathcal{L}| \times |\mathcal{L}|}$, where $P_{ab} = p$ if $a = b$ and $P_{ab} = q$ otherwise. Note that $q$ does not necessarily equal $1 - p$. We define the inter-class clustering error rate $\epsilon_e$ and the intra-class clustering error rate $\epsilon_a$ as in Section 3.3.

To find $\mathbb{E}[\ell_{\text{ms}}(i)]$, we first express $\mathbb{E}[a(i)]$ and $\mathbb{E}[\bar{b}(i)]$ as mixtures of $\alpha$ and $\beta$ weighted by the clustering error rates, and then take the limit as $\epsilon_a, \epsilon_e \to 0$, under which each cluster contains exactly one class and $\mathbb{E}[\ell_{\text{ms}}(i)] \to \beta - \alpha$. We similarly break down the margin loss into two terms: the expected distance to neighbors and the expected distance to non-neighbors. Under the stochastic block model, these terms are governed by $p$ and $q$; taking the limit as $p \to 1$ and $q \to 0$, we obtain $\mathbb{E}[\text{ml}(i)] \to \beta - \alpha$, matching the limit above. □

Interpreting the limits:
• $p \to 1$: we approach the case where an edge always exists between two nodes of the same class.
• $q \to 0$: we approach the case where an edge never exists between nodes of different classes.
• $\epsilon_a \to 0$: we approach the case where we always place two nodes in the same cluster if they are of the same class.
• $\epsilon_e \to 0$: we approach the case where we never place two nodes in the same cluster if they are of different classes.

Essentially, the ideal case for a margin-loss GNN is $p \to 1$ and $q \to 0$. Conversely, the ideal case for CARL-G is $\epsilon_a \to 0$ and $\epsilon_e \to 0$. As mentioned in Section 3.3, the silhouette-based loss relies on the clustering error rates rather than the inherent properties of the graph. We show that a margin-loss GNN is exactly equivalent to a mean-silhouette-loss GNN under the above conditions; some equivalence can also be drawn between them for non-ideal values of $p$, $q$, $\epsilon_a$, and $\epsilon_e$, but we consider such analysis out of the scope of this work.

A.3 Additional Experiment Details
We ran our experiments on a combination of local and cloud resources. All non-timing experiments were run on an NVIDIA RTX A4000 or V100 GPU, both with 16 GB of VRAM. All timing experiments were conducted on a Google Cloud Platform (GCP) instance with 12 Intel Skylake CPU cores, 64 GB of RAM, and a 16 GB V100 GPU. Accuracy means and standard deviations are computed by re-training the classifier on 5 different splits. The code and exact hyperparameters for this paper can be found online at https://github.com/willshiao/carl-g.

Figure 1: Comparison of our proposed methods with other baselines with respect to node classification accuracy and speedup on the Amazon-Photos dataset. See Figure 3 for results on the other datasets.

Figure 2: CARL-G architecture diagram. We describe the method in detail in Section 3.

Figure 3: Runtime vs. accuracy plots. CARL-G_sim, CARL-G_Sil, and CARL-G_VRC are our proposed methods. Speedup is relative to the slowest baseline (AFGRL). AFGRL and GRACE run out of memory on Coauthor-Physics.
Figure 4a: Mean total training time. The y-axis is on a log scale.

Figure 4: Mean total training time (left) and max GPU usage (right) for each model. CARL-G_VRC is the fastest, with generally the least amount of memory used. CARL-G_sim uses the same amount of memory but is slightly slower. Note that not all of the baselines use the same encoder size; see Table 3 for encoder sizes.

Figure 5: Node classification accuracy of CARL-G_sim on Amazon-Photos and Coauthor-Physics with different numbers of clusters.

Figure 6: Training time versus the number of clusters for CARL-G_sim on Coauthor-Physics. As expected (see Section 3.1), the training time is linear with respect to the number of clusters.

Table 1: Comparison of different self-supervised graph learning methods. *: We use CARL-G_sim as the representative method since it is the best-performing across all of the criteria.

Table 2: Node clustering performance in terms of cluster NMI and homogeneity. CARL-G_sim outperforms the baselines on 4/5 datasets.

Table 5: Node classification accuracy. Bolded entries indicate the highest accuracy for that dataset. Underlined entries indicate the second-highest accuracy. OOM indicates out-of-memory.

Table 6: Performance on similarity search. Surprisingly, CARL-G performs fairly well on this task despite not being explicitly optimized for it (unlike AFGRL, which uses $k$-NN during training).

Table 8: Statistics for the datasets used in our work.