B2-Sampling: Fusing Balanced and Biased Sampling for Graph Contrastive Learning

Graph contrastive learning (GCL), aiming for an embedding space where semantically similar nodes are closer, has been widely applied to graph-structured data. Researchers have proposed many approaches to define positive and negative pairs (i.e., semantically similar and dissimilar pairs) on the graph, serving as labels to learn their embedding distances. Despite their effectiveness, those approaches usually suffer from two typical learning challenges. First, the number of candidate negative pairs is enormous, so it is non-trivial to select representative ones to train the model effectively. Second, the heuristics (e.g., graph views or meta-path patterns) to define positive and negative pairs are sometimes less reliable, causing considerable noise for both "labelled" positive and negative pairs. In this work, we propose a novel sampling approach, B2-Sampling, to address the above challenges in a unified way. On the one hand, we use balanced sampling to select the most representative negative pairs regarding both topological and embedding diversities. On the other hand, we use biased sampling to learn and correct the labels of the most error-prone negative pairs during training. Balanced and biased sampling can be applied iteratively to discriminate and correct training pairs, boosting the performance of GCL models. B2-Sampling is designed as a framework that supports many known GCL models. Our extensive experiments on node classification, node clustering, and graph classification tasks show that B2-Sampling significantly improves the performance of GCL models with acceptable runtime overhead. Our website https://sites.google.com/view/b2-sampling/home provides access to our code and additional experimental results.


INTRODUCTION
Graph Contrastive Learning (GCL) has recently emerged as an important branch of self-supervised graph representation learning [27,33]. GCL methods project nodes in graphs into an embedding space where semantically similar (positive) nodes are closer while semantically different (negative) nodes are farther apart. The learned embeddings can widely facilitate many graph-based applications such as node classification [7,21], graph classification [18], collaborative filtering [28], and community detection [23,32].
Technically, typical GCL methods consist of contrasting heuristics and contrastive objective design [33]. The contrasting heuristics define the positive and negative pairs to guide the contrastive training [7,15,35], and the contrastive objectives "pull" positive pairs closer and "push" negative pairs farther in the embedding space. For example, graph augmentation is a representative way to generate contrasting pairs in homogeneous graphs. Based on the original graph G, its augmented graph G′ can be generated by edge removing [7] or feature masking [35], etc. Given an anchor node v ∈ G, different heuristics define v's positive set D+ and negative set D− (D+, D− ⊂ G ∪ G′). On the other hand, some efforts focus on adapting contrastive objectives such as Information Noise Contrastive Estimation (InfoNCE) [15,29,35], Jensen-Shannon Divergence (JSD) [7,21], and Triplet Margin loss (TM) [24] to GCL.
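As an illustration, the two augmentations above can be sketched minimally as follows (assuming an edge-list graph and row-wise feature lists; the function names are ours, not an API from the paper):

```python
import random

def edge_removal(edges, drop_prob=0.2, seed=0):
    """Create an augmented view G' by randomly dropping edges of G."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() >= drop_prob]

def feature_masking(features, mask_prob=0.3, seed=0):
    """Create an augmented view by zeroing a random subset of feature
    dimensions, with the same mask shared across all nodes."""
    rng = random.Random(seed)
    dim = len(features[0])
    mask = [0.0 if rng.random() < mask_prob else 1.0 for _ in range(dim)]
    return [[x * m for x, m in zip(row, mask)] for row in features]
```

A node v in G and its copy in the augmented view then form a positive pair under the congruent-node heuristic, while the remaining nodes serve as negatives.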
While considerable contributions have been made to the above technical components in GCL, less attention has been paid to GCL sampling, i.e., how to effectively sample positive and negative graph node pairs to further boost performance. Generally, GCL sampling faces two problems: training pair representativeness and training pair noise.

Training Pair Representativeness. Training pair representativeness indicates that, among the sizeable set of negative pairs, some pairs are more useful than others for training the model in different training iterations [17,24]. Thus, it is non-trivial to sample the most representative and useful pairs from the enormous negatives to update the model. Although some negative sampling techniques have been well studied in contrastive learning (CL) for visual representations [4,8,16], we found that borrowing those solutions has limited effectiveness in boosting GCL methods (see experimental results in Section 4.2). Taking graph node pairs as the training samples, GCL sampling has its own challenge in defining diversity and representativeness for negative sampling, regarding both the topological properties and the used contrasting heuristics.

Training Pair Noise. Training pair noise indicates that the heuristics defining positive/negative pairs are sometimes unreliable. Some heuristics choose positive pairs via the connection strength between neighbouring nodes [22], and some choose them by only considering the congruent nodes in augmented graphs [26,32,35]. However, our empirical studies show that (1) many neighbouring nodes can have very different semantics (e.g., two connected Facebook users can just be accidental friends, sharing few common interests) and (2) many h-hop (h can be large) nodes can share similar semantics (e.g., an h-hop friend of a user may still be in the same community or share similar interests). Figure 1 shows a qualitative example where distant graph nodes can be semantically similar while neighbouring nodes can be semantically different. More importantly, such cases are more prevalent than expected. Figure 2 shows a quantitative study where a considerable number of distant nodes share the same label (i.e., are semantically similar) in graph-structured data. Given the complications of real-world graph semantics, any (topological) heuristic defining positive/negative pairs can only be generally correct and still suffers from considerable noise during training.
In this work, we propose the B2-Sampling (Balanced and Biased Sampling) technique to address the above challenges in a unified way. We address the first challenge by sampling for discrimination. Specifically, we propose a balanced sampling technique regarding graph structures. We define topological diversity and runtime embedding diversity over the training negative pairs to choose representative pairs for model training. We address the second challenge by sampling for correction. Specifically, we propose a biased sampling technique based on our observed slow learning effect of noisy graph node pairs. The effect indicates that, during training, embeddings of noisy pairs are usually harder for the model to fit than those of clean pairs, under the assumption that most pairs are clean. Based on such an observation, we design a noise-likelihood measurement based on how "smoothly" the model can fit the embeddings of node pairs, and apply biased training on them to correct the potential noise. Balanced and biased sampling can be applied sequentially and interactively to further boost the performance of existing GCL models. We apply B2-Sampling to node-wise GCL methods such as GCA [35], GRACE [34], and HeCo [22] on eight datasets. Compared to the state-of-the-art (SOTA) CL/GCL negative sampling techniques (e.g., MoCHi [8] and ProGCL [26]), B2-Sampling can significantly boost the performance of those GCL methods. Meanwhile, our ablation studies further confirm the effectiveness of both sampling strategies. Given that B2-Sampling can be universally equipped to many GCL methods (e.g., node-wise and graph-wise), we have designed B2-Sampling as a framework to be integrated with existing GCL models.
In summary, the main contributions of our work are as follows:
• Technique: We propose B2-Sampling, a novel GCL-oriented sampling technique that boosts performance from the perspective of contrasting pair diversity (by balanced sampling) and noise (by biased sampling).
• Tool: We build the B2-Sampling framework based on our technique to integrate with any GCL model, facilitating practical use.
• Experiment: We conduct extensive experiments on GCL methods and various baseline techniques on eight datasets, evaluating the effectiveness of B2-Sampling on node-level and graph-level downstream tasks.


PROBLEM DEFINITION
Let D denote the distribution of the sampling space induced by the contrasting heuristics, D* the (unavailable) ground-truth distribution, and P the distribution learned by the model. Intuitively, we correct D to D′ for a distribution closer to the ground truth D* and then learn P against D′ with an improved sampling strategy to achieve better performance. Thus we maximize:

    I(D*; D′) + I(D′; P),    (1)

where I(·; ·) represents the mutual information (MI) between two distributions, and D−′ is the representative negative set used for training. In this work, we design biased sampling to re-adjust the distribution of the sampling space to make D′ closer to D*, so as to maximize I(D*; D′). We sample representative contrasting pairs through balanced sampling to train the model and learn a P closer to D′, so as to maximize I(D′; P). Balanced and biased sampling interact with each other to achieve our research goal.

METHOD
Overview. Figure 3 shows how our B2-Sampling serves as a plugin in the general GCL paradigm. Given a graph G, different heuristics are designed to derive positive and negative node pairs. Moreover, a contrastive objective measures the distance between the embeddings of a pair of nodes and compares it against the estimated "label" (i.e., positive or negative) in the sampling space. B2-Sampling fits in between the contrasting heuristics and contrastive objective modules, and (1) selectively samples representative (negative) pairs in the space to train the model and (2) adaptively re-adjusts the sampling space.
In practice, the two-phase sampling strategy can be applied repetitively. Given the distribution of the existing sampling space D_t and its resulting embeddings Z_t (t indicates the training iteration), we apply the balanced sampling technique on D_t regarding Z_t to select a representative negative subset D−′ in the first phase. During the training, we collect the runtime learning dynamics of the training pairs to guide the correction of the sampling space, reassigning the sets of positive pairs D+ and negative pairs D−. As a result, we obtain D_{t+1} and Z_{t+1}. By this means, we evolve the sampling space iteratively and finally boost the learning performance.
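Conceptually, the iteration can be sketched as follows (all function names here are placeholders standing in for the two B2-Sampling phases, not APIs from the paper):

```python
def b2_training_loop(pairs, epochs, budget, train_step, balanced_sample, biased_correct):
    """Two-phase loop: per iteration, phase 1 (balanced sampling) picks a
    representative subset of negatives to train on, and phase 2 (biased
    sampling) corrects labels of slow-learning pairs, so the sampling
    space D_t evolves into D_{t+1}."""
    for _ in range(epochs):
        negatives = [p for p in pairs if p["label"] == "neg"]
        subset = balanced_sample(negatives, budget)   # phase 1: representative D-'
        train_step(subset)                            # contrastive update on Z_t
        pairs = biased_correct(pairs)                 # phase 2: reassign D+/D-
    return pairs
```

The loop shows only the control flow; any concrete GCL model supplies the actual encoder update and the two sampling phases.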

3.1 Balanced Sampling
Similar to traditional contrastive learning, GCL faces the challenge of selecting representative negative pairs. We formalize the problem of balanced sampling as follows. Given a limited budget b (b > 0) and the negative pair set D−, we aim to sample a subset D−′, i.e.,

    D−′ = argmax_{S ⊆ D−, |S| = b} K(S).    (2)

Here, K represents the information diversity over graph node pairs regarding topological diversity and embedding diversity. Appendix A.1 gives its theoretical explanation. The topological diversity measures how representative a pair of nodes is with respect to the graph structure (input space), and the embedding diversity measures how diversely the embeddings of the sampled node pairs are distributed in the embedding space (output space). To achieve the goal defined in Equation 2, we define and measure the topological and embedding diversity of the training pairs as estimations of K(·).
We adopt the evenness measure [1] to sample negative pairs.Intuitively, we require that the sampled negative pairs are uniformly distributed over diversified topological and embedding distances.
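For concreteness, one common evenness measure is Pielou's J = H / ln(S); whether [1] uses exactly this form is our assumption, but it captures the uniformity property we require:

```python
import math
from collections import Counter

def evenness(values):
    """Pielou's evenness J = H / ln(S) of a categorical sample:
    1.0 means perfectly uniform; lower means more imbalanced."""
    counts = Counter(values)
    n = sum(counts.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    s = len(counts)
    return h / math.log(s) if s > 1 else 1.0
```

Under balanced sampling, the evenness of the sampled pairs' distance distribution should move toward 1.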
Topological Diversity. We use the shortest path distance between a pair of nodes as an indicator of topological diversity (Appendix A.2 gives explanations). The shortest path distance strikes a balance between computational cost and structural informativeness, compared to other structures such as loops and triangles.
Figure 4(a) shows the empirical distribution of the shortest path distances of node pairs on the Cora dataset. The empirical frequencies of different shortest path distances are generally imbalanced, which can hardly be mitigated by a random sampling strategy (the distribution in blue in Figure 4(b)). In contrast, our shortest path distance-weighted sampling results in a "flattened" distribution (in orange), as shown in Figure 4(b), leading us to sample a balanced number of negative pairs with more diversified shortest path distances.
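A minimal sketch of such inverse-frequency weighting over shortest path distances (sp_dist is a hypothetical distance lookup; the exact weighting used in the paper may differ):

```python
from collections import Counter

def topo_weights(anchor, negatives, sp_dist):
    """Weight each negative node inversely to the empirical frequency of
    its shortest path distance from the anchor, flattening the distance
    distribution of the sampled negatives."""
    dists = [sp_dist(anchor, v) for v in negatives]
    freq = Counter(dists)
    weights = [1.0 / freq[d] for d in dists]
    total = sum(weights)
    return [w / total for w in weights]
```

A negative at a rare distance thus receives more sampling mass than one at a crowded distance.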
We achieve balanced sampling with topological diversity as follows. Given an anchor node v_i, its negative node set D−_i, and a negative node v_j ∈ D−_i, assume that the shortest path distance d_sp(·, ·) between v_i and v_j equals t, i.e., d_sp(v_i, v_j) = t. Then the shortest path distance-weighted sampling probability p^t_ij to select v_j is estimated to be inversely proportional to the number of v_i's negatives at that distance (we adopt sampling without replacement to strike a good balance between overwhelmingly large categories and duplicated sampling):

    p^t_ij ∝ 1 / |{v_k ∈ D−_i : d_sp(v_i, v_k) = t}|.    (3)

Embedding Diversity. To select negative samples uniformly distributed over the embedding space, we estimate the distribution of high-dimensional vectors during training and sample the pairs regarding their high-dimensional embedding diversity.
The probability distribution p_norm(·) of the Euclidean distance d_e between any two normalized n-dimensional embeddings is [6]:

    p_norm(d_e) ∝ d_e^{n−2} · (1 − d_e²/4)^{(n−3)/2},

which concentrates sharply around √2 in an n-dimensional space when n ≥ 128 [24,31].
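As a quick numerical illustration of this concentration, and of how distances far from √2 can be upweighted to flatten the sampled embedding distances (a sketch of the idea only; the floor parameter beta and the exact weighting form are our assumptions):

```python
import math
import random

def unit_vector(n, rng):
    """A random unit vector in n dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_pair_distance(n, trials=200, seed=0):
    """Average Euclidean distance between random normalized embeddings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u, v = unit_vector(n, rng), unit_vector(n, rng)
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return total / trials

def embed_weights(distances, beta=0.1):
    """Upweight pairs whose embedding distance is far from sqrt(2),
    where random negatives concentrate; beta is a probability floor."""
    w = [max(abs(d - math.sqrt(2.0)), beta) for d in distances]
    total = sum(w)
    return [x / total for x in w]
```

For n = 128, the measured mean distance sits very close to √2, which is why uniform sampling over embedding distances requires the reweighting above.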
This means that, given an anchor node v_i, randomly sampling its negative nodes is likely to yield nodes that are about √2 away (in the embedding space). Thus, for the anchor node v_i, we define the embedding distance-weighted sampling probability p^e_ij for selecting its negative node v_j as:

    p^e_ij ∝ max(|d_e(z_i, z_j) − √2|, β),    (4)

where β is a parameter that provides the least probability to sample any negative node [24], z_i = f(v_i) and z_j = f(v_j) are the normalized embeddings of nodes v_i and v_j, and d_e(z_i, z_j) is their embedding distance. By this means, we are able to sample negative pairs uniformly regarding the embedding distance during the whole training process. Finally, we sample a negative node v_j by:

    p_ij = λ · p^t_ij + (1 − λ) · p^e_ij,    (5)

where the parameter λ is a coefficient that balances the effect of the two distance-weighted sampling strategies. Moreover, Figure 4(c) shows the shifted sampling distribution (in orange) relative to the original sampling distribution. Thus, given an anchor node v_i, we obtain its representative negative node set D−′_i through balanced sampling. The InfoNCE-based balanced sampling (B1_S for short) loss is defined as:

    L_B1 = −log ( e^{θ(z_i, z_i′)/τ} / ( e^{θ(z_i, z_i′)/τ} + Σ_{v_j ∈ D−′_i} e^{θ(z_i, z_j)/τ} ) ),    (6)

where z_i′ is the embedding of v_i's positive counterpart, θ(·, ·) is a similarity function, and τ is the temperature.

3.2 Biased Sampling
Biased sampling is designed to mitigate the noisy negative pairs introduced by the heuristics h(·), i.e., to maximize I(D*; D′) (see Equation 1). The challenge lies in that the ground-truth labels (i.e., true positive and negative pairs) are not available in practice. Different from researchers who de-noise by fitting the overall distribution with mixed distributions [26], we investigate the discrepancies between noisy pairs (i.e., false negative pairs) and clean pairs (i.e., true negative pairs) from a dynamic perspective.
Our rationale is that the model fits the noisy and clean pairs in different manners during the training process: noisy pairs are usually in-distribution pairs with incorrect labels, which means that they produce signals similar to the majority of the normal training pairs but diverge with different labels. Therefore, they can cause a contrary learning effect. Specifically, assume that we have a clean pairs set D_c and a noisy pairs set D_n, and measure the learning speed of each pair by the change of its embedding distance between an initial epoch t_ini and a final epoch t_end, where z^{t_ini}_• and z^{t_end}_• are the learned embeddings of node v_• at epoch t_ini and t_end, respectively. Figure 5 shows our empirical investigation of the model's learning efficiency on noisy and clean samples on two datasets (i.e., Cora and ACM), where we draw the learning speeds of clean and noisy pairs under different shortest path distances. We can see that the model is "slower" to learn on the noisy pairs compared to the clean pairs. We call such a phenomenon the slow learning effect of noisy data, introduced by any contrasting heuristics. We include its tests of statistical significance in Appendix A.3.
Based on such an empirical effect, we track the learning speed of the training pairs and correct their "labels" when some pairs manifest the slow learning effect. Specifically, we introduce a hyperparameter ε to sample the pairs with the slowest learning effect. Thus, given an anchor node v_i, its new positive node set D+_i collects the negatives whose learning speeds fall below ε, and their labels are flipped from negative to positive. As hard negative samples have been proved helpful for learning [16], we can use a conservative ε to avoid selecting them (ε can simply be set to 0, or to the values of the first five percent of learning speeds, sorted in ascending order, at the first three shortest path distance lengths). The biased sampling (B2_S for short) loss is defined as:

    L_B2 = −log ( e^{θ(z_i, z_j)/τ} / ( e^{θ(z_i, z_j)/τ} + Σ_{v_k ∈ V \ D+_i} e^{θ(z_i, z_k)/τ} ) ),    (7)

where V is the full node set. After biased sampling, the positive and negative pairs are reassigned, and an adapted sampling distribution D′ is obtained.
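A minimal sketch of the correction step under the slow learning effect (the dict-based pair format and function names are ours; speeds below the conservative threshold eps flag likely false negatives):

```python
def learning_speed(d_ini, d_end):
    """Change of a pair's embedding distance between the initial and final
    epoch; for a true negative, the distance should grow during training."""
    return d_end - d_ini

def correct_labels(pairs, eps):
    """Flip negative pairs whose learning speed falls below eps to
    positive, correcting likely false negatives."""
    for p in pairs:
        if p["label"] == "neg" and learning_speed(p["d_ini"], p["d_end"]) < eps:
            p["label"] = "pos"
    return pairs
```

A negative pair whose embedding distance barely grows over training is thus relabeled, while fast-separating (true or hard) negatives keep their labels.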

EXPERIMENTS
4.1 Experimental Setup
We evaluate B2-Sampling with the following research questions:
• RQ1 (Overall Experiment): How effective is our B2-Sampling compared to the popular negative sampling techniques known in CL and GCL?
• RQ2 (Ablation Study): How do balanced sampling and biased sampling contribute to the overall performance?
• RQ3 (Sensitivity Analysis): How does the runtime configuration of B2-Sampling affect the overall performance?
• RQ4 (Applicability Study): Can B2-Sampling also boost the performance of graph-level GCL methods?
4.1.1 Base Methods: Three Representative GCL Methods. GRACE [34], GCA [35], and HeCo [22] are three representative GCL methods for representing nodes of homogeneous/heterogeneous graphs. Table 2 shows their key components in GCL paradigm design, and detailed descriptions are given in Appendix B.1. We equip these base methods with B2-Sampling and six other popular CL/GCL negative mining techniques to see how they boost their performance.

4.2 RQ1: Effectiveness
We evaluate GCL base models + CL/GCL sampling strategies on node classification and node clustering tasks. The best results are shown in bold, and the second-best results are underlined. "↑" and "↓" refer to performance improvement and drop compared with the base models, respectively. Overall, our B2-Sampling performs the best and consistently improves the performance of the three base models on different node-level tasks on all datasets, while the other baselines fail to provide consistent improvements over the base models and even worsen them.
Node Classification. As shown in Table 3 and Table 4, B2-Sampling always performs better than all baselines on all datasets. Generally, CL negative mining strategies (i.e., DCL, HCL, MoCHi, Ring) bring limited improvements or degrade the performance of GCL-based models on most datasets, since they are designed for mining negatives in vision and fail to leverage the topological structures. RS performs better even than some well-designed negative mining strategies, perhaps because it selects negatives following the exact distribution of the whole negative set. ProGCL improves the performance of base models on most datasets, but sometimes only by a small margin. Our investigation of the debiased sampling approach of ProGCL shows that it usually introduces the risk of misrecognizing false negatives (∼40%). In contrast, the biased sampling strategy based on the slow learning effect in our B2-Sampling carries a misrecognizing risk of 0%-20%. Our B2-Sampling enhances the ACC of GCA by 0.3% to 1.8%, and its ACC on Cora even outperforms some supervised node representation learning methods. It enhances HeCo by 0.3% to 2.8% in Micro-F1 scores. Moreover, for a test of statistical significance, we conduct two-sample t-tests on the SOTA baselines (underlined) and our B2-Sampling. The alternative hypothesis is H1: metric(SOTA) < metric(B2-Sampling). The last line shows that all the p-values are smaller than 0.05, indicating that B2-Sampling outperforms the SOTA baselines with statistical significance. Furthermore, we verified that our B2-Sampling also achieves strong robustness as the noise ratio of negative pairs increases. Please see Appendix B.4 for details.
Node Clustering. Table 5 shows the results of HeCo enhanced by the baselines and B2-Sampling. We can see that most baselines improve HeCo on ACM but worsen it on DBLP and Freebase, indicating that they fail to handle the noisy-label problem (especially in DBLP and Freebase). ProGCL performs well but is inapplicable to Freebase since it cannot distinguish the positive and negative distributions according to the similarity of embeddings (i.e., '-' in Table 5 without reasonable results). Our B2-Sampling recognizes positive and negative pairs by their learning speeds rather than the similarity of embeddings, achieving significant improvements over HeCo on all datasets. Specifically, B2-Sampling improves the NMI of HeCo by 4.1% to 6.8%, and the ARI by 3.7% to 9.6%, demonstrating its effectiveness in detecting the community structure of graphs.

4.3 RQ2: Ablation Study
B2-Sampling consists of balanced sampling (B1_S) and biased sampling (B2_S). We disable each of them in turn to evaluate their contributions to the overall performance. We conduct the ablation study based on HeCo, with the Micro-F1 measurement for the node classification task and the NMI and ARI measurements for the node clustering task.
As shown in Table 6, B1_S performs better on DBLP while B2_S performs better on ACM. The difference largely lies in the degree to which different datasets suffer from the imbalance and noisy-label problems. The positive pairs in DBLP are much more abundant than those in ACM; thus, the boost from B2_S on DBLP is less significant than that on ACM. Overall, both sampling components are helpful for training.
To further compare the effects of B2-Sampling, B1_S, and B2_S, we visualize the embedding distance distributions of positive and negative pairs in Figure 6. We compare embeddings learned by HeCo, HeCo+B1_S, HeCo+B2_S, and HeCo+B2-Sampling in a pairwise way. From a visual point of view, the smaller the overlapping area between the two distributions, the better one method performs than the other. Overall, our B2-Sampling achieves the least overlap between the positive and negative pair distributions (see Figure 6(a)), demonstrating its strong ability to discriminate positive and negative pairs. Figures 6(b) and 6(c) show that both B1_S and B2_S draw positive pairs closer and push negative pairs farther in embedding distance compared with the original HeCo. In addition, B1_S and B2_S show respective advantages in learning positive pairs and negative pairs (see Figure 6(f)). B2-Sampling learns the best embeddings by combining the advantages of B1_S and B2_S.

4.4 RQ3: Sensitivity Analysis
We perform sensitivity analysis on three critical hyper-parameters in B2-Sampling: the sampling ratio r for selecting representative negative samples, the coefficient λ that balances the shortest path distance weight and the embedding distance weight in balanced sampling, and the threshold ε in biased sampling for correcting the labels of noisy samples. We report the Micro-F1 and NMI values on ACM and Freebase by varying r and λ in Figures 7(a) and 7(b), and ε in Figure 7(c), and draw the following conclusions: (1) The balanced sampling strategy is robust against a variety of r and λ. We observe that the performance drops slightly as r increases, showing that more negative samples do not necessarily mean better performance. (2) A smaller ε is the more practical choice for B2-Sampling. We can see that, when ε < 0, the values of Micro-F1 and NMI are stable with the increase of ε; in contrast, when ε > 0, Micro-F1 and NMI decrease with the increase of ε. Overall, when the learning speed is not "that slow", the sampled pairs are likely to have the correct label.

4.5 RQ4: Applicability Study
We probe the applicability of B2-Sampling to graph-level learning tasks. Since our balanced sampling leverages a local topological property (the shortest path distance) among node pairs, which does not transfer across separate graphs, we take biased sampling as a light version of B2-Sampling and apply it to graph-level GCL models. GraphCL [29] and MVGRL [7] are two well-known GCL models based on graph-level contrastive losses; we adopt seven datasets from their works and conduct graph classification tasks. As shown in Table 7, our biased sampling consistently enhances the performance of the base models, showing its effectiveness and applicability.

RELATED WORKS
5.1 Graph Contrastive Learning
Graph contrastive learning is an increasingly popular self-supervised learning approach [7,19,21,26,29,34,35]. GCL methods in the L-L (local-local) contrasting mode define positive and negative pairs at the node level, i.e., the positive and negative samples are node pairs. For example, given an anchor node, GCC [15] designs its positives and negatives in other networks to learn transferable structural node representations; GRACE [34] treats the congruent node from another augmented graph as the positive one and all remaining nodes as negatives; GCA [35] adopts the same design as GRACE but further equips it with adaptive data augmentation, learning important patterns underneath the input graph. Our B2-Sampling can be easily applied to GCL methods in the L-L contrasting mode to make a further improvement.
GCL methods in the G-G/G-L contrasting modes define positive and negative pairs at the graph level, i.e., pairs formed between a node and a graph representation or between two graph representations. For example, given a graph G, DGI [21] and MVGRL [7] apply graph augmentation to obtain another mutant graph G′, and then take nodes in G as positives and nodes in G′ as negatives; GraphCL [29] takes multiple augmented graphs generated from G as positives, and other graphs in the same minibatch as negatives. InfoGraph [18] encodes multiple graphs and maximizes the MI of the graph-node, graph-edge, and graph-context pairs to obtain representations of substructures at different scales. These methods output graph embeddings for graph-level tasks. Since the balanced sampling in B2-Sampling relies on a local topological property (the shortest path distance) among node pairs, which does not transfer across separate graphs, we can take biased sampling as a light version of B2-Sampling and apply it to GCL methods in these two modes.

5.2 CL/GCL Sampling
There are two noteworthy kinds of negative samples, false negative samples and hard negative samples, which guide a CL/GCL method to correct its mistakes more quickly [4,14]. False negative samples are samples with the same labels that are nevertheless treated as dissimilar pairs because of the contrasting heuristics in CL/GCL methods. Hard negative samples are pairs that are mapped nearby in the embedding space but should be far apart. Some CL sampling strategies develop debiasing terms to avoid contrasting false negative pairs. For example, DCL [4] decomposes the data distribution into positive and negative distributions and develops a debiased contrastive objective to relieve the sampling bias. Meanwhile, plenty of CL sampling strategies are interested in hard negative samples: they adopt different hard negative mixing strategies [8,10] or build a tunable sampling distribution that prefers hard samples [16] to generate diverse and informative negative samples. In spite of their promising performance in the field of computer vision, they bring limited improvement or even performance drops when applied to graphs [26,33].
Recently, some researchers have paid more attention to hard negative samples in graph-structured data. ProGCL [26] distinguishes true and false negatives by fitting a beta mixture model to the similarities of embeddings and proposes two strategies based on it: ProGCL-weight re-weights the positive and negative terms in the denominator of the loss, and ProGCL-mix synthesizes more hard negatives. However, ProGCL-weight may allocate more weight to negative samples that are easy to train. Moreover, our empirical experiments show that once two distributions are mixed, especially if one distribution is the minority, there is limited information to decompose the overall distribution. M-Mix [30] follows i-Mix [10] and dynamically assigns different mixing weights when generating hard negatives. One of its modules, M-Mix-up, utilizes the adjacency matrix for denoising in graphs. Different from them, our B2-Sampling samples informative negatives by measuring topological and embedding diversities, and further corrects the labels of false negatives based on our slow learning effect observation.

CONCLUSION
In this work, we propose B2-Sampling, a two-phase sampling strategy applicable to a class of GCL methods for further boosting performance. Balanced sampling in phase one selects representative negative pairs with diversified shortest path distances and embedding distances to consistently provide information for training. Biased sampling in phase two corrects the potential false negative pairs, identified by their slow learning effect, to denoise the training pairs. Through extensive experiments on different node-level and graph-level downstream tasks, our B2-Sampling performs the best compared to various baselines. Our evaluation shows that B2-Sampling is easily compatible with node-wise GCL methods in the local-local contrasting mode, and its light version (biased sampling) is also applicable to GCL methods in the G-G and G-L contrasting modes, showing its superiority.

A METHOD
A.1 Explanation of Balanced Sampling
Assume that the samples in D− follow a Gaussian distribution N(μ, σ²) and that, after balanced sampling, the selected samples in D−′ follow N(μ′, σ′²). We use the information entropy H to represent the diversity K, and the entropy H of N(μ, σ²) is:

    H = (1/2) ln(2πeσ²).

As we claim in Section 3.1, after balanced sampling, σ′ > σ, so the entropy of the selected samples in D−′ is larger than that of the samples in D−. Therefore, we obtain more diverse samples after balanced sampling.
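A quick numerical check of this monotonicity (the closed form is the standard differential entropy of a Gaussian):

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * ln(2*pi*e*sigma^2);
    it grows monotonically with sigma, so a wider spread of selected
    samples means higher diversity."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma * sigma)
```

Since the entropy depends only on sigma and increases with it, any sampling step that widens the spread of the selected pairs raises their diversity.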

A.2 Topological Diversity
During balanced sampling, we need to measure the topological distance between the nodes within the negative pairs to explore topological diversity. We find that the shortest path distance is strongly associated with topological diversity. As shown in Figure 8, the x-axis indicates the shortest path distances, and the y-axis indicates the ratio of negative pairs. We can see that the negative pairs lie at all shortest path distances. By following the shortest path distance, we can sample various negative pairs with different distances, capturing the topological diversity of the graph and measuring topological diversity from local to global.

B.1 Base Methods
GRACE [34], GCA [35], and HeCo [22] are three representative methods for representing nodes of homogeneous information networks and heterogeneous information networks with GCL paradigms, respectively. They adopt variants of the InfoNCE loss and design node-node level positive and negative pairs:
• GRACE focuses on contrasting embeddings at the node level. It generates two graph views by corruption and learns node representations by maximizing the agreement of node representations in these two views. For a given node in a homogeneous graph, the congruent node in the augmented graph is defined as its positive sample, and the remaining nodes in the two views (the original graph and the augmented graph) are negative samples.
• GCA enhances GRACE by adopting adaptive edge removing (ER) and adaptive feature masking (FM) for graph augmentation. Adaptive graph augmentations help GCA learn representations that are insensitive to perturbations on unimportant nodes and edges. Its definition of the contrasting pairs is the same as GRACE's.
• HeCo selects a meta-path view and a network schema view, according to the structural characteristics of heterogeneous information networks, for the graph augmentation. For a given node, its positive and negative samples are determined by the number of meta-paths connecting them. If two nodes are connected by many meta-paths, they are positive samples (i.e., a node can have multiple positive samples), and all remaining nodes are negative samples. The node embeddings in a pair are from different views, realizing cross-view self-supervision.

B.2 Datasets
The eight datasets used in this paper are from four kinds of networks:
• Movie Knowledge Base: Freebase-Movie is a heterogeneous information network where nodes represent movies labeled by genres, and edges represent the relations among actors, directors, and producers.
Their statistical details are shown in Table 10.

B.3 Implementation Details
All model parameters are initialized with Glorot initialization [5] and trained using the Adam SGD optimizer [9] on all datasets. We list four crucial hyper-parameters in Table 9: the negative sampling ratio r and the balance coefficient λ in the first phase, and the threshold ε and the used shortest path distance length in the second phase.

B.4 Robustness against Noisy Negative Pairs
To explore the robustness against noisy negative pairs, we further conduct a robustness experiment on the node classification task (based on the experiments in Table 3). Specifically, we test B2-Sampling on Cora and Amazon-Photo (Photo) as the noise ratio of negative pairs increases from 5% to 20%. As shown in Figure 9, we observe that (1) B2-Sampling still achieves significant improvements over the base model GCA across different noise ratios; and (2) B2-Sampling also consistently outperforms the SOTA baselines (i.e., the underlined methods in Table 3).
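The robustness experiment above perturbs the labels of a controlled fraction of negative pairs. A minimal sketch of such noise injection (our own helper, not the paper's code) could look like:

```python
import random

def inject_label_noise(labels, noise_ratio, seed=0):
    """Flip a `noise_ratio` fraction of negative-pair labels (0 -> 1) to
    simulate unreliable pair definitions, as in the 5%-20% sweep."""
    rng = random.Random(seed)
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    flip = set(rng.sample(neg_idx, int(len(neg_idx) * noise_ratio)))
    return [1 if i in flip else y for i, y in enumerate(labels)]
```

Fixing the seed makes the corrupted pair set reproducible across runs, so different sampling strategies are compared on identical noise.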

Figure 1 :
Figure 1: (1) P1 and P2 are (neighbouring) papers written by the same author, Dr. Jiawei Han, but in different domains. (2) P1 and P3 are distant papers (i.e., their shortest path distance is 6), but in the same domain.

Figure 2 :
Figure 2: The numbers of node pairs with the same label (positive pairs) and different labels (negative pairs) at each shortest path distance in Cora and ACM. Distant nodes (with long shortest path distances) can have similar semantics, while close nodes (with short shortest path distances) can also have different semantics.

Figure 3:
Figure 3: B2-Sampling serves as a plugin in the overall graph contrastive learning paradigm, adaptively re-adjusting and correcting node pairs from the sampling space based on the graph structure and runtime node embeddings.


Figure 4 :
Figure 4: (a) shows the empirical distribution of shortest path distances on Cora. (b) and (c) show the effects of shortest path distance-weighted sampling (SS) and embedding distance-weighted sampling (ES), compared to random sampling (RS).
and a noisy pair set $\mathcal{D}_n = \{(v_i, v_j) \mid y(v_i, v_j) \nsim y^*\}$. Given that $\mathcal{D}_c$ and $\mathcal{D}_n$ can provide conflicting signals to the model, it is harder for the model to fit $\mathcal{D}_n$ than $\mathcal{D}_c$ when the training set is $\mathcal{D}_c \cup \mathcal{D}_n$ and $|\mathcal{D}_c| \gg |\mathcal{D}_n|$. Metaphorically, noisy pairs make the model "struggle" more to learn their embeddings. Slow Learning Effect. Given a training pair $(v_i, v_j)$ whose training process starts at epoch $t_{\text{ini}}$ and ends at epoch $t_{\text{end}}$, we measure its learning speed by:
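The learning-speed formula itself is elided in this copy of the text. One plausible instantiation, consistent with the surrounding description (an average per-epoch change of a pair's score between the initial and final epochs), is sketched below; the exact quantity tracked per pair is an assumption on our part.

```python
def learning_speed(scores):
    """Average per-epoch decrease of a pair's score (e.g., its loss)
    between the first and last recorded epochs: larger = learned faster.
    `scores` holds one value per epoch, from t_ini to t_end."""
    t_ini, t_end = 0, len(scores) - 1
    return (scores[t_ini] - scores[t_end]) / (t_end - t_ini)
```

Under this definition, a noisy pair whose loss barely moves over training yields a small value, matching the slow-learning effect shown in Figure 5.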

Figure 5 :
Figure 5: Slow Learning Effect: Box plots of the average learning speed of noisy and clean pairs (at three kinds of shortest path distances) in Cora and ACM datasets.

Figure 6 :
Figure 6: Densities of embedding distance between nodes with the same label (positive pairs at the top) and different labels (negative pairs at the bottom) for embeddings trained on ACM with four different training strategies.
This work was supported by the National Key Research and Development Program of China (2022YFC3303600), National Natural Science Foundation of China (62137002, 62293553), Innovative Research Group of the National Natural Science Foundation of China (61721002), Innovation Research Team of Ministry of Education (IRT_17R86), Natural Science Basic Research Program of Shaanxi (2023-JC-YB-293), the Youth Innovation Team of Shaanxi Universities, Minister of Education, Singapore (MOET32020-0004), the National Research Foundation, Singapore, and Cyber Security Agency of Singapore under its National Cybersecurity Research and Development Programme (Award No. NRF-NCRI_TAU_2021-0002) and the Cyber Security Agency under its National Cybersecurity R&D Programme (NCRP25-P04-TAICeN).

Figure 9 :
Figure 9: Robustness against Noisy Negative Pairs.
• Academic Coauthor Networks: DBLP 7 contains heterogeneous nodes (author, paper, conference, term) and edges, while Coauthor-CS 8 is a homogeneous academic network. In both, the target nodes are authors, linked by co-author relationships.
• Academic Reference Networks: Wiki-CS 9 and Cora 10 are reference networks, in which nodes represent articles and edges represent their citations. ACM 11 is a heterogeneous network whose target nodes, papers, are linked by authors and subjects.
• E-commerce Networks: Amazon-Computers and Amazon-Photo 12 are homogeneous information networks where nodes represent goods and edges represent co-purchase relations.
Given a graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ consisting of $|\mathcal{V}| = N$ nodes and $|\mathcal{E}| = M$ edges, we denote $\mathbf{X} \in \mathbb{R}^{N \times F} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ as the node attribute matrix, where $\mathbf{x}_i \in \mathbb{R}^F$ is the attribute information of node $v_i$.

Table 2 :
General descriptions of the GCL base models: GRACE, GCA, and HeCo.
[12,20] Academic Reference Networks: Wiki-CS, Cora, ACM; (3) E-commerce Networks: Amazon-Computers (Computers), Amazon-Photo (Photo); (4) Movie Knowledge Base: Freebase-Movie. Their statistical details and descriptions are in Appendix B.2.

4.1.3 Baselines. We compare B2-Sampling with six popular CL/GCL negative mining strategies. DCL [4], HCL [16], Ring [25], and MoCHi [8] are hard negative mining strategies from computer vision: they debias the loss (DCL), exploit hard negatives (HCL) or semi-hard negatives (Ring), or synthesize new negative points (MoCHi) to improve the quality of visual representations. ProGCL [26] aims to eliminate the bias in graph-structured data by fitting a beta mixture model (BMM). Random sampling (RS for short) is a simple baseline that uniformly samples negative pairs with a fixed probability.

4.1.4 Implementation details. B2-Sampling is implemented with PyTorch. We adopt the same evaluation metrics (shown in Table 2) and experimental settings (e.g., multiple runs and random seeds) used in GRACE, GCA, and HeCo to perform the node classification and clustering tasks. Models are trained in an unsupervised manner. The obtained embeddings are fed to a simple logistic regression classifier (for node classification) or clustered by the K-means algorithm (for node clustering). For homogeneous datasets, Wiki-CS and Cora use public splits [12,20], and Coauthor-CS, Amazon-Photo, and Amazon-Computers follow a 1:1
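The linear-evaluation protocol described above (freeze the learned embeddings, then fit a simple logistic regression probe on them) can be sketched in a dependency-free way as follows. The paper's experiments use standard library implementations; this tiny gradient-descent probe is only illustrative, and its hyper-parameters are our own choices.

```python
import math

def train_logreg(X, y, lr=0.1, epochs=200):
    """Fit a logistic regression probe on frozen embeddings X
    (list of feature lists) with binary labels y, via plain SGD."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - yi                       # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, xi):
    """Class label from the sign of the linear score."""
    return int(sum(wj * xj for wj, xj in zip(w, xi)) + b > 0)
```

The key point of the protocol is that only the probe's weights are trained: the embeddings stay fixed, so classification accuracy measures the quality of the unsupervised representation.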

Table 5 :
NMI and ARI (%) of HeCo and HeCo + CL/GCL sampling strategies on node clustering.

Table 6 :
Ablation study for B 2 -Sampling on node classification and node clustering.B1_S and B2_S are short for balanced sampling and biased sampling respectively.

Table 7 :
ACC (%±std) gains by applying biased sampling to GraphCL/MVGRL on different datasets in the graph classification task.
For a test of statistical significance of Figure 5, we adopt a two-sample t-test on noisy and clean pairs at different shortest path distances. Our alternative hypothesis is $H_1: \mu_{\text{noisy}} < \mu_{\text{clean}}$, and the null hypothesis is $H_0: \mu_{\text{noisy}} \ge \mu_{\text{clean}}$, where $\mu$ denotes the average learning speed. The results, shown in Table 8, indicate that the learning speed of noisy samples is slower than that of clean samples with statistical significance.
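The statistic underlying such a two-sample test can be computed as below; we sketch Welch's variant, which does not assume equal variances in the two groups. Whether the paper uses the pooled or the Welch variant is not stated, so that choice is an assumption.

```python
import math

def welch_t(a, b):
    """Welch's two-sample t statistic; a negative value supports the
    one-sided alternative mean(a) < mean(b)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
```

Feeding the per-pair learning speeds of noisy pairs as `a` and of clean pairs as `b` gives the statistic whose one-sided p-value (against a t distribution with Welch-Satterthwaite degrees of freedom) would populate a table like Table 8.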

Table 8 :
A two-sample t-test on noisy and clean pairs.

Table 10 :
Statistics of datasets. "-" indicates that node features are not provided and need to be generated.