Challenging the Myth of Graph Collaborative Filtering: a Reasoned and Reproducibility-driven Analysis

The success of graph neural network-based models (GNNs) has significantly advanced recommender systems by effectively modeling users and items as a bipartite, undirected graph. However, many original graph-based works adopt results from baseline papers without verifying their validity for the specific configuration under analysis. Our work addresses this issue by focusing on the replicability of results. We present code that successfully replicates results from six popular and recent graph recommendation models (NGCF, DGCF, LightGCN, SGL, UltraGCN, and GFCF) on three common benchmark datasets (Gowalla, Yelp 2018, and Amazon Book). Additionally, we compare these graph models with traditional collaborative filtering models that historically performed well in offline evaluations. Furthermore, we extend our study to two new datasets (Allrecipes and BookCrossing) that lack established setups in existing literature. As the performance on these datasets differs from the previous benchmarks, we analyze the impact of specific dataset characteristics on recommendation accuracy. By investigating the information flow from users’ neighborhoods, we aim to identify which models are influenced by intrinsic features in the dataset structure. The code to reproduce our experiments is available at: https://github.com/sisinflab/Graph-RSs-Reproducibility.


INTRODUCTION AND RELATED WORK
The world of recommender systems (RSs) is experiencing a revolutionary shift, thanks to the emergence of graph neural network-based models [9,26,58] (GNNs). These groundbreaking models are designed to represent users and items as a bipartite, undirected graph, unlocking a whole new level of high-order relationships that were previously almost unattainable. Not only do they achieve better accuracy than their predecessors, but they are also setting a new standard for modern recommender systems [20,28,47,79]. In recent years, great effort has been devoted to creating GNN-based models that address the critical issues of existing models, such as the over-smoothing phenomenon [12] and scalability issues [87]. These cutting-edge models are taking the world of recommender systems by storm and ushering in a new era of accuracy [41,47,51,59,81]. Over the past ten years, the application of neural techniques rooted in graph representation learning, such as graph convolutional networks [35] (GCNs), has introduced a fresh perspective on traditional collaborative filtering (CF) approaches. Rather than relying solely on user-item interactions for optimization [29,36,55], GCN-based methods enable the extraction of both short- and long-distance user preferences toward items [71]. By incorporating multi-hop relationships into the embeddings of users and items, these learned profiles yield more precise recommendations, as evidenced in the literature [28,47]. Nevertheless, a growing number of researchers have obtained different accuracy outcomes in independent experiments and have begun questioning the prominence of graph collaborative filtering (graph CF) [96].
The original GCN layer employs message-passing techniques to refine the node representations of users and items through the iterative aggregation of their respective multi-hop neighbor nodes. While early attempts focused on simple aggregation methods [68,87], recent solutions have advanced the field by exploring the inter-dependencies between nodes and their neighbors [71], designing simplified versions of the graph convolutional layer [14,28], and learning multiple views of the nodes [78,89], augmented via self-supervised and contrastive learning, to improve model accuracy.
To filter out noisy neighbors and uncover hidden preference patterns, a complementary research field emerged that focuses on learning importance weights through attention mechanisms, such as those employed in the graph attention network [69] (GAT). While some models aim to recognize meaningful user-item interactions at a higher level [67,70], others disentangle relations on a finer-grained scale [73,92]. The recent advancements in GCN-based techniques have opened up new avenues for more accurate and effective recommendation systems.
Reproducibility is the cutting-edge research task in which researchers replicate experimental results using the same data and methods [7,15,16,65]. In the case of graph CF, several factors contribute to the lack of reproducibility. Firstly, many graph CF studies copy previous results found in the literature for the same datasets, which makes it challenging to compare and reproduce results across different studies. Secondly, such studies do not provide the implementation of the adopted baselines, which makes it difficult to assess the effectiveness of different models. Furthermore, graph CF studies frequently do not provide complete information since they do not always share the experimental setups, such as hyper-parameter settings and training procedures. This lack of transparency makes it challenging to reproduce results and verify the validity of the findings. The lack of reproducibility in graph CF is a significant issue because it undermines the research's credibility and hinders the field's progress. To address this problem, researchers should strive to provide more detailed descriptions of their experimental setups and make their code and datasets publicly available. Additionally, the research community should work together to establish standard evaluation metrics and experimental protocols to promote reproducibility and facilitate comparison across different studies.
To this aim, this work reports on a notable reproducibility effort to re-implement and replicate the results of six state-of-the-art (both well-established and recent) papers on graph collaborative filtering, namely, NGCF [71], DGCF [73], LightGCN [28], SGL [78], UltraGCN [47], and GFCF [59]. In particular, we provide an in-depth experimental analysis of the papers, conducting the experiments from scratch on the three datasets adopted in the original papers: Gowalla [37], Yelp 2018 [28], and Amazon Book [27]. Notably, the investigation extends the previous works by incorporating state-of-the-art classical collaborative filtering baselines such as UserkNN [56], ItemkNN [57], RP$^3_\beta$ [49], and EASE$^R$ [61] to correctly position the graph CF methods in the recommender systems state-of-the-art. The study's findings reveal that RP$^3_\beta$ ranks as the second-best method on the Yelp 2018 dataset, indicating that the original papers would have needed a more comprehensive evaluation. To this end, the evaluation benchmark incorporates two additional datasets, Allrecipes [21] and BookCrossing [97], which are common in the recommendation literature but uncommon in the graph CF-specific literature. However, surprisingly, the rankings significantly differ on the Allrecipes dataset, and the mathematical formulation of the graph CF methods is not sufficient to account for these outcomes. This observation leads to further investigation to comprehend the experimental results. Examining the dataset topological characteristics shows that the overall number of users and items and the average user and item degree vary from dataset to dataset. This observation may indicate the amount of information transmitted from node to node in the computational graph. According to the mathematical background, the analysis of the results is then threefold, focusing on the impact of (i) the coldness/warmness of a user, (ii) the popularity of the enjoyed items, and (iii) the size of the user neighborhood and the coldness/warmness of the neighbors. The users are partitioned into quartiles accordingly, and the experiments are re-evaluated to obtain more fine-grained results that motivate the outcomes for all the considered datasets.
Overall, the study aims to comprehensively answer several research questions, including:
RQ1. Is the state-of-the-art (i.e., the six most important papers) of graph collaborative filtering (graph CF) replicable?
RQ2. How does the state-of-the-art of graph CF position with respect to classic CF state-of-the-art?
RQ3. How does the state-of-the-art of graph CF perform on datasets from different domains and with different topological aspects, not commonly adopted for graph CF recommendation?
RQ4. What information (or lack of it) impacts the performance of the graph CF methods across the various datasets?
The following introduces the background and the experiments to answer the outlined research questions. First, in Section 2, we present the background technologies and the reproducibility details to conduct our study. Then, in Section 3, we report the reproducibility results, whose insights are complemented by adding novel classic CF baselines (i.e., Section 4). Furthermore, an investigation of graph topology sheds light on the discrepancies of the graph CF approaches on the two newly introduced datasets (i.e., Section 5). By reinterpreting the concept of users' node degree as information flow from the multi-hop neighborhoods to the user, we unveil the behavior of the graph and classic CF.
Finally, Section 6 wraps up the main take-home messages and paves the way to novel directions for future work. Code and datasets to reproduce our paper are available here: https://github.com/sisinflab/Graph-RSs-Reproducibility.

BACKGROUND AND REPRODUCIBILITY ANALYSIS
The current section aims to provide the background on selected state-of-the-art methodologies in graph CF and their reproducibility details as presented in the original papers. First, the main aspects of graph-based models are introduced to conduct a chronological analysis of the strategies behind each algorithm (Section 2.1). Then, we assess the experimental settings as reported in the original works by focusing on the chosen baselines (Section 2.2), the datasets involved (Section 2.3), and the training-testing protocol adopted in each case (Section 2.4).

Graph collaborative filtering
In graph CF, users, items, and their interconnections are viewed as a bipartite and undirected graph. Let $\mathcal{U}$ and $\mathcal{I}$ be the sets of users and items in the recommendation system, respectively. Then, let $\mathbf{R} \in \mathbb{R}^{|\mathcal{U}| \times |\mathcal{I}|}$ be the user-item interaction matrix where, in an implicit feedback scenario, $r_{ui} = 1$ if user $u \in \mathcal{U}$ interacted with item $i \in \mathcal{I}$, 0 otherwise. We build the adjacency matrix $\mathbf{A} \in \mathbb{R}^{(|\mathcal{U}|+|\mathcal{I}|) \times (|\mathcal{U}|+|\mathcal{I}|)}$ indicating the bi-directional connections linking users and items in $\mathbf{R}$:

$$\mathbf{A} = \begin{bmatrix} \mathbf{0} & \mathbf{R} \\ \mathbf{R}^\top & \mathbf{0} \end{bmatrix}.$$

We use the set of users and items, along with the adjacency matrix, to formally define the user-item bipartite and undirected graph $\mathcal{G} = \{\mathcal{U} \cup \mathcal{I}, \mathbf{A}\}$. By associating users' and items' nodes to embeddings, the vast majority of approaches iteratively update their representations at different hop distances through the message-passing schema [9,23].
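As an illustration of the construction above, the following is a minimal NumPy sketch that assembles the adjacency matrix from a dense interaction matrix. The function name `build_adjacency` and the toy matrix are hypothetical and chosen only for this example; a real dataset would call for a sparse representation.

```python
import numpy as np

def build_adjacency(R: np.ndarray) -> np.ndarray:
    """Build the (|U|+|I|) x (|U|+|I|) adjacency matrix of the bipartite graph."""
    n_users, n_items = R.shape
    A = np.zeros((n_users + n_items, n_users + n_items), dtype=R.dtype)
    A[:n_users, n_users:] = R      # user -> item edges (upper-right block)
    A[n_users:, :n_users] = R.T    # item -> user edges (lower-left block)
    return A

# Toy implicit-feedback matrix with 3 users and 4 items (illustrative values only).
R = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 1]])
A = build_adjacency(R)
```

The two diagonal blocks stay at zero because the graph is bipartite: no user-user or item-item edge is recorded.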
For this work, we select and reproduce the results for six widely-recognized state-of-the-art approaches in graph CF, namely, NGCF [71], DGCF [73], LightGCN [28], SGL [78], UltraGCN [47], and GFCF [59] (refer to Section 3). This selection is motivated by two aspects: (i) such models are adopted as baselines in recent works from top-tier venues (see the second column in Table 1); (ii) their strategies cover a wide spectrum of techniques in graph CF. To provide a chronological overview of such techniques, in the following, we report their main aspects:
• NGCF. Neural graph collaborative filtering [71] (NGCF) is among the pioneer approaches in graph CF. Its message-passing schema works by aggregating the neighborhood information and the inter-dependencies among the ego and the neighborhood nodes (note that a normalized Laplacian adjacency matrix is used during the message-passing).
• DGCF. Disentangled graph collaborative filtering [73] (DGCF) assumes that user-item interactions can be disentangled into independent intents, where each stands for a specific aspect describing the user's preference towards the item. The model learns a set of weighted adjacency matrices refining the user-item importance related to a specific intent.
• LightGCN. Light graph convolutional network [28] (LightGCN) suggests that a more light-weight formulation of the graph convolutional layer proposed by Kipf and Welling [35] can lead to superior accuracy performance in the recommendation scenario. Specifically, the architecture removes feature transformations and non-linearities.
• SGL. Self-supervised graph learning [78] (SGL) is among the first attempts to bring the lessons learned from self-supervised [31] and contrastive [34] learning to graph CF. Built upon a LightGCN-based convolutional layer, the model learns different views of nodes by performing node/edge dropout and random walk operations on the graph topology. A self-supervised contrastive loss component is added to encourage the consistency among different views of the same node and the divergence among different nodes.
• UltraGCN. Ultra simplification of graph convolutional network [47] (UltraGCN) addresses some crucial issues in graph CF. Specifically, the authors propose a novel message-passing schema that mathematically approximates the infinite-layer propagation through a single (simplified) node update iteration. The adjacency matrix is normalized through a modified Laplacian formulation that accounts for the asymmetric weighting of connected nodes in user-user and item-item connections. Moreover, two loss components are introduced to tackle the over-smoothing effect and learn from the usually-unexplored types of node relationships such as item-item.
• GFCF. Graph filter-based collaborative filtering [59] (GFCF) questions the role of graph convolutional networks in recommendation by leveraging graph signal processing theory. By showing that several existing approaches in CF may fall into one unified framework based upon graph convolution, the authors eventually propose a closed-form algorithm that proves to be a strong baseline against other trainable and computationally-expensive (graph-based) approaches in CF. Thus, the method represents the only exception to the message-passing models presented above.
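To make the message-passing idea more concrete, the sketch below propagates embeddings over a symmetrically normalized adjacency matrix without feature transformations or non-linearities, in the spirit of the LightGCN description above. It is a simplified illustration rather than any of the reviewed implementations: the function names are hypothetical, dense matrices are used for brevity, and averaging the layer outputs uniformly is an assumption of this example.

```python
import numpy as np

def normalized_adjacency(R: np.ndarray) -> np.ndarray:
    """Return D^{-1/2} A D^{-1/2} for the user-item bipartite graph built from R."""
    n_users, n_items = R.shape
    A = np.zeros((n_users + n_items, n_users + n_items))
    A[:n_users, n_users:] = R
    A[n_users:, :n_users] = R.T
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0                       # isolated nodes keep a zero weight
    d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def light_propagation(R: np.ndarray, E0: np.ndarray, n_layers: int = 3) -> np.ndarray:
    """Linear propagation: no weight matrices, no activation functions."""
    A_hat = normalized_adjacency(R)
    layers = [E0]
    for _ in range(n_layers):
        layers.append(A_hat @ layers[-1])   # each hop aggregates neighbor embeddings
    return np.mean(layers, axis=0)          # uniform layer combination (assumption)
```

In a trainable model, `E0` would hold the learnable user and item embeddings stacked row-wise (users first, then items), and the propagated output would feed a ranking loss.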

Analysis on reported baselines
Table 1 reports on the baselines each graph-based approach was tested against in the original paper. By categorizing them into classic and graph CF we first observe that, with the only exception of UltraGCN, all graph-based recommendation systems are generally compared only against 1-2 classical CF solutions (MF [29,55]- and/or VAE [38,44]-based approaches in most cases). However, the recent literature [2,3,15,16,96] has raised several concerns about usually untested strong CF baselines, such as nearest-neighborhood approaches (e.g., UserkNN [56] and ItemkNN [57]), random-walk techniques (e.g., RP$^3_\beta$ [49]), and other autoencoder-based solutions (e.g., EASE$^R$ [61]). Differently from the classical CF baselines, we notice that most of the works compare their proposed approaches against a wide (and shared) range of graph CF solutions. This is easily explainable given the conceptual and logical similarities among the graph CF baselines and the proposed approaches. Moreover, besides a limited subset of graph CF baselines (i.e., HOP-Rec [83] and GRMF [53]), the vast majority of tested graph algorithms [14,43,64,68,87] are based upon the graph convolutional network architecture. Interestingly, we observe that only a subgroup of our selected six graph CF approaches (up to a maximum of three approaches if we consider UltraGCN) is generally compared against the proposed approach. While we could justify this point with chronological motivations (e.g., DGCF could not have been tested against SGL, UltraGCN, and GFCF), we deem this to be an important gap in the existing literature.

Table 1. Analysis of baselines used in each of the selected graph-based models, categorized into classic and graph CF. A colored tick '✓' denotes when one of the baselines is also among the selected set of graph-based approaches for our study. Columns: Families; Baselines; Models (NGCF [71], DGCF [73], LightGCN [28], SGL [78], UltraGCN [47], GFCF [59]); Used as graph CF baseline in (2021 - present).
Under the above considerations, and differently from the previous works, we compare the accuracy performance of the selected six graph CF approaches against strong CF techniques (UserkNN, ItemkNN, RP$^3_\beta$, and EASE$^R$), while providing a complete evaluation setting which involves all the selected graph methods, where they are put against one another (refer to Section 4). To our knowledge, this work is one of the first attempts [96] to fill this gap.

Analysis on reported datasets
Table 2 displays the datasets adopted to train and test the reviewed graph-based recommender systems, as reported in the original papers. Notably, we recognize a total of seven recommendation datasets spanning different domains such as social networks (i.e., Gowalla), points-of-interest (i.e., Yelp 2018), e-commerce (i.e., the Amazon product categories and Alibaba-iFashion), and movies (i.e., Movielens 1M). It is worth pointing out that when we set the '✓' for the same dataset on different models, we are stating that the authors from the original works used the exact same dataset setting, that is, the original user-item interaction data and splitting/filtering strategies. A deeper analysis shows that there exists a subset of three datasets (i.e., Gowalla [37], Yelp 2018 [28], and Amazon Book [27]) which is utilized in the majority of graph CF works. For the sake of reproducibility, we replicate the original results calculated on such datasets for the six graph CF approaches (although the SGL paper does not provide results on Gowalla). Given the limited set of shared datasets among all the approaches, we include novel, never-investigated datasets to assess if their recommendation accuracy remains consistent on other domains and/or topologies.

Analysis on experimental comparison
As a final analyzed dimension, we discuss the protocol for the experimental comparison between the baselines and the proposed approach in each selected work. Being the pioneer model in the domain, the authors of NGCF train all proposed baselines from scratch. In the DGCF paper, the authors directly report the results of some baselines which are shared with NGCF and train the other baselines from scratch. In a similar manner, the authors of LightGCN, SGL, and UltraGCN copy the result values from the original papers, while the remaining models are trained from scratch. Finally, the authors of GFCF reproduce the results from LightGCN as the baselines are exactly the same.
Regarding the copy-paste of baseline results, authors often justify this approach by stating that they used the same experimental settings (such as dataset splitting/filtering) as their (graph) CF baselines. Additionally, it is worth noting that some authors are shared among the studies being investigated.
To remove all doubts, and differently from the mentioned works, we re-implement all algorithms by carefully following their original codes, and train/evaluate them through Elliot [1,45]. Our goal is to provide a fair and repeatable experimental environment for the selected graph CF approaches, by using the hyper-parameter settings as indicated in each paper and/or shared online code to assess to what extent we can reproduce the original results. The reader may refer to Section 3.1 for a full description of our settings.

REPLICATION OF PRIOR RESULTS (RQ1)
This section focuses on how the replication of the experiments from the six state-of-the-art papers on graph CF stated before has been set up. It starts by defining the evaluation protocol applied to compare these methods in their respective works (Section 3.1). After that, we present our replication results (Section 3.2).

Settings
The experimental setup adopted in the first part of this study is designed primarily to replicate the results of the models included in this analysis [28,31,43,47,59,71]. As mentioned earlier, we use the three most common datasets in this scenario to show the results of our replicability study. Specifically, we use Gowalla, Yelp 2018, and Amazon Book as provided in the public repositories of NGCF and LightGCN. All the proposed models (except SGL) use the same datasets with the same filtering/splitting. The authors state that they adopt a random split based on the 80/20 hold-out (i.e., for each user, 80% of the interactions are used to create the training set, while the remaining 20% constitute the test set). Thus, each user-item interaction is treated as positive; all others are considered unfavorable. In addition, the authors leave 10% of the training set as a validation set for tuning the hyper-parameters. However, this portion of the dataset is not indicated in the papers' extra material.
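The per-user hold-out described above can be sketched as follows. This is an illustrative example, not the splitting code released with the original papers: the function name, the fixed seed, and the rounding policy for small user profiles are assumptions made here.

```python
import random
from collections import defaultdict

def per_user_holdout(interactions, test_ratio=0.2, val_ratio=0.1, seed=42):
    """Randomly split each user's interactions into train, validation, and test.

    `interactions` is an iterable of (user, item) pairs. The validation set is
    carved out of the training portion, as in the protocol described above.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)

    train, val, test = [], [], []
    for user, items in by_user.items():
        items = items[:]                  # copy before shuffling in place
        rng.shuffle(items)
        n_test = int(round(len(items) * test_ratio))
        test_items, rest = items[:n_test], items[n_test:]
        n_val = int(round(len(rest) * val_ratio))
        val_items, train_items = rest[:n_val], rest[n_val:]
        test += [(user, i) for i in test_items]
        val += [(user, i) for i in val_items]
        train += [(user, i) for i in train_items]
    return train, val, test
```

With this rounding, a user with ten interactions contributes two test items, one validation item, and seven training items.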
The adopted evaluation protocol, all-unrated-item [60], is shared across all the analyzed papers: for each user, we retain as candidates all the items with which the user has not interacted in the training set. The quality of recommendations is assessed by the Recall and the nDCG on the top-20 recommendation lists for each user. Each work performs its own tuning of the hyper-parameters (the Recall@20 is used as validation metric) and reports the explored hyper-parameter search spaces. Moreover, the best configurations on each dataset are usually provided in the respective papers and/or repositories. Thus, we set the hyper-parameters on each model-dataset pair as the best ones declared by the authors.
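For illustration, the following sketch computes Recall@k and nDCG@k for a single user under the all-unrated-item protocol. It is not the Elliot implementation: the function names are hypothetical, relevance is assumed to be binary, and Recall is normalized here by the number of test items of the user (other frameworks may normalize differently).

```python
import math

def rank_unrated(scores, train_items, k=20):
    """Rank every item the user has not interacted with in training; keep the top k."""
    candidates = [i for i in range(len(scores)) if i not in train_items]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return candidates[:k]

def recall_at_k(ranked, relevant, k=20):
    """Fraction of the user's test items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=20):
    """Binary-relevance nDCG: DCG of the list divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Per-user values are then averaged over all test users to obtain the dataset-level metric.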
The careful reader would notice that the results reported in Table 3 for NGCF (see the 'Original' column) differ from those shown in the in-proceedings version [71]. The reason is that the authors modified and recalculated the results obtained for the model and baselines due to errors in the pre-processing of the Yelp 2018 dataset and in the calculation of the nDCG. Thus, for the sake of fair reproducibility, and only in this case, we consider the results reported in the arXiv (most updated) version of the paper [72].

Results
Table 3 compares the results reported by the six papers under study with those obtained in our implementation (using the tuned parameters specified in each work, as explained before). The new experiments closely approximate the original ones, with the most significant performance shift being in the order of $10^{-3}$. No noticeable differences emerge across the metrics, datasets, or algorithms used.
More specifically, on a per-algorithm basis, we observe that the performance of GFCF is the best replicated one. This might be due to this method being the only one with a closed-form algorithm; hence, no perturbations from random initializations are expected. The rest of the approaches evidence a similar (high) level of replication, although the shift for NGCF and DGCF rarely achieves the order of $10^{-4}$ for the two metrics in all the datasets. In any case, considering the random initializations and stochastic learning processes [33], our replication of these approaches could be considered a success.
No significant differences were found among the three datasets. SGL was not originally reported for Gowalla, so it was omitted from the table as we compared reported results with our implementations using the same hyper-parameters.
In summary, these results confirm that, as discussed before, even though the authors of these papers re-used the performance values from other papers just by copy-pasting them, this did not hurt the reproducibility of these approaches. As previously stated, our assumption for this behavior (which is not a safe practice in general [7]) is that the experiments of the original papers were all comparable because some authors are shared across contributions, which should guarantee that the settings and implementations of the algorithms are the same.

BENCHMARKING GRAPH CF APPROACHES USING ALTERNATIVE BASELINES (RQ2)
In line with recent reproducibility works (such as [15]) that evidenced certain problems regarding the choice and optimization of the baselines used for comparison, in this section we assess how graph CF approaches perform relative to classical CF baselines. Here, we first specify how the experiments are prepared (Section 4.1), and then show the corresponding results (Section 4.2).

Settings
We expand our investigation by examining four classic CF models to enhance the replicability analysis. Specifically, we select four models whose accuracy performance has rarely been compared with the graph-based CF approaches replicated in this study. The decision to include UserkNN, ItemkNN, RP$^3_\beta$, and EASE$^R$ is purposeful. We refer to [16] and (more recently) [2], which demonstrated the competitiveness of these baseline models compared to more recent approaches when a shared benchmark for comparison is employed among all involved methodologies. Furthermore, we also consider two unpersonalized approaches (i.e., MostPop and Random). The two models act as benchmarks to assess the effectiveness of personalized methods compared to a user-agnostic solution.
For a fair comparison, the four classic CF models are tuned following the exact same training/test splitting reported in Section 3 and the same experimental protocol. The only difference is that (for obvious reasons) we need to explore the hyper-parameters of each classic CF model introduced in the comparison. Similarly to what the authors do in the original graph CF works, we retain 10% of the training set to generate a validation set, but decide to explore 20 distinct configurations for each model through the state-of-the-art Tree-structured Parzen Estimator (TPE) hyper-parameter search [8]. For every model, the final results correspond to the accuracy measured on the test set with the hyper-parameter configuration providing the best Recall@20 on the validation set.

Results
Table 4 shows the results of the graph CF models (as previously replicated in Table 3) with the additional baselines. First, it is worth noting that, even though none of these baselines gets the best results in any of the three datasets considered, they achieve the second-best performance in Yelp 2018 (refer to RP$^3_\beta$ with nDCG).
Second, none of the models in the reference family achieve competitive performance. While this is expected for the Random algorithm, it is an indication that either none of these datasets evidence a strong popularity bias or (considering the way they were processed) such bias was removed.
Third, some of the classic CF approaches (such as RP$^3_\beta$ and UserkNN in Gowalla, and RP$^3_\beta$ and EASE$^R$ in Yelp 2018) demonstrate better performance than some of the state-of-the-art graph CF methods; in particular, they perform better than NGCF, DGCF, and LightGCN. This result is in line with recent experimental comparisons [3,6,15] where these baselines outperform other methods based on matrix factorization or neural networks. Moreover, to some extent, the fact that some graph CF methods are outperformed should not be surprising, since, as shown in Table 1, none of these baselines were included in the original papers where the graph CF approaches were proposed.

EXTENDING THE EXPERIMENTAL COMPARISON TO NEW DATASETS (RQ3 -RQ4)
This section aims to provide a full picture from an experimental point of view on two new datasets: Allrecipes and BookCrossing. First, Section 5.1 introduces the experimental settings followed to obtain the results presented in Section 5.2. Then, Section 5.3 discusses these results in more detail, aiming to explain the insights derived from them.

Settings
Motivated by the previous results, we further enrich our analysis by investigating the behavior of all tested models on two datasets that have never been considered in any previous study involving graph-based approaches for recommendation, namely, Allrecipes [21] and BookCrossing [97].Table 5 shows some statistics of these datasets, where we purposely decide to report both the benchmarking datasets for graph CF (i.e., Gowalla, Yelp 2018, and Amazon Book) and the newly introduced ones.On the one hand, Allrecipes exhibits quite discordant characteristics compared to the other the lowest ratio between the number of users to items across all datasets, and a much higher density than all the others.In summary, the newly introduced datasets serve as a foundation to assess the performance in different (and never-explored) topological settings for graph CF baselines.
To adhere to the experimental setup presented so far, we adopt the all-unrated-item evaluation protocol, and split the two datasets with a random hold-out solution, ensuring an 80:20 proportion. Differently from the replicability study, we now perform a TPE-based hyper-parameter tuning for all models, as the best hyper-parameters for each graph-based approach are not known in advance; for this, we (again) use the 10% portion of the training set as validation set. We run 20 different settings within the search space provided in the original papers. The models' best configurations are selected through the Recall@20 on the validation set.

Results
Table 6 provides a full comparison between unpersonalized methods, classical CF approaches, and the graph CF methods under analysis. In line with the experiments in the previous section, classic CF methods (in particular, RP$^3_\beta$ and EASE$^R$) are very competitive compared to graph CF approaches, even in novel datasets like the ones included in this analysis.
Specifically, the results in BookCrossing are dominated by these baselines, while in Allrecipes, the MostPop stands out, evidencing a strong popularity bias.
These results highlight that, among the graph CF techniques, those that maintain their performance in novel domains are UltraGCN (best one in Allrecipes and third among its type in BookCrossing) and LightGCN (second best in both domains). While the nature of these two datasets is clearly different (as shown in Table 5, Allrecipes is smaller and it contains more users than items, instead of the other way around as in BookCrossing), the relative performance of the best graph CF methods is competitive. However, for some of them, the performance drop is significant, reaching an accuracy lower than that of any other classic CF baseline.
To shed light on some of these behaviors, the next section discusses in more detail how the ranking of the graph CF methods changes depending on the dataset, and hypothesizes which dataset characteristics may be tied to these effects.

Discussion
To further validate and explain the reasons behind the results reported in Table 6, in the following we perform a twofold analysis. First, we rank all the selected graph-based recommendation models on all the tested datasets to assess their relative improvement across all settings and provide another perspective on the results from Table 6. Then, we propose a more nuanced study on the measured accuracy performance by investigating its (possible) dependence on the specific dataset characteristics, namely, the node degree as viewed at multiple hops.

Graph-based models' ranking.
In Table 7, we rank the six graph CF recommender systems under analysis according to the calculated Recall@20 and nDCG@20, for both the original datasets (i.e., Gowalla, Yelp 2018, and Amazon Book) and the novel datasets we introduced (i.e., Allrecipes and BookCrossing). Moreover, we also indicate the relative improvement of each model with respect to the worst-performing algorithm on that dataset.
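The ranking procedure can be sketched in a few lines. The example below assumes that the relative improvement is expressed as a percentage gain over the worst-performing model; this formula, the function name, and the placeholder model names and scores are assumptions for illustration and are not taken from Table 7.

```python
def rank_with_relative_improvement(scores):
    """Sort models by metric value and attach the % gain over the worst one."""
    worst = min(scores.values())
    if worst <= 0:
        raise ValueError("relative improvement is undefined for a non-positive baseline")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(model, value, 100.0 * (value - worst) / worst)
            for model, value in ranked]

# Placeholder metric values (not results from the paper).
example = {"model_a": 0.10, "model_b": 0.15, "model_c": 0.12}
ranking = rank_with_relative_improvement(example)
```

By construction, the worst-performing model sits at the bottom of the ranking with a relative improvement of zero.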
The trend on the three original datasets is quite steady, with UltraGCN and GFCF being the two best-performing approaches in almost all cases, and the remaining graph techniques ranked in descending chronological order (confirming the findings from the recent literature). In terms of relative improvements, we observe large performance differences mainly on the Amazon Book setting.
By focusing on the two additional datasets (i.e., Allrecipes and BookCrossing), the rankings corroborate some of the previous outcomes, but also introduce novel and unexpected considerations. While UltraGCN seems to preserve its role of leading approach in the two scenarios (in BookCrossing it is ranked third, but with a minimal margin to the second one), we notice that GFCF's performance fluctuates considerably, as it even stands in the last position on Allrecipes with a large performance difference to the other models (the same goes for DGCF). Noticeably, LightGCN climbs towards the top of the ranking in both settings, indicating that a careful hyper-parameter tuning could be beneficial to outperform most of the other approaches, even the ones that should surpass it according to the literature (such as SGL). As a final remark, NGCF's poor performance is again confirmed in such different dataset settings.

Analysis on the node degree.
As already observed in Table 5, the average node degree of users and items represents one of the main aspects distinguishing each dataset from the others. For this reason, we decide to reason about its possible influence on the models' performance. In this respect, instead of limiting our analysis to the sole definition of node degree (i.e., the number of recorded interactions for each user and item), and given the ability of graph-based approaches to distill the collaborative signal by stacking multiple layers [71], we propose a novel investigation which reinterprets the node degree as the information flow from neighbor nodes to the user nodes after multiple hops. Note that we only consider users as the ending nodes of such a flow because we are interested in assessing how the recommendation accuracy measures (which are generally calculated user-wise) may be influenced by this aspect.
Table 7. Graph-based recommender systems, ranked according to their Recall@20 and nDCG@20 on all the tested datasets. For each model, we also report its relative improvement with respect to the worst-performing approach on the same dataset (in green).
*SGL is not classifiable on the Gowalla dataset as results were not calculated in the original paper [78].
Before diving into the results and discussion, we provide some useful intuitions and formulations which may help understand our analysis. With reference to Figure 1, we introduce the definition of information flow at one, two, and three hops. We decide to limit our focus to the first three explored hops because (i) graph-based recommender systems built upon the message-passing schema usually tend not to iterate beyond the third aggregation layer, and (ii) the investigation of more than three hops would not be meaningful from a recommendation perspective. As a matter of fact, we interpret each of the three hops as follows:
• at one hop (Figure 1a), users receive the information coming from the items they interacted with; in other words, this is an indication of the activeness of users on the platform;
• at two hops (Figure 1b), users receive the information of the other users co-interacting with the same items; in other words, this is an indication of the influence of items' popularity on users;
• at three hops (Figure 1c), users receive the information coming from the items interacted with by the other users involved in co-interactions; i.e., this is an indication of the influence of co-interacting users' activeness on users.
Let us formalize such definitions. The information received by users at one, two, and three hops is calculated as a vector collecting the information that all users receive from the nodes in their $h$-hop neighborhood. In this formulation, $\mathbf{1}_{\mathcal{U}} \in \mathbb{R}^{1 \times |\mathcal{U}|}$ and $\mathbf{1}_{\mathcal{I}} \in \mathbb{R}^{|\mathcal{I}| \times 1}$ are row and column vectors with 1 repeated $|\mathcal{U}|$ and $|\mathcal{I}|$ times, respectively, while $\odot$ is the Hadamard product performed in broadcast.
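To make the three-hop interpretation concrete, the sketch below computes one plausible version of these quantities with NumPy. It rests on stated assumptions rather than on the paper's exact equation: the binary user-item interaction matrix is named `R` here (a hypothetical name), and the information at each hop is taken to be the number of walks of that length ending at each user, so that the 1-hop value is the user degree, the 2-hop value sums the popularity of the items a user interacted with, and the 3-hop value sums the activeness of the co-interacting users.

```python
import numpy as np


def hop_information(R: np.ndarray) -> dict:
    """Return the information reaching each user at 1, 2, and 3 hops.

    R is a binary |U| x |I| user-item interaction matrix. The formulas
    below count walks on the bipartite graph; this is an ASSUMED reading
    of the textual definitions, not necessarily the paper's exact equation.
    """
    R = np.asarray(R, dtype=float)
    ones_items = np.ones(R.shape[1])
    ones_users = np.ones(R.shape[0])

    user_degree = R @ ones_items     # 1 hop: items each user interacted with
    item_degree = R.T @ ones_users   # popularity of each item
    hop1 = user_degree
    hop2 = R @ item_degree           # 2 hops: users reached through shared items
    hop3 = R @ (R.T @ user_degree)   # 3 hops: items reached through those users
    return {1: hop1, 2: hop2, 3: hop3}


# Toy graph with 3 users and 3 items (illustrative only).
R_toy = np.array([[1, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1]])
info = hop_information(R_toy)
print(info[1], info[2], info[3])
```

Note that walk counts include paths returning to the ego user; whether the original formulation excludes them cannot be determined from the text above.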
In light of the above, the study assesses the accuracy performance of graph-based recommender systems on user groups considering the information received from the one-, two-, and three-hop neighborhood. Following other analyses in the literature, we decide to split users into quartiles according to the information values (i.e., the $h$-hop information vector defined above). Thus, we consider four groups: (i) users whose values are below the 25th percentile of the distribution, (ii) users whose values are between the 25th and the 50th percentile, (iii) users whose values are between the 50th and the 75th percentile, and (iv) users whose values are above the 75th percentile.
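The quartile split can be sketched as below. This is a minimal example under assumptions: the function name `assign_quartiles` is hypothetical, and users whose value equals a threshold are placed in the lower group, a boundary choice the text does not specify.

```python
import numpy as np


def assign_quartiles(values: np.ndarray) -> np.ndarray:
    """Map each user to a quartile group 0..3 based on its information value.

    Values equal to a threshold fall into the lower group (an assumption).
    """
    values = np.asarray(values, dtype=float)
    # 25th, 50th, and 75th percentiles of the information distribution.
    thresholds = np.quantile(values, [0.25, 0.50, 0.75])
    return np.digitize(values, thresholds, right=True)


# Toy information values for 8 users (illustrative only).
toy_values = np.array([1, 2, 3, 4, 5, 6, 7, 8])
groups = assign_quartiles(toy_values)
print(groups)
```

With heavily tied values (common for node degrees), the four groups may not contain the same number of users.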
Figure 2 displays the percentage variation in accuracy performance (measured by nDCG) across quartiles relative to the average value reported in Table 6. The figure illustrates how the quality of recommendation performance fluctuates among different clusters of users. For example, a method indicating a 50% improvement in the fourth quartile would suggest that users in this cluster, typically more active (1-hop) or also interested in popular items (2-hop), receive more accurate recommendations with respect to the average user. This observation implies that a non-discriminatory recommendation system should produce no variation across quartiles, with values overlapping the 0% dashed line.
The second necessary preliminary to understand the outcome of the experiments is the interpretation of the quartiles for the different hops. In the 1-hop, the fourth quartile pertains to warm users interacting most with the platform, while the first quartile represents cold users interacting less frequently. In the 2-hop, high values in the fourth quartile indicate active users who enjoy popular items, resulting in dense subgraphs. The first quartile, in contrast, consists of less active users interacting with niche items in less dense subgraphs. The 3-hop, which includes user neighbors, generates the highest values when active users interact with popular items enjoyed by warm users (i.e., their neighbors). However, it is essential to note that the plots offer no insight into overall accuracy (which is reported in Table 6).
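The quantity described above can be sketched as the percentage difference between each quartile's mean nDCG and the mean over all users. The snippet is an illustration under assumptions: the function name `quartile_variation`, the per-user nDCG values, and the group labels are hypothetical placeholders, not data from the experiments.

```python
import numpy as np


def quartile_variation(ndcg_per_user, groups, n_groups=4):
    """Percentage variation of each group's mean nDCG w.r.t. the overall mean."""
    ndcg_per_user = np.asarray(ndcg_per_user, dtype=float)
    groups = np.asarray(groups)
    overall = ndcg_per_user.mean()
    return np.array([
        100.0 * (ndcg_per_user[groups == g].mean() - overall) / overall
        for g in range(n_groups)
    ])


# Hypothetical per-user nDCG values and quartile labels (NOT from the paper).
toy_ndcg = np.array([0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4])
toy_groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])
variation = quartile_variation(toy_ndcg, toy_groups)
print(variation)
```

A model treating all user groups equally would return values close to zero for every quartile, which corresponds to the 0% dashed line.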
When considering the recommendation performance according to the corresponding cluster (depicted in Figure 2), it is crucial to note that none of the models demonstrates ideal recommendation behavior. Instead, these systems tend to favor warm users or densely interconnected subgraphs located in the fourth quartile. Despite this trend, the 1-hop plots for graph CF and classic CF methods on the Allrecipes and BookCrossing graphs demonstrate minimal disparities between different recommendation approaches. Even though they all favor the fourth quartile over the first one, the coldness/warmness of a user only marginally impacts how much the method is biased toward these types of users.
The lone exception to this trend is GFCF, which penalizes the first three quartiles even more heavily (varying across the three hops from approximately -45% to +115%, and thus exceeding the plots' upper bound). As such, this system only provides satisfactory recommendation performance for users in the fourth quartile.
Regarding the 2-hop, there are several interesting insights to be gained. Firstly, the recommendation methods exhibit a higher overall slope, favoring the users who enjoyed popular items over the cold users who enjoyed niche items. While this may seem like an obvious observation, the plot confirms that user coldness/warmness alone is not a sufficient indicator of high-quality recommendations. Instead, the 2-hop reveals that combining user coldness/warmness and item popularity is useful for identifying such users. A second noteworthy aspect is that the Allrecipes dataset highlights three distinct behaviors among the graph CF methods. UltraGCN, DGCF, and LightGCN exhibit similar performance and display less discriminatory behavior across quartiles. It is interesting to note that these models also perform best overall (see Table 6). On the other hand, SGL and NGCF show a higher slope that is comparable to classic CF methods; their corresponding performance in Table 6 is also similar. A third observation concerns GFCF, which performs poorly across all quartiles except for the fourth. Its behavior is even more accentuated than in the 1-hop analysis. Additionally, NGCF, SGL, and GFCF are graph CF algorithms performing differently according to user warmness and item popularity. Meanwhile, all algorithms in BookCrossing, and the classic CF methods in Allrecipes, exhibit a similar distribution over the quartiles across methods.
Finally, in the 3-hop, for the BookCrossing dataset, the information pertaining to neighbors does not contribute significantly to the results, as indicated by the similarity between the 2- and 3-hop plots. Meanwhile, in Allrecipes, the best models (UltraGCN, DGCF, and LightGCN) exhibit more consistency in performance across all quartiles, as demonstrated by a more even distribution of results (less variation across the quartiles). However, this pattern is not evident in NGCF, SGL, and GFCF, which exhibit a more disparate range of results across the quartiles.

CONCLUSION AND FUTURE WORK
This study replicates the results of six graph CF methods, namely NGCF, DGCF, LightGCN, SGL, UltraGCN, and GFCF, and expands the research to include state-of-the-art recommendation strategies like UserkNN, ItemkNN, RP$^3\beta$, and EASE$^R$. The observed high rankings of the latter ones highlight the need for more comprehensive evaluations.
After the initial study on the standard Gowalla, Yelp 2018, and Amazon Book datasets, experiments are extended to two additional datasets, Allrecipes and BookCrossing, which reveal substantial ranking variations compared to the initial datasets. Thus, the study introduces and analyzes the information flow in the graph and discovers that 2-hop information (combining user activeness and item popularity) is a valid indicator of CF behavior and could explain the recommendation performance. The experimental results call for further investigations into the diversity and fairness of the considered methods and whether graph methods effectively mitigate popularity bias.

Fig. 1. A toy user-item graph where the ego user node (highlighted) receives the information flow from the (a) 1-, (b) 2-, and (c) 3-hop neighbor nodes (highlighted). Arrows' direction is a visual representation of the information flow.

Fig. 2. Percentage variation between the nDCG on user quartiles and the average nDCG value across all users (indicated as the dashed line), for each model-dataset setting. Rows refer to user quartiles when considering (a) 1-, (b) 2-, and (c) 3-hop.

Table 2. Analysis of the datasets adopted in each graph-based approach.

Table 3. Results of our replicability study on Gowalla, Yelp 2018, and Amazon Book for the selected state-of-the-art graph-based recommender systems. We calculate the performance shift between our conducted experiments and the original ones (as reported in their papers). Note that models have been sorted in chronological order.

Table 4. Graph-based CF solutions tested against unpersonalized (i.e., reference) and classical CF approaches on Gowalla, Yelp 2018, and Amazon Book. While results for the graph-based approaches have been directly reported from our reproducibility study (see above), classical CF recommender systems have been fine-tuned on the two datasets to find their best configurations. Boldface and underline refer to best and second-to-best values, respectively.
Results for EASE$^R$ on Amazon Book are taken from the BARS Benchmark [96].

Table 5. Statistics calculated on the training sets of Gowalla, Yelp 2018, Amazon Book, Allrecipes, and BookCrossing. We indicate the number of user-item interactions through 'Edges', while 'Avg. Deg. (U)' and 'Avg. Deg. (I)' refer to users' and items' average node degree (i.e., average interaction number).
Although it has a comparable density, users are more numerous than items, with much lower average user and item node degrees compared to the other standard graph CF datasets. On the other hand, BookCrossing displays

Table 6. Graph-based CF solutions tested against unpersonalized (i.e., reference) and classical CF approaches on Allrecipes and BookCrossing. Boldface and underline refer to best and second-to-best values, respectively.