Counterfactual Graph Augmentation for Consumer Unfairness Mitigation in Recommender Systems

In recommendation literature, explainability and fairness are becoming two prominent perspectives to consider. However, prior works have mostly addressed them separately, for instance by explaining to consumers why a certain item was recommended or mitigating disparate impacts in recommendation utility. None of them has leveraged explainability techniques to inform unfairness mitigation. In this paper, we propose an approach that relies on counterfactual explanations to augment the set of user-item interactions, such that using them while inferring recommendations leads to fairer outcomes. Modeling user-item interactions as a bipartite graph, our approach augments the latter by identifying new user-item edges that not only can explain the original unfairness by design, but can also mitigate it. Experiments on two public data sets show that our approach effectively leads to a better trade-off between fairness and recommendation utility compared with state-of-the-art mitigation procedures. We further analyze the characteristics of added edges to highlight key unfairness patterns. Source code available at https://github.com/jackmedda/RS-BGExplainer/tree/cikm2023.


INTRODUCTION
Current research in recommender systems is increasingly focusing on beyond-utility perspectives, such as explainability [37] and fairness [33], also in response to the recently issued regulations [14]. However, such perspectives are usually considered separately. For instance, research into explainability has focused on merely justifying why a certain item has been included within the recommended list [10,23], often without inspecting whether and why adding that item might lead to disparate impacts on demographic groups. Conversely, existing methods to mitigate unfairness have often relied on mathematical formulations of fairness principles, but have rarely been informed by explanatory analyses of such unfairness [7,13,15,24]. Concerted efforts towards explaining unfairness in recommendation have been recently made [11,16]. Unfortunately, none of them has led to a mitigation procedure that leverages the identified explanations to mitigate the measured unfairness. A first attempt to inform a mitigation procedure through explanation techniques was proposed by [12]. However, their method operates on Graph Neural Networks (GNNs) to detect the graph nodes affecting unfairness in classification tasks, limiting its adoption for networks of user-item interactions and for the recommendation task in general.
In this paper, we propose an approach that augments a user-item interaction graph to counteract consumer unfairness [5] across demographic groups in recommendation. Following works that used counterfactual techniques in GNNs [10,23,26,36], our framework aims to modify the top-k lists generated by a GNN-based recommender system by adding edges to the graph used in the model inference step, such that the altered recommended lists are fairer across demographic groups of consumers. The augmentation mechanism is guided by a two-term loss function that selects the minimum set of edges to solve the targeted task. Specifically, we assume that the actions of the users in a demographic group led the model to advantage them. Thus, we hypothesize a counterfactual world where the disadvantaged users can benefit from new edges to improve their recommendation utility. If our approach accomplishes this task, the additional edges represent an explanation (i) of the fairness-related knowledge missing from the non-augmented graph, and (ii) of the underlying mitigation process of our method. On two public real-world data sets, our method demonstrates greater reliability than state-of-the-art (SOTA) techniques in mitigating consumer unfairness. We also describe the original unfairness by analyzing the user and item nodes involved in the added edges.

RELATED WORK
Despite the abundance of consumer unfairness mitigation procedures in recommendation [5,7,13,15,24,25,34,35], the aspects that lead such techniques to successfully improve fairness still remain nebulous. It is also uncertain whether unfairness explainability methods in recommendation [11,16] could be leveraged to mitigate, and not only explain, the issue. In the GNN literature, emphasis was put on improving explainability [3,9,17,18] and fairness [1,27,31] for several downstream tasks. Counterfactual methods that modify (e.g., by perturbation) the graph topology were adopted in [10,23,26,36], but only a few of them were studied for recommendation [10,23], and not necessarily considering unfairness issues. The information provided by explanation methods on graphs was leveraged for additional tasks in [12,19]. Although fairness was contemplated in [12], such works were devised for classification purposes on monopartite graphs.

METHODOLOGY

Problem Formulation
Recommendation Task. The goal is to predict whether, or the level of interest with which, a user u ∈ U may interact with an unseen item i ∈ I. The interactions between users and items can be represented by a bipartite graph G = (U, I, E), where U ∪ I (n = |U| + |I|) is the set of nodes and E is the set of edges connecting such nodes. Let A be an n × n adjacency matrix representing G; missing links can be predicted by any GNN, defined as f(A, W) → R ∈ R^(|U|×|I|), where R_{u,i} represents the linking probability between u and i, and f is parameterized by the weight matrix W. A list of the k items with the highest probability in R is recommended to each user u.
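As a concrete illustration of this formulation, the bipartite interaction graph can be materialized as an n × n adjacency matrix. The snippet below is a minimal sketch with our own helper name (not the paper's code), assuming 0-indexed integer user and item ids:

```python
import numpy as np

# Build the n x n adjacency matrix A of the bipartite graph G = (U, I, E),
# with n = |U| + |I|: A[u, |U| + i] = 1 iff user u interacted with item i.
def bipartite_adjacency(num_users, num_items, interactions):
    n = num_users + num_items
    A = np.zeros((n, n), dtype=np.int8)
    for u, i in interactions:
        A[u, num_users + i] = 1   # user -> item edge
        A[num_users + i, u] = 1   # symmetric entry: item -> user
    return A

# 3 users, 2 items, three observed interactions
A = bipartite_adjacency(3, 2, [(0, 0), (1, 1), (2, 0)])
```

A GNN f(A, W) would then score the missing user-item entries of A to produce the linking-probability matrix R.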

Mitigation Task. Our goal is to make f produce altered yet fairer recommendations. To this end, we leverage counterfactual reasoning techniques [23,26] to generate a minimally augmented version of A, i.e., Ã, such that the utility estimates across consumers' groups are not systematically different when f uses Ã instead of A during inference. We base our fairness notion on demographic parity, emphasized in top-k recommendation by prior work [6,24,35]. According to it, we aim to minimize the following loss function:

L = L_fair + λ L_dist    (1)

where L_fair quantifies fairness, by using the operationalized demographic parity function, and L_dist controls the distance between A and Ã.

Augmentation Mechanism.
Similarly to [26], we developed a mechanism tailored for bipartite graphs to modify their topology.
[26] uses a matrix P to perturb A, i.e., Ã = P ⊙ A, while our method reduces the memory usage by augmenting E with a predefined set of m candidate edges, based on a binary vector p̃ ∈ {0,1}^m. We generate Ã by augmenting the missing entries in A with the entries in p̃ through a function h, i.e., Ã = h(A, p̃). p̃ is derived from a real-valued vector p, as done in [26,29], by applying a sigmoid transformation before rounding values ≥ 0.5 to 1 and values < 0.5 to 0. We initialize p_j = −5, ∀j ∈ [0, m), such that p̃ ≈ 0 after the sigmoid transformation and it is guaranteed that Ã = A.
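The binarization of p described above can be sketched as follows (a toy illustration with our own function name; the actual implementation follows [26,29]):

```python
import numpy as np

# Derive the binary augmentation vector p~ from the real-valued vector p:
# sigmoid transformation, then rounding (>= 0.5 -> 1, < 0.5 -> 0).
def binarize(p):
    s = 1.0 / (1.0 + np.exp(-p))       # sigmoid transformation
    return (s >= 0.5).astype(np.int8)  # round to {0, 1}

m = 4
p = np.full(m, -5.0)            # initialization p_j = -5
assert binarize(p).sum() == 0   # all candidate edges off: A~ == A at the start

p[2] = 3.0                      # suppose optimization pushed one entry up
print(binarize(p))              # -> [0 0 1 0]
```

With the −5 initialization, sigmoid(p_j) ≈ 0.0067 for every entry, so no edge is added until the optimization raises an entry of p above zero.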

Augmented Graph Generation.
The core process of the augmented graph generation is carried out by an extended version of f, denoted as f̃(Ã, W; p) → R̃, that shares the same implementation as f, but where W remains constant as an additional input. Differently from f, f̃ (i) leverages the vector p as a parameter to perform the augmentation mechanism, resulting in Ã, (ii) retrieves a matrix R̃ with linking probabilities altered by the usage of Ã during inference, and (iii) updates p according to (1). In other words, f̃ performs an iterative process to augment the graph until a fairness requirement estimated on R̃ is satisfied. The final augmented matrix Ã represents a distorted version of A in a counterfactual world. If Ã makes f produce recommendations with fairer estimates on the perturbation set, Ã provides a counterfactual explanation of the prior unfairness on the same set, similarly to [16]. We expect that adopting Ã could also mitigate the unfairness on the evaluation set.
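The iterative process can be sketched as below. This is a simplified stand-in, not the authors' implementation: score_fn plays the role of inference with fixed weights W, fairness_gap plays the role of the fairness estimate on R̃, and the gradient step on (1) is replaced by a placeholder update of p.

```python
import numpy as np

# Iteratively augment A with candidate edges until a fairness requirement
# estimated on the altered linking probabilities is satisfied.
def augment(A, candidate_edges, score_fn, fairness_gap,
            steps=50, lr=0.5, tol=1e-3):
    p = np.full(len(candidate_edges), -5.0)      # start with no added edges
    A_tilde = A.copy()
    for _ in range(steps):
        mask = 1.0 / (1.0 + np.exp(-p)) >= 0.5   # binarized augmentation vector
        A_tilde = A.copy()
        for on, (u, i) in zip(mask, candidate_edges):
            if on:
                A_tilde[u, i] = 1                # h: fill missing entries of A
        R = score_fn(A_tilde)                    # inference with fixed weights
        if fairness_gap(R) < tol:                # fairness requirement satisfied
            break
        p += lr   # placeholder for the gradient step on the loss in (1)
    return A_tilde
```

If the fairness gap is already below the tolerance, the loop exits immediately and Ã = A; otherwise entries of p grow until their edges switch on.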

Loss Function Optimization. Let G be the set of demographic groups. We operationalize L_fair (see (1)) as in recent works [5,35]:

L_fair = | S(A_{G_1}, R̃_{G_1}) − S(A_{G_2}, R̃_{G_2}) |    (2)

where A_{G_g} and R̃_{G_g} denote the adjacency and linking-probability sub-matrices related to the users in the g-th group, and S is a recommendation utility metric. We selected the Normalized Discounted Cumulative Gain (NDCG) as the latter in the evaluation phase but, due to its non-differentiability, we adopted an approximated version [28,35], i.e., S̃, to optimize L_fair. We denote the data subset from which the ground-truth labels are taken to measure S̃ during the optimization process as the perturbation set. Focusing on a binary setting as in prior studies [2,22,24], we define the subsets G_1, G_2 ⊆ G as the disadvantaged and advantaged groups respectively: the group with lower (higher) utility on the perturbation set is denoted as disadvantaged (advantaged). Our approach aims to increase the utility of the disadvantaged group, not to reduce the advantaged group's one; edges are only added to user nodes in G_1. Conversely, L_dist (see (1)) can be any differentiable distance function [26]:

L_dist = σ( dist(Ã, A) )    (3)

where σ is a sigmoid function to bound L_dist in the range [0, 1] as L_fair, and λ (see (1)) is a scaling factor that balances the two losses. We set λ = 0.5 to give more importance to the mitigation task, i.e., L_fair.
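The demographic-parity term can be illustrated as follows. This is a hedged sketch with our own names: the toy utility below is a simple hit rate over the top-k list, standing in for the (approximated, differentiable) NDCG used by the paper.

```python
import numpy as np

# Toy utility S: fraction of recommended top-k slots that hit a relevant item.
def group_utility(R_g, A_g, k=2):
    topk = np.argsort(-R_g, axis=1)[:, :k]        # top-k item indices per user
    hits = np.take_along_axis(A_g, topk, axis=1)  # relevance of recommended items
    return hits.mean()

# Operationalized demographic parity: |S(A_G1, R_G1) - S(A_G2, R_G2)|
def l_fair(R, A, g1, g2):
    return abs(group_utility(R[g1], A[g1]) - group_utility(R[g2], A[g2]))
```

Adding well-chosen edges for the users in g1 raises their scores for relevant items, shrinking this gap toward zero.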

Sampling Policies
Even if the loss function in (1) guides the edge selection, the set of edges to add could be vast. The user and item nodes of this set are described by several properties, which could support or obstruct our method. Thus, we applied several sampling policies to narrow the set of edges (connected to user nodes in G_1) to be added:

• BM (Base): the base algorithm with no sampling applied.
• ZN (Zero NDCG): selects the users with no relevant items in their top-k recommendation lists, i.e., NDCG@k = 0.
• LD (Low Degree): selects the Ψ_U% of user nodes with the lowest degree, i.e., fewest interactions in the training set.
• SP (Sparse): denoting a user u's density as the average popularity of the items u interacted with in the training set, it selects the Ψ_U% of users with the lowest density (highest sparsity), i.e., mostly interacting with niche items.
• FR (Furthest): selects the Ψ_U% of user nodes furthest from G_2, where the distance of u_1 ∈ G_1 is computed as the average shortest-path length between u_1 and all u_2 ∈ G_2.
• IP (Item Preference): following [4], we estimate the extent to which an item is preferred by G_1; I is then reduced by selecting the Ψ_I% of items most preferred by the same group.

Ψ_U% and Ψ_I% denote parameters to sample the user set U and the item set I respectively; we fix Ψ_U% = 35% and Ψ_I% = 20%. These policies were selected factoring in the way each demographic group interacts with the items (IP, SP), common phenomena described in the recommendation literature (ZN, LD), and the aggregation operation in GNN models (LD, FR). We distinguish between policies of type U (ZN, LD, SP, FR) and type I (IP), depending on whether the sampling is applied to the user or the item set. We also contemplated inter-group combinations between policies of type U and type I, but intra-group ones are excluded so as not to excessively reduce the user or item set.
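Two of the user-side policies above can be sketched as follows. The inputs and function names are ours, hypothetical stand-ins for the paper's pipeline: ndcg_at_k maps each user to its NDCG@k, and degrees maps each user to its number of training interactions.

```python
# ZN: disadvantaged users with no relevant items in their top-k list
def zn_policy(ndcg_at_k, disadvantaged):
    return [u for u in disadvantaged if ndcg_at_k[u] == 0.0]

# LD: the Psi_U% of disadvantaged users with the fewest training interactions
def ld_policy(degrees, disadvantaged, psi_u=0.35):
    k = max(1, int(len(disadvantaged) * psi_u))
    return sorted(disadvantaged, key=lambda u: degrees[u])[:k]
```

Inter-group combinations (e.g., ZN+IP) would intersect a user-side selection like these with an item-side one.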

EVALUATION
Our experiments aim at answering the following questions:

• RQ1: Do the edges selected by our method positively impact the recommendation fairness on the perturbation set?
• RQ2: To what extent is the unfairness mitigated by our method in comparison with state-of-the-art procedures?

Experiment Settings
Data Preparation. We relied on the artifacts of [5], which performed a fairness assessment on two corpora: MovieLens 1M (ML-1M) [20] and Last.FM 1K (LFM-1K) [8]. The advantaged groups and their representation w.r.t. the related sensitive attribute are Males (M) (71.7%) and Younger (Y) (56.6%) on ML-1M, and Females (F) (42.2%) and Older (O) (42.2%) on LFM-1K. We extended LFM-1K with time information, assigning to each user-artist pair the timestamp of the given user's last interaction with a song by that artist. Following [5], for each data set we arranged the interaction list of each user in ascending order of recency, and split the sorted lists with a 7:1:2 ratio into the train, validation, and test sets respectively. The validation set was used (i) to select the training epoch where the model reached the highest NDCG on the non-augmented data, and (ii) as the perturbation set for our method. During the evaluation step, the edges selected by our method were added to the training set and, if present, removed from the other two sets.
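The per-user chronological 7:1:2 split described above can be sketched as follows (our own helper, not the artifacts of [5]):

```python
# Split one user's interactions chronologically into train/validation/test
# with a 7:1:2 ratio; the validation slice doubles as the perturbation set.
def temporal_split(interactions):
    # interactions: (timestamp, item) pairs of a single user
    ordered = [item for _, item in sorted(interactions)]
    n = len(ordered)
    n_train, n_val = int(n * 0.7), int(n * 0.1)
    return (ordered[:n_train],                 # train
            ordered[n_train:n_train + n_val],  # validation / perturbation set
            ordered[n_train + n_val:])         # test
```

Sorting by timestamp first ensures that each user's test interactions are strictly more recent than their training ones.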
Based on an encoder-decoder architecture, GCMC [30] reconstructs the user-item relevance matrix by predicting the relevance of the missing entries in the adjacency matrix. Leveraging high-order connectivities in the user-item interaction graph, NGCF [32] propagates embeddings in the latter by injecting the collaborative signal into the embedding process. LightGCN [21] is lightened to include only the neighborhood aggregation, and propagates a single embedding in the graph as the weighted sum of the user and item embeddings.

Table 1: Mitigation performance of our method's policies: the relative difference in |ΔNDCG| between the scores measured on the perturbation set before and after applying each policy is reported. Negative values denote that unfairness was mitigated by the respective policy, i.e., |ΔNDCG| was reduced (−100% denotes optimal mitigation).
We optimized the hyper-parameters under a grid search strategy.

RQ1: Edges Augmentation Analysis
If our method successfully mitigates the model unfairness by adding the selected edges, the characteristics of the nodes composing such edges could describe a possible cause of the original unfairness (before the graph was augmented). Under a given policy, the features of the sampled nodes (Section 3.3) characterize the added edges. Thus, Table 1 depicts the unfairness mitigation performance of all the policies to highlight which node characteristics affect the unfairness the most. Such performance is the relative difference in |ΔNDCG| between the scores measured on the perturbation set before and after a policy was applied, where ΔNDCG = NDCG_{G_1} − NDCG_{G_2} is the difference between the average NDCG for G_1 and G_2. Some settings are positively affected by all policies, while others can successfully be augmented only by specific policies. Indeed, this aspect is highlighted by ZN+IP, the only policy mitigating unfairness across gender groups on ML-1M for GCMC and NGCF, with a noteworthy change of −100% for the latter. While the individual policies ZN (disadvantaged users with no relevant items out of the 10 recommended, i.e., NDCG@10 = 0) and IP (items mostly preferred by the disadvantaged users) could not report a similar result, their combination added interactions to the females, i.e., G_1, that were able to reduce the gap in NDCG between gender groups. Some policies systematically excel more than others under the same settings, such as ZN across age groups on ML-1M (GCMC) and on LFM-1K (LightGCN, NGCF). Hence, adding interactions to the users sampled by ZN could improve their recommendation utility. Indeed, with an in-depth inspection, we observed that the policies reducing ΔNDCG have a negligible effect on the NDCG of G_2, proving that our approach focuses only on improving the utility of G_1. Some policies consistently reducing ΔNDCG regardless of the model (ZN+IP on ML-1M for gender; ZN, SP, FR on LFM-1K for age) suggest that unfairness originates at the data set level, whereas other policies not working across different models (ZN, FR+IP on ML-1M for age; LD on LFM-1K for age) underline that the bias is also model-dependent.
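The metric reported in Table 1 can be illustrated with a small sketch (our own helper; the ΔNDCG values are made-up numbers, not the paper's results):

```python
# Relative difference (in %) of |delta_NDCG| before and after a policy is
# applied; -100% means the utility gap between groups was fully closed.
def rel_diff(delta_before, delta_after):
    return (abs(delta_after) - abs(delta_before)) / abs(delta_before) * 100.0
```

For instance, a gap shrinking from 0.04 to 0.0 yields −100%, while a gap shrinking from 0.04 to ±0.02 yields −50%; positive values would indicate the policy widened the gap.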

RQ2: Mitigation Procedures Comparison
In this section, we evaluate the trade-off between the recommendation utility and the unfairness mitigation performance of our method in comparison with SOTA algorithms. Based on the similarity to our evaluation protocol, we relied on the framework shared by [5] and compared our method with the mitigation procedures used for top-k recommendation. Given our focus on the mitigation task, we only considered models reporting high utility levels, which could effectively solve the recommendation task and reflect existing biases as in real-world scenarios. The GNN-based recommender systems used with our algorithm satisfy this property.
The following results regarding our method pertain to the policies that reported the lowest |ΔNDCG| on the perturbation set (Table 1).
Figure 1 highlights the extent to which each method affected the recommendation utility in NDCG (x-axis) and the disparity in the latter between user groups (y-axis). Our approach reports the best mitigation performance, given by the points labeled as Ours being systematically the lowest ones on the y-axis. Moreover, except for GCMC on ML-1M, our algorithm systematically reported a positive (right side of the x-axis) or negligible impact on the recommendation utility, whereas the other methods decreased it in several settings. In Table 2 we report the resulting levels of utility (NDCG) and fairness (ΔNDCG) after each algorithm was applied. Determining which setting is the best one depends on the specific application requirements and on the extent to which fairness is relevant. In terms of recommendation utility, our approach made the GNN-based models among the most effective in various settings, and it led to utility disparity levels significantly lower than those of the other systems reporting a high NDCG, e.g., LFM-1K on gender groups. Hence, our algorithm demonstrates greater reliability in mitigating unfairness and in improving recommendation utility than the other methods.

CONCLUSIONS
In this paper, we proposed an augmentation method that leverages explanation techniques to mitigate consumer unfairness in recommendations generated by a GNN-based system. Our experiments show that our technique is more reliable than SOTA algorithms in mitigating unfairness, and that it is able to increase the overall recommendation utility. Analyzing the augmented graph, we discovered that confining the algorithm to the disadvantaged user nodes who received no relevant items in their recommendation lists positively affects the utility of the latter. However, the augmentation had a limited impact on deep GNNs (GCMC, NGCF), primarily due to the diminished influence of the graph in the prediction process. It is also unclear whether our approach could improve the disparity in other metrics, given that it solely focuses on the NDCG. Future works will consider new policies, objective functions, and the application of our method to other models, not necessarily based on GNNs.

Figure 1: On the x-axis (y-axis) the relative difference, denoted as Rel. Diff., in recommendation utility NDCG (utility disparity ΔNDCG) between the scores measured before and after each method was applied on gender and age groups. Multiple points per method indicate the use of multiple models for each method. Positive (negative) values on the x-axis (y-axis) denote an increment (decrement) in NDCG (ΔNDCG).

Table 2: Recommendation utility (NDCG) and utility disparity (ΔNDCG) after applying each method on user groups. For each setting, the best and second best scores are in bold and italic respectively.