FairRec: Fairness Testing for Deep Recommender Systems

Deep learning-based recommender systems (DRSs) are increasingly and widely deployed in industry, bringing significant convenience to people's daily lives in many ways. However, recommender systems are also shown to suffer from multiple issues, e.g., the echo chamber and the Matthew effect, in which the notion of "fairness" plays a core role. While many fairness notions and corresponding fairness testing approaches have been developed for traditional deep classification models, they are essentially inapplicable to DRSs. One major difficulty is that there is still no systematic understanding and mapping between the existing fairness notions and the diverse testing requirements of deep recommender systems, not to mention further testing or debugging activities. To address this gap, we propose FairRec, a unified framework that supports fairness testing of DRSs from multiple customized perspectives, e.g., model utility, item diversity, and item popularity. We also propose a novel, efficient search-based testing approach to tackle the new challenge, i.e., a double-ended discrete particle swarm optimization (DPSO) algorithm, to effectively search for hidden fairness issues, in the form of certain disadvantaged groups, among a vast number of candidate groups. Given the testing report, we show that adopting a simple re-ranking mitigation strategy on the identified disadvantaged groups can significantly improve the fairness of DRSs. We conducted extensive experiments on multiple industry-level DRSs adopted by leading companies. The results confirm that FairRec is effective and efficient at identifying deeply hidden fairness issues, e.g., achieving 95% testing accuracy within 1/2 to 1/8 of the time.


INTRODUCTION
Recommender systems (RSs) effectively bridge the gap between quality content and people who might be interested, benefiting and facilitating potential consumers and content providers in many fields, e.g., e-commerce, social media, etc.
A recommender system (RS) collects users' feedback data during real-time recommendation and updates itself on a periodic (e.g., daily) basis. In the dynamic process of data collection, model training or updating, item display, user clicks, etc., various biases may be introduced, which subtly accumulate in the feedback-update loop of RSs [49]. If these biases are not disclosed in time, they can eventually lead to serious fairness issues, some of which are known as the echo chamber and the Matthew effect. For example, it has been shown that on MOOC platforms, courses taught by teachers from the United States are over-exposed by RSs, reducing the chances of teachers from the rest of the world being exposed [14].
In recent years, with the rise of deep learning, deep learning-based recommender systems have been increasingly and widely deployed in industry, influencing millions of users worldwide on a daily basis. For example, Microsoft released Deep Crossing [33] for advertisement recommendations across its products, and Google released Wide&Deep [10], which was applied to application recommendations in Google Play. More recently, various types of deep recommendation models have been developed in both academia and industry [16, 41]. In traditional statistical/collaborative-filtering-based RSs, the correlation between the recommendation output and the input features can be directly and explicitly obtained, so fairness issues are relatively easy to detect by expert inspection. In DRSs, however, the complicated DNN-based feature extraction and representation are not interpretable to humans. It is therefore challenging to analyze the deeply hidden fairness issues, which calls for a fine-grained fairness testing system for DRSs.
Fairness testing of deep learning models is an emerging research area in software engineering that aims to expose multiple kinds of fairness issues in a deep learning model, e.g., individual discrimination [23] and group disparity [17], using various kinds of search strategies, e.g., random sampling [13], probabilistic sampling [38], gradient-based approaches [48], and symbolic execution [3]. Despite this significant progress, existing fairness testing approaches mostly target traditional deep classification problems and are essentially not applicable to DRSs, given the following critical challenges. First, there is still no systematic understanding and mapping between the fairness metrics from existing deep learning research [11, 43] and the customized testing requirements of DRSs from a system point of view. For instance, a DRS user may require not only accurate but also diverse recommendation results. Second, industry urgently calls for an efficient fairness testing approach that can be used to periodically evaluate recommender systems and identify discriminated user groups in time. The reason is that DRSs use multiple sensitive attributes of sparsely distributed users, forming a vast search space with a much more extensive collection of candidate groups (e.g., millions) than traditional fairness testing settings. Even worse, the large amount of user and behavior data makes it even more challenging to develop an efficient fairness testing algorithm for DRSs.
In this work, we propose FairRec, a novel unified fairness testing framework specifically designed for DRSs to address the above challenges. As shown in Figure 1, FairRec supports the testing of multi-dimensional fairness metrics such as model performance, diversity, and popularity, meaning it considers not only how the model performs for different users but also how badly they might be affected by user-tailored fairness issues such as the echo chamber and the Matthew effect, so as to meet the practical needs of different users for fair recommendation. Moreover, FairRec embraces a novel fairness testing algorithm based on a carefully designed double-ended discrete particle swarm optimization (DPSO) algorithm, which improves testing efficiency by orders of magnitude (compared to exhaustive search) while ensuring high testing accuracy. Lastly, given the testing report of FairRec, adopting a simple re-ranking mitigation strategy on the identified disadvantaged groups can significantly improve the overall fairness of the DRS.
In summary, we make the following contributions:
• We establish a systematic understanding and mapping between the diverse testing requirements of DRSs and existing fairness evaluation metrics to meet various practical testing needs.
• We propose a novel fairness testing approach based on a double-ended discrete particle swarm optimization (DPSO) algorithm, which improves testing efficiency by orders of magnitude while ensuring high testing accuracy.
• We release FairRec together with all the data and models, which could benchmark and facilitate future studies on testing and debugging of DRSs.

PRELIMINARIES
In this section, we briefly introduce the relevant background of DRSs and multi-attribute fairness, then formalize our problem.

Deep Recommender System
A recommender system takes the historical behavior of users as input and outputs the items each user might be interested in. Let U be the user set and V the item set, where |U| = n and |V| = m. In this paper, we focus on deep learning-based recommender systems, which are usually more complicated and widely used in practice. A DRS usually contains a recall layer and a ranking layer. For any user u ∈ U, the recall layer initially selects k′ candidates from V based on the user's historical behavior. The ranking layer usually consists of a deep Click-Through-Rate (CTR) prediction model, which estimates the user's preference for every candidate item and finally selects k items to form the user's recommendation list L_u.
Definition 1 (Deep recommender model). Let x_u = (s_u, ŝ_u) be the feature vector of user u, where s_u denotes the user's sensitive attributes such as gender and age, and ŝ_u denotes the non-sensitive attributes. Let x_v be the feature vector of item v. The deep learning model is trained to predict the probability p_uv that user u clicks item v. A deep recommender model then outputs L_u, the list of k items with the top-k highest predicted probabilities, as the recommendation result for each user u.
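The ranking step of Definition 1 can be sketched in a few lines of Python. This is an illustrative toy, not FairRec's implementation: `ctr_model` stands in for any trained deep CTR model (here a dot product), and all names are our own.

```python
def recommend_top_k(ctr_model, user, candidate_items, k=5):
    # Score every candidate with the CTR model, then keep the k highest.
    scored = sorted(candidate_items, key=lambda v: ctr_model(user, v), reverse=True)
    return scored[:k]

# Toy stand-in for a trained deep CTR model (illustrative only):
toy_ctr = lambda u, v: sum(a * b for a, b in zip(u, v))
user = (1.0, 0.0)
items = [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5)]
print(recommend_top_k(toy_ctr, user, items, k=2))  # -> [(0.9, 0.1), (0.5, 0.5)]
```

In a real DRS, the candidate set would first be narrowed by the recall layer before this ranking step is applied.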

Fairness of Recommender System
There is a lack of consensus on the definition of fairness in tasks using deep learning.From a high level, the most widely recognized concepts of fairness are individual fairness [15,23] and group fairness [17].In personalized recommendation scenarios, individual fairness, which requires that two similar individuals should be treated similarly enough, is rather tricky to use.
In this work, we focus on group fairness, which requires groups differing only in sensitive attributes to be treated with little distinction. In particular, we focus on the fairness of groups formed by multiple sensitive attributes [13].
Definition 2 (Multi-attribute group). Let d ∈ N be the size of the sensitive-feature vector s, and let S_1, S_2, ..., S_d be the possible value sets of its entries s_1, s_2, ..., s_d. A multi-attribute group is the set of users sharing one concrete value assignment of these entries. Let M(u, L_u) represent the performance value of user u's recommendation result with respect to an evaluation metric. Given two multi-attribute groups g_1 and g_2 and a metric of concern, we use the following absolute value to measure the distance between the performance of the recommendations they receive:
    Δ_M(g_1, g_2) = | (1/|g_1|) Σ_{u∈g_1} M(u, L_u) − (1/|g_2|) Σ_{u∈g_2} M(u, L_u) |,    (1)
which enables us to define multi-attribute group fairness.
Definition 3 (Multi-attribute group fairness). For the entire user population, we use the maximum gap w.r.t. an evaluation metric M between any two groups to measure the group fairness of a DRS, which has been widely adopted in previous work [13, 30]. Given a metric M, the unfairness of a DRS can be defined as
    U_M = max_{g_1, g_2} Δ_M(g_1, g_2).    (2)
U_M measures the degree of group unfairness of a DRS w.r.t. the evaluation metric M: the larger the value, the more unfair the DRS is w.r.t. the metric. The value itself, together with the user groups that attain it, forms an important part of our testing results.
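Definitions 2-3 amount to averaging a per-user metric within each multi-attribute group and taking the maximum pairwise gap. A minimal sketch in Python (the dict-based representation and helper names are our own, not FairRec's):

```python
from itertools import combinations

def group_unfairness(metric_by_user, group_of_user):
    """U_M: maximum gap in the group-averaged metric M between any two
    multi-attribute groups (Definition 3).  metric_by_user maps user -> M(u, L_u);
    group_of_user maps user -> its sensitive-attribute value tuple."""
    sums, counts = {}, {}
    for u, m in metric_by_user.items():
        g = group_of_user[u]
        sums[g] = sums.get(g, 0.0) + m
        counts[g] = counts.get(g, 0) + 1
    avg = {g: sums[g] / counts[g] for g in sums}
    # Maximum pairwise gap Delta_M over all group pairs; 0 if fewer than 2 groups.
    return max((abs(a - b) for a, b in combinations(avg.values(), 2)), default=0.0)

metrics = {"u1": 1.0, "u2": 0.5, "u3": 0.25, "u4": 0.25}
groups  = {"u1": ("F", "20s"), "u2": ("F", "20s"), "u3": ("M", "30s"), "u4": ("M", "30s")}
print(group_unfairness(metrics, groups))  # |0.75 - 0.25| = 0.5
```

This brute-force pairwise comparison is exactly what becomes infeasible at industrial scale, motivating the search-based algorithm in Section 3.3.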

Problem Formalization
Fairness testing refers to any activity designed to reveal fairness bugs [9]. For a given DRS R that recommends k items from the item set to each user u, the primary goal of FairRec is to 1) measure the multi-attribute group fairness score defined above. Note that the reported score U_M is parameterized by a performance evaluation metric M, which allows FairRec to measure system fairness from different perspectives. Along with the measurable fairness score, 2) FairRec also reports a user-defined number of advantaged and disadvantaged user groups, which helps the model developer further debug the model and improve its group fairness. To achieve this goal, we need to find the testing candidates (user groups) whose performance values regarding a metric differ as much as possible, so that we can accurately uncover the severity of the unfairness problem of a DRS.

THE FAIRREC FRAMEWORK
In this section, we first provide an overview of our testing framework, and then present the detailed metrics and testing approach.

System Overview
We present the framework of FairRec in Figure 1, which consists of three main components, i.e., the input module, the testing module, and the results display module. Concretely, the input module takes the user and item data as well as the recommendation models. Notice that FairRec is a model-agnostic testing system: it only requires black-box query access to the target recommender models. In addition, the testing requirements of interest, such as the sensitive user attributes (e.g., gender and age) and the fairness metrics M, should also be configured. In the testing module, FairRec first loads the data and models and then identifies the advantaged and disadvantaged groups concerning the specified fairness metric via our specifically designed double-ended discrete particle swarm optimization algorithm, according to the configured requirements. More details of this effective and efficient search-based testing algorithm are given in Section 3.3. Finally, in the display module, FairRec presents a multi-dimensional testing report that includes the overall fairness evaluation results as well as the details of the found disadvantaged groups, aiming to provide insight for the subsequent bias mitigation.

RS-Tailored Evaluation Metrics
Group fairness needs to be measured in terms of an evaluation metric (EM), whether for traditional deep classification tasks or deep recommendation tasks. However, different from classification tasks, where the commonly agreed evaluation metric is prediction accuracy, it is highly challenging to define 'accuracy' for a personalized recommendation task, given the flexibility, diversity, and subjectivity of users' practical needs. In fact, how to define metrics that properly reflect these aspects remains an active research topic in the RS community [40, 46].
In this work, we focus on fairness from a user's perspective and systematically adopt five RS-tailored evaluation metrics in FairRec.
Note that FairRec can flexibly incorporate more evaluation metrics later. These metrics are selected considering multiple practical needs, i.e., 1) model performance metrics (AUC, MRR, and NDCG), 2) a diversity metric (URD, to measure the echo chamber effect), and 3) a popularity metric (URP, to measure the impact of the Matthew effect on users). In the following, we elaborate the details of these metrics in the top-k recommendation setting.
EM1.1 Area under Curve (AUC) [26]. It is defined as the area under the receiver operating characteristic curve, which plots the true positive rate against the false positive rate. AUC is the most commonly used metric to measure the performance of DRSs: it measures how likely an item of interest is ranked higher than one not of interest in a recommended list. A higher AUC value means the DRS recommends more accurately.
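The pairwise interpretation of AUC above can be computed directly: count, over all (relevant, irrelevant) item pairs, how often the relevant item scores higher, with ties counted half. A plain O(P·N) sketch (our own helper, not FairRec's code; production systems would use an optimized routine):

```python
def auc(scores, labels):
    """Pairwise AUC: probability that a relevant (clicked, label 1) item is
    scored above an irrelevant one (label 0); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5  # undefined without both classes; convention used here
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 3 of 4 pairs won -> 0.75
```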
EM1.2 Mean Reciprocal Rank (MRR) [39]. It measures, on average, how high users' target items are ranked in their recommendation lists. MRR is calculated with reciprocal ranks as below:
    MRR = (1/|U|) Σ_{u∈U} 1 / rank_u(v),    (3)
where y_uv = 1 denotes that item v is the target item of u, and rank_u(v) denotes the ranking position of v in the recommendation list of u. If a user has no target item in his list, the corresponding term in the sum is assigned 0. A higher MRR value means the target items are ranked higher (which can lead to more clicks), indicating more accurate recommendation. EM1.3 Normalized Discounted Cumulative Gain for k Shown Recommendations (NDCG@k) [21]. It is a commonly used metric for evaluating the performance of a recommender system based on the graded relevance of the recommended items. It ranges from 0.0 to 1.0, with 1.0 representing the ideal recommendation result, i.e., users get the recommendations they are interested in and their favorite items get the most exposure.
    NDCG@k = DCG@k / IDCG@k,  with  DCG@k = Σ_{i=1}^{k} (2^{r(v_i)} − 1) / log2(i + 1),    (4)
where i denotes the sorting position of the item and r(v_i) denotes the model prediction output (graded relevance) of v_i. DCG@k and IDCG@k are calculated based on the recommended list and the ideal list of u, respectively. A higher value of NDCG@k indicates better recommendation.
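The two ranking metrics above can be sketched in a few lines. This is a minimal illustration under the standard formulations (the exponential-gain DCG form and all function names are our own assumptions, not FairRec's code):

```python
import math

def mrr(ranked_lists, target_item):
    """Mean Reciprocal Rank: average of 1/rank of each user's target item;
    a user contributes 0 when the target is absent from the list."""
    total = 0.0
    for u, items in ranked_lists.items():
        t = target_item[u]
        total += 1.0 / (items.index(t) + 1) if t in items else 0.0
    return total / len(ranked_lists)

def ndcg_at_k(relevances, k):
    """NDCG@k from graded relevances listed in recommended order."""
    dcg = sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(mrr({"u1": ["a", "b"], "u2": ["c", "d"]}, {"u1": "b", "u2": "x"}))  # (1/2 + 0)/2 = 0.25
print(ndcg_at_k([3, 2, 0], 3))  # already in ideal order -> 1.0
```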
It can be noticed that, in general, the metrics AUC, MRR, and NDCG@k all describe how well the recommendations match users' interests. However, a user may sometimes want to be shown not only items he already knows he would be interested in, but also other types of items. We need a recommendation diversity metric to capture this.
EM2 Diversity (URD) [31]. It measures the level of diversity in users' recommendations. A higher value indicates that a user receives a more varied set of recommendations, thereby reducing the impact of the echo chamber effect. Specifically, we use the intra-list similarity [50] to calculate the diversity of a recommendation list:
    URD(u) = 1 − (2 / (k(k−1))) Σ_{i<j} sim(v_i, v_j),    (5)
where i and j denote sorting positions of items in L_u, v_i denotes the i-th item, and sim(v_i, v_j) denotes the similarity between v_i and v_j. In this work, we use the Jaccard similarity to measure the similarity between recommended items.
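The intra-list-similarity computation can be sketched as follows, using Jaccard similarity over per-item attribute sets (e.g., genre tags). The tag-set representation is our illustrative assumption; any item similarity could be plugged in:

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def urd(rec_list, item_tags):
    """User Recommendation Diversity: 1 minus the average pairwise (Jaccard)
    similarity of the recommended items; higher means more diverse."""
    k = len(rec_list)
    if k < 2:
        return 0.0
    sims = [jaccard(item_tags[rec_list[i]], item_tags[rec_list[j]])
            for i in range(k) for j in range(i + 1, k)]
    return 1.0 - sum(sims) / len(sims)

tags = {"m1": {"action"}, "m2": {"action"}, "m3": {"drama"}}
print(urd(["m1", "m2", "m3"], tags))  # 1 - (1 + 0 + 0)/3 = 2/3
```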
EM3 Popularity (URP) [2]. It measures how well the popularity of the recommended items matches the user's popularity preference. A higher value indicates the user is more affected by the Matthew effect, e.g., a user who prefers niche movies is only recommended very popular ones. The general popularity of an item v is calculated by
    pop(v) = c(v) / |T|,    (6)
where T denotes the training set and c(v) denotes the number of times item v has been interacted with in the training set. Based on Equation 6, we use the following absolute value to define the URP metric:
    URP(u) = | (1/|H_u|) Σ_{v∈H_u} pop(v) − (1/|L_u|) Σ_{v∈L_u} pop(v) |,    (7)
where H_u and L_u denote the historical interaction list and the recommendation list of user u, respectively. The smaller the value, the better the popularity of the recommended items matches the popularity preference of the user.
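Equations 6-7 translate directly into code. A minimal sketch (data layout and helper names are our own assumptions):

```python
def item_popularity(train_interactions):
    """pop(v): interaction count of v normalized by the number of
    training interactions (Equation 6)."""
    counts = {}
    for _, v in train_interactions:
        counts[v] = counts.get(v, 0) + 1
    total = len(train_interactions)
    return {v: c / total for v, c in counts.items()}

def urp(history, rec_list, pop):
    """|mean popularity of the user's history - mean popularity of the
    recommendation list| (Equation 7); smaller means a better match
    with the user's popularity preference."""
    avg = lambda items: sum(pop.get(v, 0.0) for v in items) / len(items)
    return abs(avg(history) - avg(rec_list))

train = [("u1", "a"), ("u2", "a"), ("u1", "b"), ("u2", "c")]
pop = item_popularity(train)             # a: 0.5, b: 0.25, c: 0.25
print(urp(["b", "c"], ["a", "a"], pop))  # |0.25 - 0.5| = 0.25
```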

Search-based Testing Algorithm
Given a specific evaluation metric, the core challenge of multi-attribute group fairness testing is to measure the maximum gap between the most advantaged and most disadvantaged user groups. This is highly non-trivial for realistic DRSs due to the following challenges: (i) First, compared to the traditional deep learning fairness testing literature, which only targets a single or a few sensitive attributes, a DRS may involve dozens of sensitive attributes (e.g., gender, age, nationality, etc.) with multi-dimensional discrete values. The combinations of these attributes map the testing candidates into an extremely high-dimensional and sparse search space (e.g., the number of user groups to search could be in the millions). (ii) Even worse, an industrial DRS can have a huge number of users (e.g., 360K in some of our experiments) who are sparsely distributed in the search space. The complexity of testing such a real DRS inherently requires an extremely high testing budget, while testing is often required to complete within an acceptable, limited time, making high testing efficiency critical.
In FairRec, we propose a novel double-ended discrete particle swarm optimization (DPSO) algorithm to address this challenging search problem for DRSs, making testing much more efficient while ensuring high accuracy. Particle swarm optimization (PSO) is an evolutionary computation technique [22] for searching for the optimal solution of an optimization problem, derived from simulating the social behavior of birds within a flock. PSO has previously been shown to be effective and efficient in solving multiple kinds of testing problems [28, 44].
Algorithm 1 shows the overall testing workflow of FairRec. The inputs include the test dataset D, the set of sensitive attributes S, and the testing budget, i.e., the maximum number of iterations N. At line 1, DPSO first sets up the search space and the optimization objective of testing and initializes the particle swarms. From line 2 to line 6, DPSO iteratively evaluates the testing objective and updates the particle swarms until the testing budget is exhausted. Several important customizations and optimizations are made specifically for the fairness testing problem of DRSs.
First, to effectively evaluate the maximum gap between two user groups, we propose a bi-end search solution with two particle swarms running simultaneously: one swarm is designed to find the most advantaged group g_adv, and the other to find the most disadvantaged group g_dis in the opposite direction, according to Equation 1. Specifically, each particle represents a user group segmented by d sensitive attributes, i.e., a point in the d-dimensional search space. For instance, the group consisting of 30-year-old male doctors (i.e., gender=male/0, age=30, and occupation=doctor/9) can be denoted as the point (0, 30, 9) in a 3-dimensional search space.
Second, during the testing process, each particle flies through the search space and memorizes its individual best position pbest (i.e., the grouping condition that yields the current maximum fairness gap between the divided groups), which is shared with the other particles. The globally optimal grouping condition gbest can then be obtained from the pbest values of the two swarms (which have opposite goals) over the whole search history. In each iteration, particles in the two swarms share better positions with each other and dynamically adjust their own position and velocity in each dimension according to the calculated fitness (i.e., the difference between the metric values of the two divided groups) of themselves and the other members. The search process iterates until it converges to the global optimal position gbest, and the difference in the group-fairness metric values Δ_M(g_adv, g_dis) (see Equation 1) is reported as the fairness measurement of the DRS. Note that the algorithm can easily be configured to memorize a certain number of disadvantaged groups for the final report by maintaining a list of pbest values during the search.
We then introduce the carefully designed optimizations of the two key steps in DPSO that further improve testing efficiency.
Initialize. As mentioned above, users are sparsely distributed in the multi-dimensional partitioned feature space. We therefore propose to initialize the two particle swarms according to the actual distribution of the testing candidate users, so as to quickly locate the target area in the search space. The concrete initialization scheme is illustrated in Algorithm 2. First, we set the search boundary B_i (line 1) and the probability distribution P_i (line 2) of each sensitive attribute a_i ∈ S. For example, given a_i = "gender", if a group contains 4 females (a_i = 0) and 6 males (a_i = 1), we have B_i = [0, 1] and P_i = [0.4, 0.6]. Then, we initialize the particle swarms based on the user distribution, with one swarm searching for g_adv and the other for g_dis (lines 3-4). Finally, the velocity v and individual best pbest of every particle p are initialized (lines 5-8). We also visualize our distribution-based initialization method and the traditional initialization method in Figure 2 for a more intuitive comparison, in which the green circles represent the actual user distribution among the testing candidates. The red stars in Figure 2(a) denote particles initialized via random initialization, while those in Figure 2(b) are particles initialized in FairRec based on the modeled distribution. Figure 2 shows that the proposed distribution-based initialization significantly improves particle coverage, allowing the generated particles to concentrate on the target region and thus improving the effectiveness of searching for the two target groups.
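The distribution-based initialization can be sketched as follows: each particle's coordinate for a sensitive attribute is sampled from the empirical user distribution rather than uniformly, so particles start in densely populated regions. This is an illustrative sketch only; all names and the particle representation are our own assumptions:

```python
import random

def init_swarm(attr_values, attr_probs, n_particles, rng=random.Random(0)):
    """Initialize particles by sampling each sensitive attribute from the
    empirical user distribution (cf. Algorithm 2), instead of uniformly."""
    swarm = []
    for _ in range(n_particles):
        position = [rng.choices(attr_values[a], weights=attr_probs[a])[0]
                    for a in attr_values]
        swarm.append({"pos": position,
                      "vel": [0.0] * len(position),
                      "pbest": list(position)})
    return swarm

# Toy boundaries B and distributions P for two sensitive attributes:
values = {"gender": [0, 1], "age": [10, 20, 30]}
probs  = {"gender": [0.4, 0.6], "age": [0.2, 0.5, 0.3]}
swarm = init_swarm(values, probs, n_particles=5)
print(swarm[0]["pos"])
```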
Update. Algorithm 3 illustrates the details of fairness evaluation and update. After initialization, all particles move toward the target groups at each iteration, guided by pbest and gbest (line 3). The velocity and position of each particle are updated as follows:
    v_i^{k+1} = w·ε + c_1 r_1 (pbest_i − x_i^k) + c_2 r_2 (gbest − x_i^k),    (8)
    x_i^{k+1} = x_i^k + v_i^{k+1},    (9)
where x_i^k and v_i^k are the position and velocity of the i-th particle at the k-th iteration, r_1 and r_2 are random numbers between 0 and 1, and ε is a random number drawn from the standard normal distribution. Each particle p_i updates its next search direction under the dual guidance of the individual best position pbest_i and the global best position gbest. Constants c_1 and c_2 are the weighting factors of the stochastic acceleration terms, which pull particle p_i towards the pbest_i and gbest positions. To improve the particles' ability to escape from local optima, we introduce thermal motion [37] as the first term of Equation 8, where w denotes the inertia factor controlling the magnitude of the thermal motion.
After each particle updates its position, its fairness fitness is calculated and stored in the InfoBase (lines 8-9). The InfoBase provides a platform for all particles to share information, thus avoiding duplicate computation (lines 4-6). After each iteration, the pbest_i of each particle p_i is updated (lines 11-16), and the two swarms update the candidate global best position gbest using the shared information (lines 17-18). Notice that the update directions of the two swarms are opposite since their search targets differ.
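A single update step in this style can be sketched as below. We assume a thermal-motion inertia term w·ε with ε drawn from a standard normal, a velocity clip at ±v_max, and rounding back onto the discrete attribute grid; all names are illustrative and this is not FairRec's implementation:

```python
import random

def update_particle(pos, pbest, gbest, w=0.09, c1=2.0, c2=2.0,
                    v_max=2.0, bounds=None, rng=random.Random(1)):
    """One DPSO-style update: thermal motion plus pulls toward the
    personal best and the global best, per dimension."""
    new_pos, new_vel = [], []
    for d in range(len(pos)):
        r1, r2 = rng.random(), rng.random()
        v = (w * rng.gauss(0, 1)                 # thermal motion term
             + c1 * r1 * (pbest[d] - pos[d])     # pull toward personal best
             + c2 * r2 * (gbest[d] - pos[d]))    # pull toward global best
        v = max(-v_max, min(v_max, v))           # velocity limit
        x = round(pos[d] + v)                    # stay on the discrete grid
        if bounds:
            lo, hi = bounds[d]
            x = max(lo, min(hi, x))              # respect search boundaries
        new_pos.append(x)
        new_vel.append(v)
    return new_pos, new_vel

p, v = update_particle([0, 10], pbest=[1, 30], gbest=[1, 20],
                       bounds=[(0, 1), (10, 30)])
print(p, v)
```

Note how the update stays discrete: velocities are real-valued, but positions are rounded and clipped so that every particle always denotes a valid attribute-value combination.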

EXPERIMENT
In this section, we first describe our experimental setup, and then evaluate the effectiveness and efficiency of FairRec by answering several key research questions.

Experimental Setup
Datasets. We adopt four widely used large-scale public recommendation benchmark datasets in our experiments; their details are as follows. MovieLens [18] is composed of 1 million movie ratings from 6,040 users for 3,883 movies. The sensitive attributes in this dataset are gender, age, and occupation, which divide users into 294 groups. LFM360K [7] contains music listening history from 360,000 users, with sensitive attributes of gender, age, and nationality. In our experiments, interaction data of 260,000 users and 140,000 music tracks is kept for training after filtering out records with missing values; these users can be divided into 53,058 groups based on the sensitive attributes. BlackFriday is a sales dataset obtained from Kaggle which consists of 550,068 purchases from 5,889 users for a total of 3,566 items. The sensitive attributes in this dataset are gender, age, occupation, city category, years of stay in the current city, and marital status, based on which users can be divided into 8,820 groups. The Amazon [42] electronics dataset includes purchase behavior from 175,878 users with 944,347 reviews for 29,391 items on Amazon. It does not contain any personal user information. Following the setup of [25], we categorize users based on their historical behavior in terms of the number of interactions, total consumption, and the highest price of items purchased, and use these as the sensitive attributes. Each attribute contains ten values from 0 to 9, so users can be divided into 1,000 groups. Following the setting of [19], we put each user's last positive interaction into the test set and the remaining ones into the training set. For each positive interaction, we randomly sample one item without user interaction as a negative interaction in the training set. During testing, we further randomly sample 49 items a user has not interacted with and combine them with the target item to form the candidate items for recommendation.
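The leave-one-out split and negative sampling described above can be sketched as follows (a simplified illustration of the evaluation protocol; function names and data layout are our own):

```python
import random

def leave_one_out_split(interactions_by_user):
    """Put each user's last positive interaction into the test set and
    the rest into training, as described in the setup."""
    train, test = {}, {}
    for u, items in interactions_by_user.items():
        train[u], test[u] = items[:-1], items[-1]
    return train, test

def sample_candidates(test_item, all_items, interacted, n_neg=49,
                      rng=random.Random(0)):
    """Test candidates: the target item plus n_neg uniformly sampled
    items the user never interacted with (50 candidates in total)."""
    pool = [v for v in all_items if v not in interacted and v != test_item]
    cands = rng.sample(pool, min(n_neg, len(pool))) + [test_item]
    rng.shuffle(cands)
    return cands

train, test = leave_one_out_split({"u1": ["a", "b", "c"]})
print(train["u1"], test["u1"])  # ['a', 'b'] c
cands = sample_candidates("c", [f"i{j}" for j in range(60)] + ["c"], {"a", "b"})
print(len(cands))  # 50
```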
Recommendation Models. We adopt four classic deep CTR prediction models for recommendation, i.e., Wide&Deep [10], DeepFM [16], DCN [41], and FGCNN [27], owing to the exhaustive academic effort on them and their tremendous number of industry applications. These models are trained on the above four datasets following the setup of [34]. In the testing stage, the top-5 recommendation list is generated from the 50 candidate items according to the predicted CTR in descending order. The recommendation performance of the reproduced models is shown in Table 1. Baselines. We are not aware of any fairness testing method specialized for DRSs from the perspective of multiple sensitive attributes. Since Themis [13] and TestSGD [47] have explored multi-attribute group fairness in classification tasks, we take them as baselines by extending them to our tasks. For a fair comparison, all three methods group users based on the same sensitive attributes and quantify unfairness using the metrics introduced in Section 3.2. Specifically, in Themis, the target groups are found via brute-force enumeration over the sparse values of the different attributes; we consider these the globally optimal target groups, and the fairness scores it obtains represent the accurate values. In TestSGD, a threshold parameter θ is used to improve testing efficiency by filtering out groups containing fewer than θ percent of the total users; it then calculates the fairness scores of the remaining groups and finds the target groups by traversal. In our experiments, we set θ to 0.01, 0.005, and 0.002 for datasets of different scales to investigate the impact of θ on the testing results of TestSGD. Moreover, to mitigate the impact of the long-tail distribution of subgroups, we also filter out groups containing fewer than 0.001 percent of the population for all testing methods.
Implementation. All experiments are conducted on a server running Ubuntu 18.04 with one Intel(R) Xeon(R) E5-2682 v4 CPU at 2.50GHz, 64GB of memory, and two NVIDIA GeForce GTX 1080 Ti GPUs. To mitigate the effect of randomness, all experimental results are averaged over 5 runs. The hyper-parameters used in our experiments are: a) the inertia factor w = 0.09, b) the acceleration parameters c_1 = c_2 = 2, and c) the velocity limit v_max = 2, obtained by experimental analysis; more details of the analysis are discussed in the following section.

Research Questions
RQ1: Is FairRec effective enough to reveal and measure the unfairness of a DRS?
To answer this question, we evaluate FairRec on the four datasets and four DRSs using the five metrics described in Section 3.2. The main results are summarized in Table 2, in which each entry denotes a multi-attribute group fairness score (Definition 3), i.e., the difference in metric values between the most advantaged and most disadvantaged groups found by the corresponding testing method. The larger the score, the more unfair the DRS. The rows underlined with a dotted line denote the results of Themis, which are the optimal values (corresponding to the optimal target groups) for each fairness metric. To evaluate the effectiveness of TestSGD and FairRec, we define testing accuracy as the ratio of a testing method's entry to that of Themis.
Both TestSGD and FairRec may end up with locally optimal target groups. The bold value in each column denotes the most accurate result, i.e., the one closest to that of Themis. Table 2 presents a variety of observations and insights, as follows.
First, FairRec almost always obtains the most accurate testing results (occupying the bold values in ~90% of cases), achieving ~95% testing accuracy, compared to 94% and 43% for TestSGD. A closer look reveals that across all metrics, FairRec achieves an average of 95.27%, 97.37%, 92.38%, and 92.83% of the optimal results on MovieLens, LFM360K, BlackFriday, and Amazon, respectively. Besides, the performance of FairRec is rather stable despite the random nature of DPSO. For instance, even on LFM360K, which has the most user groups, the testing accuracy of FairRec only fluctuates in the small range of 88.68%-100%, owing to the DRS-specific optimizations described in Section 3.3, which greatly improve the stability of FairRec.
The testing results can also be reliably used as evidence to assess the fairness deficiencies of different DRSs. For example, considering MovieLens and the URD fairness score (whether a user gets diverse recommendations), with Themis we can conclude that Wide&Deep is the most unfair, with a difference of 0.1786 (for FGCNN it is 0.1346), indicating that Wide&Deep presents users with the least diverse recommendations and the echo chamber is most likely to occur. With FairRec, we reach a similar conclusion that Wide&Deep is more unfair than DCN and FGCNN. However, TestSGD with filtering threshold 0.01 would incorrectly conclude that Wide&Deep is the most fair, deriving the smallest URD fairness value. It is worth noticing that FGCNN exhibits the most serious fairness issues among the four DRSs in most cases. A reasonable explanation is that the feature intersection process of FGCNN is too complex, which causes it to be more sensitive to individual information and tends to amplify individual differences.
In addition, we also designed a set of experiments to compare the testing accuracy of the three methods given a fixed testing budget, as shown in Table 3. The testing time is dynamically set as t = 0.005 × (N_u + N_g), where N_u and N_g denote the number of users and the number of groups divided based on the sensitive attributes, respectively.
In Table 3, the largest value in each column is marked in bold, representing the biggest difference in the target groups found within the given time. Note that, different from Table 2, the group difference found by Themis is no longer always the largest, while that found by FairRec is in most cases (e.g., 79% compared to 62% for Themis and 55% for TestSGD). The reason is that for fixed-time testing, Themis degrades to random search, which may fail to find the target groups with the globally maximal difference, while FairRec is more effective than Themis and TestSGD in general. The results on BlackFriday and Amazon are attached in the Appendix.
Second, how fair a DRS is is not purely determined by the model itself, but is also influenced by the choice of the fairness metric. For instance, on BlackFriday, DeepFM achieves the smallest fairness score (being the most fair) regarding one metric (0.0591 by Themis), while it is also the most unfair among the four DRSs regarding another. What is more, we observe that fairness performance regarding different metrics may not be correlated in the expected way. For example, although three of the metrics all describe how well the recommendations capture user interest, a DRS may not achieve the same level of fairness for each of them. On MovieLens, DeepFM is the most fair regarding one of these metrics, but is also the most unfair regarding another.
To examine the observations above in more detail, we depict heat maps of the correlation between different metrics with regard to the DRSs, for the four datasets, in Figure 3. In each map, a block represents the correlation value between the two corresponding metrics (computed from their values over the four DRSs). We can see that across all four datasets, some metric pairs are consistently negatively correlated while others are positively correlated, implying that replacing the DRS to improve one metric may increase or decrease the others. Also, in general the popularity metric appears to be negatively correlated with the utility metrics, meaning that if we want the system to better capture whether users prefer popular items, we may end up sacrificing recommendation performance. More importantly, while it is often assumed that more diverse recommendations reduce performance, we observe the opposite at times, namely that the diversity and utility metrics are sometimes positively correlated. Hence, it is hardly possible to simultaneously improve fairness on all the metrics by choosing among DRSs (none of the four DRSs is the most fair for all the metrics), and a DRS developer has to consider the specific requirements of the application and carefully make trade-offs when choosing the appropriate metrics.
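The per-dataset heat maps above can be reproduced with a few lines of code. The sketch below (the metric names and scores are purely illustrative, not values from our tables) computes the pairwise Pearson correlation between fairness metrics over the four DRSs:

```python
import numpy as np

# Hypothetical fairness scores of four DRSs (rows) under three metrics (cols).
# Real values would come from the testing report (cf. Table 2).
scores = np.array([
    [0.12, 0.34, 0.08],  # Wide&Deep
    [0.10, 0.30, 0.11],  # DeepFM
    [0.15, 0.28, 0.09],  # DCN
    [0.18, 0.25, 0.13],  # FGCNN
])

# Each heat-map block is one entry of this metric-by-metric correlation matrix.
corr = np.corrcoef(scores.T)
print(corr.shape)  # (3, 3): one row/column per metric
```

Plotting `corr` with any heat-map utility then yields a figure of the same kind as Figure 3.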
Third, we compare FairRec with the vanilla PSO to validate the delicate designs of our DPSO algorithm in FairRec. Figure 4 shows the results, where the horizontal axis represents the different datasets and the vertical axis represents the improvement obtained by the optimized method. On the four datasets, the testing results of the optimized DPSO algorithm improve by 4.90%, 103.01%, 22.3% and 9.35%, respectively, while the average time consumption only increases by 11.01%. In particular, in the case of sparse user distribution, the testing results of the optimized algorithm improve by 103.01% on average for LFM360K (the largest dataset), while the time consumption increases by only 8.75%. The reason is that our method makes the particles concentrate in the densely distributed regions of users and then move quickly towards the optimal solution, thus automatically filtering out many invalid groups and saving a lot of computation time.
With all the observations above, we arrive at the following answer. Answer to RQ1: FairRec is effective enough, even under limited time, to reveal and measure the unfairness of DRSs, achieving ∼95% testing accuracy with ∼half to 1/8 of the time. Given a fixed testing budget, FairRec is the most effective.

To answer this question, we evaluate the efficiency of FairRec and compare it with Themis and TestSGD. The main quantitative results are shown in Table 2. It is clearly observed that FairRec achieves testing performance comparable to Themis (∼95% testing accuracy) in around half to 1/8 of the time Themis requires. Moreover, the efficiency advantage of FairRec becomes more significant as the numbers of users and sensitive attributes in the testing candidates grow. For instance, on the LFM360K dataset, FairRec improves the testing efficiency by more than 5 times and 1 time compared to Themis and TestSGD with threshold 0.002, respectively. Notably, Themis requires more than 4.5 hours to test just 260,000 users, which is almost infeasible in industry scenarios with critical efficiency needs.
Considering that the numbers of users and groups are two critical factors affecting the testing efficiency, in this experiment we further explore their impact on the testing efficiency of the three testing approaches. The results are shown in Figure 5, in which the horizontal axis denotes the number of groups after division and the vertical axis represents the total number of users. From Figure 5, we can see that the time consumption of Themis increases exponentially as the numbers of groups and users increase. Concretely, in the test against DCN, as the number of users increases from 20,000 to 100,000 and the number of groups increases from 10,000 to 40,000, the time consumption grows from 756 seconds to 9,768 seconds. Such a level of efficiency is unacceptable in realistic industry recommendation scenarios, where there are usually tens of millions or even hundreds of millions of users. TestSGD shows a similar trend in efficiency as the number of users grows. When the total number of users remains the same and the number of groups increases, i.e., the user distribution becomes sparser, TestSGD filters out more groups that do not meet the size requirement, so its testing efficiency becomes higher. However, this comes at the expense of testing effectiveness, and the improvement is not significant enough either, which is also corroborated in Table 2.
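The combinatorial growth that makes brute-force enumeration so costly is easy to see: the number of candidate multi-attribute groups is the product of the sensitive attributes' cardinalities. A minimal sketch (the attribute names and cardinalities below are made up for illustration):

```python
from math import prod

# Hypothetical sensitive attributes and their cardinalities.
cardinalities = {"gender": 2, "age_bucket": 7, "country": 50, "occupation": 21}

# Size of the Cartesian product S = S1 x S2 x ... x Sk, i.e., the number of
# candidate groups a brute-force tester such as Themis must enumerate.
n_groups = prod(cardinalities.values())
print(n_groups)  # 14700
```

Adding one more attribute multiplies this count again, which is why an exhaustive tester scales so poorly while a search-based tester need only visit a fraction of the groups.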
In comparison, the growth trend of FairRec's time consumption is relatively smoother as the numbers of users and groups increase. Concretely, the time overhead of FairRec increases from 114 seconds to 1,242 seconds as the test scenario changes from the simplest to the most complex, which is only 15% and 12% of the time required by Themis under the same conditions, respectively. We believe such high efficiency mainly benefits from three delicate designs in FairRec. First, the distribution-based particle initialization modeled from the testing candidates can effectively improve the coverage of particles, thus helping to quickly locate the target area in the search space. Second, the specifically designed double-ended search scheme can boost the search process and guarantee search convergence. Besides, the designed InfoBase helps different particle swarms share information with each other, avoiding much unnecessary computational overhead.
Answer to RQ2: FairRec's testing efficiency outperforms Themis and TestSGD by a significant margin, and FairRec is comparatively more efficient in more complex test scenarios.
Hyperparameter experiment. Usually, the size of the initialized populations and the maximum number of iterations influence the accuracy and efficiency of the testing. We design a hyperparameter experiment to analyze the effect of these two hyperparameters. Define r = N/|S|, the ratio of the number of initialized particles to the size of the set of user groups. The result is shown in Figure 6(a). The smaller r is, the more difficult it is for the particles to collect enough information in a high-dimensional search space, resulting in low accuracy. As r increases, the accuracy of FairRec becomes more stable and the testing result gets closer to the optimal result, while the testing time becomes longer. We suggest choosing r in the range of [0.002, 0.005] in complex cases, which ensures both high testing accuracy and testing efficiency. We then set r = 0.005 to discuss the impact of the number of iterations, as shown in Figure 6(b). As the number of iterations increases, the overall testing accuracy improves and eventually approaches the optimum. In this case, we suggest choosing the number of iterations in the range of [15, 20]. Note that in relatively simple cases such as MovieLens and Amazon, where the number of groups is at most 1,000, we set r in the range of [0.1, 0.2] and the number of iterations in the range of [5, 10] to obtain better results. Our results on the two hyperparameters provide testers with insights on how to trade off testing efficiency against accuracy.

RQ3: Can we use the testing results of FairRec to improve the fairness of DRSs?
Besides uncovering and evaluating the severity of fairness problems, our testing also aims to provide insight and guidance for improving the group fairness of a DRS. Note that our testing report not only includes the multi-attribute group fairness scores as shown in Table 2, but also provides a user-defined number of advantaged groups and disadvantaged groups (recall that the InfoBase records the metric values of each group). In this way, a model developer can target any specific disadvantaged group (not necessarily the worst one) regarding a specific metric for bias mitigation.
We adopt a simple re-ranking based method [14] in the experiment to improve the recommendation performance over disadvantaged groups, showing the usefulness of fixing these groups for improving the overall fairness. We selected the 10% of groups with the worst performance regarding each of the four metrics for optimization, respectively. To better demonstrate the effect of bias mitigation, we first directly show the changes in the relevant metrics for these groups before and after optimization, as shown in Table 4. For example, considering MovieLens and Wide&Deep, the corresponding metric increased by 87.72%, meaning that disadvantaged groups gain more access to their target items. In addition, considering Amazon and DeepFM, the popularity metric decreases by 29.79%, which indicates that the recommendation results are more in line with the popularity preferences of disadvantaged groups. The change in the overall fairness score is shown in Table 6 in the Appendix due to space limits.
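As a rough illustration of the post-processing idea (a hedged sketch, not the exact method of [14]), a re-ranking step can boost the base model's scores for certain items before cutting off a disadvantaged user's top-k list; every name and parameter below is hypothetical:

```python
def rerank(scored_items, boost_ids, alpha=0.1, k=10):
    """Re-rank one user's candidate list from the base DRS.

    scored_items: list of (item_id, score) pairs from the base model.
    boost_ids: items to promote for a disadvantaged group, e.g., items
               matching the group's interests or less popular items.
    alpha: additive score boost (assumed constant for simplicity).
    """
    adjusted = [(i, s + alpha if i in boost_ids else s) for i, s in scored_items]
    adjusted.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return [i for i, _ in adjusted[:k]]

# Toy usage: item "c" overtakes "b" after the boost.
top = rerank([("a", 0.9), ("b", 0.5), ("c", 0.45)], boost_ids={"c"}, k=2)
print(top)  # ['a', 'c']
```

Because the adjustment only touches the model's output, no retraining is needed, which is what makes this style of mitigation cheap enough to apply per identified group.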
Answer to RQ3: The test results of FairRec can provide valuable insight and guidance for bias mitigation.

RELATED WORK
Fairness testing. This work is closely related to fairness testing of deep learning models. Many existing works focus on individual fairness testing [12, 48]. They try to generate test cases that are discriminated against based on sensitive attributes, e.g., by changing the value of a sensitive attribute. In terms of group fairness, a number of definitions have been proposed [6, 29]. For DRSs, our main objective is to identify and address discrimination against real users within the recommender system, rather than to expose all instances of discrimination across the population. Therefore, individual fairness testing methods based on generated inputs are not applicable, and we focus on group fairness testing in this work. However, group fairness testing has so far received insufficient attention. Si et al. [35] proposed a statistical testing framework to detect whether a model satisfies multiple notions of group fairness, such as equal opportunity, equalized odds, etc. Galhotra et al. proposed Themis [4, 13] to measure software group discrimination with two metrics, i.e., the group discrimination score and the causal discrimination score. Themis groups users by various sensitive attribute values and then obtains the group fairness score for each group by brute-force enumeration. Zhang et al. proposed TestSGD [47] to measure group discrimination. It automatically generates an interpretable rule set (i.e., a reference for grouping), and each rule can be used to dichotomize the population. TestSGD filters out the groups whose size falls below a certain threshold, e.g., 0.01, and then brute-force enumerates the remaining groups to obtain their fairness scores. However, the group fairness testing methods mentioned above mostly target software and classification systems, which are intrinsically different from DRSs.
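To make the contrast with FairRec's search concrete, the Themis-style brute-force scheme described above can be sketched as follows (the grouping attributes and the metric are placeholders; only the enumeration structure matters):

```python
from itertools import product

def brute_force_group_gap(users, attrs, metric):
    """Enumerate every multi-attribute group and return the largest metric
    gap between any two non-empty groups, in the Themis brute-force style."""
    scores = []
    for combo in product(*attrs.values()):           # Cartesian product of values
        key = dict(zip(attrs.keys(), combo))
        members = [u for u in users if all(u[a] == v for a, v in key.items())]
        if members:                                   # skip empty groups
            scores.append(metric(members))
    return max(scores) - min(scores)

# Toy example: the metric is each group's mean "hit" rate.
users = [
    {"gender": "f", "age": "young", "hit": 1.0},
    {"gender": "f", "age": "old", "hit": 0.2},
    {"gender": "m", "age": "young", "hit": 0.6},
]
attrs = {"gender": ["f", "m"], "age": ["young", "old"]}
gap = brute_force_group_gap(users, attrs, lambda g: sum(u["hit"] for u in g) / len(g))
```

The double loop over `product(...)` and `users` is exactly what blows up on industry-scale data, motivating FairRec's search-based alternative.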
Fairness of recommender systems. The fairness of recommender systems is nowadays receiving more and more attention [5, 32]. Recommender systems involve multiple stakeholders, including users, content/product providers, and side stakeholders [1]. Some work has discussed fairness from the user perspective [20, 24]. Yao et al. [46] proposed four group fairness metrics to evaluate collaborative filtering models. They dichotomize the population by gender and study sexism in recommender systems on the MovieLens dataset as well as on synthetic datasets. Li et al. [25] divided users into active and inactive groups according to their behavioral history in an e-commerce recommender system. They found that less active users were treated significantly worse than more active ones. Wu et al. [45] introduced a sensitive attribute predictor to measure the association between news recommendations and gender, and used it to evaluate the fairness of a news recommender system. However, the works mentioned above all use a single attribute to evaluate the fairness of recommender systems, which cannot capture the deeply hidden fairness issues from a multi-attribute perspective (see Table 5 in the Appendix). In addition, they do not propose efficient methods to test the group fairness of DRSs.
Unfairness mitigation. Methods for unfairness mitigation can be divided into three main categories: pre-processing, in-processing and post-processing [8]. Pre-processing methods ensure fairness by eliminating the bias present in the training data. Ryosuke et al. [36] proposed a pre-processing method based on pairwise ordering for weighting pairs of training data, which can improve the fairness of the ranking model. In-processing methods incorporate fairness constraints into the model training process. For example, Yao et al. [46] proposed four unfairness metrics, which can be optimized by adding fairness terms to the learning objective. Wu et al. [45] introduced an unfairness mitigation approach for news recommendation using decomposed adversarial learning, which eliminates the biases brought by sensitive features to ensure users get unbiased recommendations. Post-processing methods adjust the output of the base recommendation model to mitigate unfairness. Li et al. [25] employed a re-ranking algorithm to produce new recommendation lists for users by leveraging the initial recommendation results, thereby meeting the fairness constraint. Compared with the former two categories, post-processing methods do not require retraining the existing recommendation models, saving a significant amount of resources, and are also more practically feasible in real-world systems. In this work, we employ post-processing methods to demonstrate how the results of FairRec can help disadvantaged groups and alleviate fairness concerns across the entire DRS.

CONCLUSION
In this work, we study the problem of multi-attribute group fairness testing of deep learning based recommender systems by presenting FairRec, a systematic fairness testing framework built upon a specifically designed search-based testing algorithm. We answer the three key research questions through quantitative and qualitative experiments on four extensively used recommendation models over four public benchmarks. The experimental results demonstrate that FairRec is effective and efficient in revealing the fairness problems of DRSs from different perspectives, and outperforms the compared baselines by a significant margin. Furthermore, our testing results can provide insight and guidance for mitigating the bias of the tested DRSs with little negative impact on the recommendation performance. Our work may shed new light on the research of building fairer deep recommender systems from a testing point of view and benchmark future testing research in this area. Moreover, FairRec has the potential to serve as a generic testing framework for multi-attribute group fairness testing in other systems.

Figure 2 :
Figure 2: Comparison of different initialization methods.

Figure 3 :
Figure 3: The correlation between different metrics.

Figure 4 :
Figure 4: The effectiveness improvements of FairRec compared to the vanilla DPSO.

Figure 5 :
Figure 5: Comparison of the efficiency of the three methods. The horizontal axis indicates the number of groups after division and the vertical axis represents the total number of users, in units of ten thousand. The contour lines in the graph indicate the time required for the test; intuitively, denser contours indicate a higher slope of time growth. The differently colored areas indicate the distribution of the testing time values, with blue areas indicating smaller values and red indicating more significant time consumption.

Figure 6 :
Figure 6: The impact of the particle ratio and the number of iterations on the efficacy of FairRec.

• We evaluate FairRec with multiple industry-level deep recommendation models on popular large-scale datasets in this field. Our experiments confirm that FairRec is significantly more effective and efficient than the previously available methods, e.g., achieving ∼95% testing accuracy with ∼half to 1/8 of the time.
• We uncover the relation between the fairness performance w.r.t. the different evaluation metrics in the experiments.
• We implement and release FairRec as an open-source toolkit.

Given sensitive attributes S1, S2, ..., Sk, the set of multi-attribute groups they define can be expressed as the Cartesian product S = S1 × S2 × ... × Sk, an element of which describes a specific user group.

Algorithm 1: the overall procedure of FairRec.
Input: test dataset D, sensitive attributes S, maximum number of iterations T.
Output: fairness score, advantaged target groups, disadvantaged target groups.
1: divide the users of D into candidate groups according to S and initialize the particle swarms;
2: while the iteration count is below T, run the double-ended search to update the particle swarms and the InfoBase;
3: derive the advantaged and disadvantaged target groups from the InfoBase;
4: compute the fairness score as the difference between the metric values of the two target groups;
5: return the target groups and the fairness score.
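The search loop of Algorithm 1 can be sketched as a discrete PSO over candidate groups. The sketch below is a heavily simplified, single-swarm, single-ended version for illustration (the fitness function, the move rule and all names are assumptions, not FairRec's actual implementation, which additionally runs double-ended and uses distribution-based initialization):

```python
import random

def discrete_pso(candidates, fitness, n_particles=5, iters=20, seed=0):
    """Search a discrete candidate set for the highest-fitness element.

    candidates: list of hashable group identifiers.
    fitness: maps a group to its unfairness score (higher = worse off).
    """
    rng = random.Random(seed)
    info_base = {}  # memoized scores, shared by all particles (cf. InfoBase)

    def score(g):
        if g not in info_base:
            info_base[g] = fitness(g)
        return info_base[g]

    positions = [rng.choice(candidates) for _ in range(n_particles)]
    best = max(positions, key=score)
    for _ in range(iters):
        for i, pos in enumerate(positions):
            # Move: jump toward the swarm's best or explore a random candidate.
            nxt = best if rng.random() < 0.5 else rng.choice(candidates)
            if score(nxt) >= score(pos):
                positions[i] = nxt
        best = max(positions + [best], key=score)
    return best, info_base[best]

# Toy search space: the "worst-off" group is the one closest to id 73.
groups = list(range(100))
g, s = discrete_pso(groups, fitness=lambda x: -abs(x - 73))
```

Running the same loop twice, once maximizing and once minimizing the metric, yields the advantaged and disadvantaged ends whose difference gives the fairness score in Algorithm 1.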

Table 1 :
Recommendation performance of the models.

Table 2 :
The main results of fairness testing from the perspective of effectiveness and efficiency.

Table 3 :
The fairness testing results on MovieLens and LFM360K within limited time.
RQ2: How efficient is FairRec compared with existing work?

Table 7 :
The fairness testing results on BlackFriday and Amazon within limited time.