Automated Fairness Testing with Representative Sampling

Fairness testing of machine learning models has attracted growing attention due to rising concerns about potential bias and discrimination, as these models continue to permeate end-user applications. However, achieving an accurate and reliable measurement of the fairness performance of machine learning models remains a substantial challenge. Representative sampling plays a pivotal role in ensuring accurate fairness assessments and providing insight into the underlying dynamics of data, unlike biased or random sampling approaches. In this study, we introduce RSFair, an approach that adopts representative sampling to comprehensively evaluate the fairness performance of a trained machine learning model. Our findings on two datasets indicate that RSFair yields more accurate and reliable results, thus improving the efficiency of subsequent search steps and, ultimately, the fairness performance of the model. Using the Orthogonal Matching Pursuit (OMP) and K-Singular Value Decomposition (K-SVD) algorithms for representative sampling, RSFair improves the detection of discriminatory inputs by 76% and the fairness performance by 53% compared to other search-based approaches in the literature.


INTRODUCTION
Fairness has become an increasingly central factor in assessing the performance of Machine Learning (ML) models. As these models continue to grow in accessibility and impact, they are set to become the principal decision-makers across a multitude of domains such as healthcare [19], finance [16], transportation [6], social media and marketing [4], and many more.
ML models are, at their core, software programs designed to make predictions or decisions based on patterns found in data. These models, although more complex than traditional software, are still subject to the same principles of software testing. In fact, testing ML models also requires novel testing strategies (e.g., [32]). Fairness testing is a form of software testing that focuses on evaluating an ML model's behavior with respect to its fairness or bias towards certain features. In the context of fairness, errors can manifest themselves as discriminatory outcomes or biased predictions favoring or disadvantaging specific demographic groups. Fairness testing aims to identify such biases and assess the model's performance across different subgroups to ensure equitable results for all users.
For instance, we can consider the case of a real-world salary dataset comprising data on gender, salary, and other related factors. In a scenario where the dataset exhibits a gender-based salary gap, the ML model would internalize and perpetuate this imbalance in its predictions. This, in turn, fosters a self-perpetuating cycle of unfairness, where future salary predictions continue to mirror and intensify the initial bias. This hypothetical case underscores the urgent need for fairness testing to detect and prevent any unfair tendencies demonstrated by ML models.
Another important example is ChatGPT. Since its release, ChatGPT has gained exponential popularity and found a wide spectrum of applications. In studies on the fairness performance of ChatGPT, Li and Zhang [21] state that ChatGPT still has fairness issues, even though its fairness performance is better than that of smaller models. Large corporations such as Google [27] and Microsoft [25] also show interest and engage in studies related to fairness testing. They develop guides and repositories with the aim of encouraging developers to boost the fairness level of their technological products.
Researchers and practitioners actively search for ways to identify and eliminate discriminatory patterns in order to guarantee equitable outcomes [24]. Nevertheless, it is important to understand that simply addressing discriminatory behavior in extreme cases is not enough. We believe it is equally vital to consider the common cases that arise in real-world contexts, as these frequently represent the majority of situations and are thus of substantial consequence.
To tackle the challenge of assessing the fairness of ML models, we propose representative sampling as a highly effective strategy. By crafting a subset of the main dataset that accurately mirrors its inherent dynamics, representative sampling enables ML models to be tested against a more realistic and comprehensive collection of data points. This methodology ensures that the assessment of fairness is grounded in a comprehensive and detailed understanding of the data, thereby empowering researchers to effectively uncover and address discriminatory behavior.
Our research demonstrates that the utilization of representative sampling, specifically via the Orthogonal Matching Pursuit (OMP) and K-Singular Value Decomposition (K-SVD) algorithms, substantially improves the detection of discriminatory inputs and enhances the fairness performance of ML models. Through empirical analysis on two datasets, we demonstrate that representative sampling outperforms both AEQUITAS [32], the base approach that our study is inspired by, and random sampling from the dataset, by focusing more on the common cases. This implies that the strategic use of representative sampling is a key step forward in fostering fair and equitable automated decision-making systems.

RELATED WORKS
This section discusses the extent of the study and related works on fairness testing, representative sampling, OMP, and K-SVD.

Fairness Testing
Several researchers have directed their efforts towards developing classifiers that mitigate discrimination [11, 13, 15, 20, 35]. These works primarily concentrate on the theoretical aspects of classifier models with the aim of attaining fairness during the classification process. This objective is accomplished either by pre-processing the training data or by modifying existing classifiers to limit discriminatory outcomes.
Galhotra et al. [14] first defined group and causal fairness. According to their definitions, group fairness refers to the concept that the distribution of outcomes across different groups based on certain characteristics should be equivalent or similar. This means that, to ensure fairness, the proportions of individuals receiving a certain outcome (e.g., receiving a loan) should be roughly the same across the defined groups. These groups can be based on single or multiple characteristics such as age, race, or a combination of these. The concept of group fairness strives to minimize group discrimination, which can be measured by the disparity between the highest and lowest outcome fractions across different groups. On the other hand, causal (individual) fairness refers to the principle that when software is given two individuals who are identical in all aspects except for a set of specific characteristics (e.g., race or age), it must produce the same output. The software is considered fair with respect to these characteristics if, regardless of differences in these characteristics, the same output is produced for all individuals with identical profiles in all other respects. This concept emphasizes the need for software outputs to be independent of certain characteristics and aims to eliminate causal discrimination, which is quantified by the fraction of inputs for which the software produces different outcomes based on these characteristics.
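To make these two definitions concrete, the following minimal sketch estimates both scores for a trained classifier; the helper names are hypothetical, and a scikit-learn-style `predict` is assumed:

```python
import numpy as np

def group_discrimination(model, X, group_labels, favorable=1):
    """Group fairness score: disparity between the highest and lowest
    fractions of favorable outcomes across the defined groups."""
    fractions = [np.mean(model.predict(X[group_labels == g]) == favorable)
                 for g in np.unique(group_labels)]
    return max(fractions) - min(fractions)

def causal_discrimination(model, X, s_idx, s_values):
    """Causal fairness score: fraction of inputs whose prediction changes
    when only the sensitive attribute at index s_idx is altered."""
    changed = 0
    for x in X:
        outputs = set()
        for v in s_values:
            x_mod = x.copy()
            x_mod[s_idx] = v  # vary only the sensitive attribute
            outputs.add(model.predict(x_mod.reshape(1, -1))[0])
        changed += len(outputs) > 1
    return changed / len(X)
```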
To detect individual discrimination, Galhotra et al. pioneered the development of the THEMIS fairness testing tool. This innovative tool was designed with the capability to address both group and causal fairness detection issues, thereby offering a comprehensive solution to discrimination detection. However, it is important to note that their strategy, both in the case of group and causal discrimination detection, heavily relies on random test case generation for identifying discriminatory instances. This approach of using random data point generation, while practical and straightforward, may not always yield the most comprehensive or representative results. Despite its simplicity, it can overlook patterns or instances of discrimination that might be more evident with a more directed or representative sampling approach. Nonetheless, the work of Galhotra et al. in creating THEMIS has significantly advanced the field of fairness testing and has provided a solid foundation for further developments in discrimination detection.
FairTest [29] utilizes manually crafted tests to assess discrimination scores across four different categories. Their approach revolves around using the indirect correlation patterns that exist between attributes, such as the relationship between salary and age, in order to generate test cases.
FairML [1] employs an iterative methodology that relies on an orthogonal projection of input attributes to enhance the interpretability of black-box predictive models. By utilizing this iterative technique, it becomes possible to quantify the relative reliance of a black-box model on its input attributes. This relative dependence can then be used to evaluate the fairness or level of discrimination that is present in the model by assessing the relative significance of the input attributes.
AEQUITAS [32] is a tool created for detecting and eliminating individual biases, thus advancing the fairness performance of ML models. The tool intelligently identifies potential bias in ML applications, paving the way for improved fairness and application utility. Its automated processes negate the possibility of human error in bias detection, ensuring a superior level of accuracy. Moreover, AEQUITAS continuously strives to enhance the fairness performance of the ML models, serving as a safeguard against any unintentional or intentional introduction of bias into the algorithms. Since it offers an elegant solution to the fairness problem and provides a standalone tool, we decided to use it as our base study. We explain the detailed approach of AEQUITAS in the 'Base Study' section.
AEQUITAS is still regarded as one of the state-of-the-art solutions in the fairness testing field, and hence, various studies propose different testing approaches that are built upon AEQUITAS. For instance, Aggarwal et al. [2] propose a black-box method called Symbolic Generation (SG), which emphasizes symbolic execution and local explainability to generate effective test cases. SG demonstrates better performance than AEQUITAS when applied to Decision Tree and Random Forest models. However, AEQUITAS outperforms SG when dealing with MLP models. Aggarwal et al. [2] do not present any results regarding retraining effectiveness, a key metric for evaluating the usefulness of discriminative inputs identified by these methods.
Another study by Zhang et al. [36] introduces a white-box method named Adversarial Discrimination Finder (ADF), which employs gradient computation and clustering. ADF is also compared with both AEQUITAS and SG, offering insights into the retraining effectiveness of both approaches. While ADF performs quite closely to AEQUITAS, SG fails to consistently surpass AEQUITAS's performance. Moreover, Zhang et al. [36] demonstrate that SG is ineffective in terms of running time, whereas ADF succeeds in reducing the time required for test generation. In summary, although ADF proves to be a good alternative in terms of running time, it still could not significantly outperform AEQUITAS's retraining performance.

Representative Sampling
Representative sampling is a statistical method used in research to ensure that the chosen sample accurately reflects the population that it is intended to represent. This approach is crucial for enhancing the generalizability and validity of the findings, as it helps to minimize selection bias and facilitates accurate extrapolation of results to the broader population. Random sampling and representative sampling, while similar, differ fundamentally in their underlying goals: random sampling seeks to achieve a fair chance of selection for each member of a population, while representative sampling aims to accurately mirror the population's key characteristics. To attain a representative sample, multiple methods can be used, such as stratified sampling, cluster sampling, or quota sampling [22], each with its unique approach to capturing the diversity of the population. Recent research by Hoeven et al. [33] highlights that the choice of sampling strategy heavily depends on the study's objectives and the nature of the population, thus reinforcing the need for careful consideration when designing the sampling procedure.
Empirical studies conducted by Xu et al. [34] report the effectiveness of the representative sampling approach in the field of text retrieval. Particularly at the initial stages of active learning, their proposed methodology significantly outperforms both Support Vector Machine (SVM) active learning and random sampling. This superior performance implies that representative sampling can deliver highly efficient learning results using fewer labeled documents, potentially minimizing human input in text classification tasks.
In their research, Grafström and Schelin [17] provide a formalized understanding of representative sampling, introducing novel techniques for generating a condensed sample set that retains the essential features of the original data.They further contribute by introducing a novel distance function and a variance estimator, enriching the tools available for representative sampling.

OMP and K-SVD
OMP is a greedy algorithm that has been extensively used for solving sparse approximation problems, particularly in signal processing [30]. It iteratively selects the "best matching" projections of the signal onto an overcomplete dictionary [26]. The application of OMP has extended beyond signal processing into fields such as machine learning and computer vision. OMP was successfully applied to solve the feature selection problem, demonstrating its potential for contributing to higher-dimensional data analysis tasks [28, 31].
K-SVD, on the other hand, is an algorithm used for dictionary learning. It aims to achieve a sparse representation of a given set of signals, thereby providing an effective means to perform tasks such as signal and image denoising [12], and image compression [3]. It has found noteworthy applications in image processing, where it has demonstrated strong performance in tasks such as face recognition [37], and color image restoration [23].
The integration of OMP and K-SVD has been well-studied in fields such as signal processing and image analysis [5, 8]. For example, Bryt et al. [9] demonstrate that combining these two methods significantly improves the efficiency and accuracy of sparse signal representation, thereby providing a robust tool for image and signal reconstruction tasks. Another study by Rubinstein et al. [10] further validates these findings and extends the application of the OMP+K-SVD approach to image-denoising tasks.
In the domain of fairness testing, the application of OMP and K-SVD is still relatively novel, but previous works in other fields have demonstrated the potential of these techniques to generate representative samples. Specifically, the use of OMP for feature selection and K-SVD for dictionary learning can help construct a representative dictionary of data instances, thereby ensuring comprehensive and fair testing of ML models. Our work builds on these findings and proposes an automated fairness testing framework that integrates OMP and K-SVD to ensure representative sampling.

BASE STUDY
We have been inspired by a study [32] that proposes a promising solution called AEQUITAS for fairness testing. This tool, characterized by its automation and directed approach, primarily concentrates on the identification and elimination of individual bias. The core functionality of AEQUITAS revolves around two pivotal attributes: identifying instances of discriminatory behavior, and enhancing the fairness performance of the ML model.
AEQUITAS specifically identifies potential points of discrimination in ML applications, providing an important lead towards enhancing their fairness, and thus, their utility in various fields. By making the entire process automated, it efficiently removes any human error in bias detection, bringing a new level of accuracy to the overall system. Simultaneously, it also aims at enhancing the ML model's performance in fairness, acting as a check and balance system for the bias that may creep into the algorithms, either intentionally or unintentionally.

Figure 1: Discrimination Detection Mechanism of AEQUITAS
As indicated in Figure 1, the detection mechanism is predominantly concerned with modifying the sensitive feature and subsequently examining the resulting output of the ML model. Due to the inherent property of robustness in ML models, a minor alteration in input should not cause a significant disparity in output. However, if a substantial change is observed, it signals that the model does not maintain its fairness for that specific data point. In simpler terms, if the model exhibits varying behavior for two data points that are identical except for the sensitive feature, this suggests the occurrence of discriminative behavior at that particular point.
The search procedure is divided into two primary stages: a global search and a local search. The global search entails generating random data points within the bounds of each feature. Following the creation of each data point, the mechanism verifies whether the model is exhibiting unfair behavior at that specific point. If it identifies such points, they are stored for subsequent use in the local search stage. Once the global search is completed, the local search phase commences. The fundamental objective of the local search is to probe the surroundings of each discriminative point identified during the global search, with the aim of discovering additional discriminative points.
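The following minimal NumPy sketch illustrates this two-stage search under simplifying assumptions (uniform perturbations, no direction probabilities); `model` is assumed to expose a scikit-learn-style `predict`, and all helper names are hypothetical:

```python
import numpy as np

def is_discriminatory(model, x, s_idx, s_values):
    """Per-point causal check: does the prediction change when only
    the sensitive attribute at index s_idx is varied?"""
    outputs = set()
    for v in s_values:
        x_mod = x.copy()
        x_mod[s_idx] = v
        outputs.add(model.predict(x_mod.reshape(1, -1))[0])
    return len(outputs) > 1

def global_search(model, bounds, s_idx, s_values, n_samples=1000, seed=0):
    """Generate random points within per-feature bounds and keep the
    discriminatory ones as seeds for the local search."""
    rng = np.random.default_rng(seed)
    points = [np.array([rng.uniform(lo, hi) for lo, hi in bounds])
              for _ in range(n_samples)]
    return [x for x in points if is_discriminatory(model, x, s_idx, s_values)]

def local_search(model, seeds, bounds, s_idx, s_values, step=1.0, n_pert=100, seed=0):
    """Perturb each global seed to probe its surroundings for
    additional discriminatory points."""
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    found = []
    for x in seeds:
        for _ in range(n_pert):
            x_new = np.clip(x + rng.choice([-step, 0.0, step], size=x.shape), lo, hi)
            if is_discriminatory(model, x_new, s_idx, s_values):
                found.append(x_new)
    return found
```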
The enhancement of fairness performance is heavily dependent on both the quantity and quality of discriminative data points uncovered during the search stage. AEQUITAS operates on an iterative process that progressively augments the training set with the data derived from the discovered and corrected discriminative points. After adding these points, the model undergoes retraining, followed by another assessment of its fairness performance. If there is a decrease in fairness at any stage of this process, as opposed to an increase, the procedure is stopped. The system then returns the retrained model, which has now been updated with the enhanced fairness parameters.
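A compressed sketch of this retraining loop follows; the callables for model construction, search, and label correction are hypothetical placeholders, not the authors' exact implementation:

```python
import numpy as np

def retrain_loop(make_model, X, y, find_discriminatory, correct_labels, disc_rate):
    """Augment the training set with corrected discriminatory points,
    retrain, and stop as soon as the discrimination rate stops improving."""
    model = make_model().fit(X, y)
    best_rate = disc_rate(model)
    while True:
        X_disc = np.array(find_discriminatory(model))   # search phase output
        if len(X_disc) == 0:
            return model
        y_fixed = correct_labels(model, X_disc)          # corrected (fair) outputs
        X, y = np.vstack([X, X_disc]), np.concatenate([y, y_fixed])
        candidate = make_model().fit(X, y)
        rate = disc_rate(candidate)
        if rate >= best_rate:       # fairness decreased or stalled: stop
            return model            # return the best retrained model so far
        model, best_rate = candidate, rate
```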
AEQUITAS uses three different approaches and compares them with each other. Random AEQUITAS is the purely random form of the approach. Semi-directed AEQUITAS updates and uses the perturbation direction probability. Fully-directed AEQUITAS uses weighted probabilities for both the perturbation direction and the parameter to perturb.
AEQUITAS is assessed on only one dataset, the Adult (Census) dataset, but the authors analysed the performance of the approach on six different ML models, namely Fair SVM, SVM, MLPC, random forest, decision tree, and ensemble.
The authors evaluated the performance of AEQUITAS in terms of effectiveness, efficiency, and usefulness, and concluded the following:
• Effectiveness: The fully-directed approach of AEQUITAS is significantly more effective, generating up to 20.4 times more discriminatory inputs than a purely random approach. Moreover, this fully-directed methodology improves performance by up to 56.7% compared to the semi-directed AEQUITAS approach. The latter, however, outperforms random AEQUITAS by up to 64.9%.
• Efficiency: In terms of speed, fully-directed AEQUITAS surpasses the state-of-the-art methods by 83.27%. This improvement is most pronounced in the context of Multi-Layer Perceptron models, where it reaches a maximum enhancement of 96.62%.
• Usefulness: On average, utilizing AEQUITAS for retraining reduces the discrimination percentage by 43.2%, and in optimal conditions, this reduction can reach up to 94.36%.
Even though AEQUITAS suggests an inspiring solution for the discrimination detection and retraining process, its global search approach is a major drawback to its performance. Randomized and non-realistic data points are used for discrimination detection, which jeopardizes the usefulness of the approach.

METHODS
We propose a new method, RSFair, which essentially follows the same steps as AEQUITAS but improves the global search step, which in turn improves the local search and retraining steps. Instead of using randomly generated data points, we apply representative sampling with two methods that have already proven their usefulness in dictionary learning: OMP and K-SVD. Even though they are two separate methods, we implement them back to back to create the best representative subset from the main training set. Below, we explain the details of our approach in conjunction with OMP and K-SVD.

OMP
Orthogonal Matching Pursuit (OMP) is an algorithm used for sparse signal recovery in the realm of compressed sensing, particularly in signal processing and machine learning. It works by constructing an approximation to a signal using a small number of elements from a fixed, pre-determined dictionary.
As Tropp et al. stated [30], the algorithm proceeds iteratively, identifying at each step the dictionary element that best correlates with the current residual (the portion of the signal not yet approximated).This element is then added to the set of selected elements, and the approximation and residual are updated for the next iteration.
The original process is as follows (a minimal sketch in Python follows the list):
(1) Initialization: At the beginning, no elements have been selected. The initial approximation is zero, and the initial residual is the complete signal.
(2) Element Selection: In each iteration, the algorithm chooses the dictionary element that has the maximum absolute inner product with the current residual.
(3) Update: The selected dictionary element is added to the set of previously chosen elements. The approximation to the signal is then updated by computing the projection of the signal onto the space spanned by the chosen elements. This is equivalent to solving a least squares problem on the selected elements. The residual is then updated by subtracting the new approximation from the signal.
(4) Termination: The process continues until a stopping criterion is met, which could be a pre-specified number of elements, a threshold on the norm of the residual, or a combination of these.
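Below is a minimal NumPy sketch of these four steps; it illustrates the textbook algorithm [30], not the exact implementation in our tool:

```python
import numpy as np

def omp(D, y, n_atoms, tol=1e-6):
    """Minimal Orthogonal Matching Pursuit over a dictionary D whose
    columns are atoms; returns a sparse coefficient vector for y."""
    residual = y.copy()                          # (1) initial residual = signal
    selected = []
    coeffs = np.zeros(0)
    for _ in range(n_atoms):
        correlations = np.abs(D.T @ residual)    # (2) correlate atoms with residual
        correlations[selected] = -np.inf         # never reselect an atom
        selected.append(int(np.argmax(correlations)))
        # (3) least-squares projection onto the span of the selected atoms
        coeffs, *_ = np.linalg.lstsq(D[:, selected], y, rcond=None)
        residual = y - D[:, selected] @ coeffs
        if np.linalg.norm(residual) < tol:       # (4) termination check
            break
    x = np.zeros(D.shape[1])
    x[selected] = coeffs
    return x
```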
In our case, we select only one dictionary element instead of many, in order to pick the dictionary atom most correlated with the record (instance) under inspection. Consequently, we do not repeat the second and third steps of the traditional OMP algorithm described above.
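In code, this single-atom specialization reduces to one correlation step over the dictionary (a sketch; `D` is assumed to hold unit-norm atoms as columns):

```python
def best_atom(D, y):
    """Index of the dictionary atom most correlated with record y,
    i.e., a single OMP selection step with no least-squares refit."""
    return int(np.argmax(np.abs(D.T @ y)))
```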
OMP has been recognized as an efficient method for solving sparse approximation problems, particularly due to its computational simplicity and its robustness in the presence of noise. However, one should note that it requires the dictionary to exhibit a certain level of incoherence (i.e., low correlation between atoms), and its performance may degrade if this property is not satisfied.

K-SVD
K-Singular Value Decomposition (K-SVD) is an algorithmic method for dictionary learning, utilized to find a dictionary that leads to a sparse representation of a signal. In the signal processing and machine learning fields, K-SVD plays a vital role in addressing tasks that require a sparse and compact representation of data.
The K-SVD algorithm created by Aharon et al. [3] iteratively alternates between sparse coding and dictionary update stages. It starts with an initial dictionary and then proceeds in the following manner:
(1) Sparse Coding: In this step, with a fixed dictionary, the algorithm computes the sparse representation of the data by solving $X = \arg\min_X \|Y - DX\|_F^2$ subject to $\|x_i\|_0 \le T_0$ for every column $x_i$ of $X$, where $D$ is the dictionary, $Y$ is the data, and $X$ is the sparse coefficient matrix. This problem can be solved using different sparse coding methods, but in our case, we used OMP.
(2) Dictionary Update: In this step, the dictionary atoms are updated one by one, while keeping the other atoms fixed.
For each atom, the problem is formulated as a rank-1 approximation problem and solved using a singular value decomposition (SVD).
The algorithm repeats these two stages until a stopping criterion is met, in our case reaching a maximum number of iterations.
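A compact NumPy sketch of this alternation follows, reusing the `omp` routine from the earlier sketch; it is again illustrative, with randomly initialized unit-norm atoms as an assumption:

```python
import numpy as np

def ksvd(Y, n_atoms, sparsity, n_iter=3, seed=0):
    """Minimal K-SVD: alternate OMP sparse coding with SVD-based
    rank-1 updates of each dictionary atom (after Aharon et al. [3])."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)                # unit-norm initial atoms
    X = np.zeros((n_atoms, Y.shape[1]))
    for _ in range(n_iter):                       # stopping criterion: iteration cap
        for j in range(Y.shape[1]):               # (1) sparse coding via OMP
            X[:, j] = omp(D, Y[:, j], sparsity)
        for k in range(n_atoms):                  # (2) dictionary update
            users = np.nonzero(X[k, :])[0]        # signals that use atom k
            if users.size == 0:
                continue
            # residual without atom k's contribution, restricted to its users
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
            U, S, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]                     # best rank-1 approximation
            X[k, users] = S[0] * Vt[0, :]
    return D, X
```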
K-SVD is pivotal due to its effective performance in obtaining a compact and accurate representation of data, particularly when the data is high-dimensional. Moreover, the sparse representations obtained using K-SVD can capture the structure of data effectively.

RSFair
The principal concept that forms the basis for the RSFair technique involves the construction of a representative subset of a given dataset that predominantly concentrates on its core dynamics, rather than the irregularities or outliers. In more explicit terms, it is the intentional selection of a subset from a larger dataset with the specific aim of primarily capturing its inherent and most frequent patterns, as opposed to its anomalous instances that diverge from these regular patterns.
This approach is grounded in the utilization of authentic and conventional data points. By leveraging such data points, the RSFair technique can not only recognize but also correct any recurring patterns of discriminatory behaviour exhibited by the ML model. It is essential to note here that discriminatory behaviour is deemed to be any pattern that leads to unfair outcomes in the model's predictions, which can potentially affect its performance and perceived fairness.
We believe that the approach proposed in RSFair is decidedly more effective than concentrating on correcting the model's behaviour at outlier points, which are less frequent and often less impactful on the model's overall fairness performance. Therefore, our particular emphasis on the common and regular data points, as opposed to the outliers, is the cornerstone of the RSFair technique's superiority. This forms the basis of why RSFair consistently manages to outperform other alternative approaches in our experiments. By centering on the model's frequent discriminatory patterns, RSFair ensures a more comprehensive and effective enhancement of the model's fairness, surpassing the results attained by other methods that might focus more on the outliers within the dataset.

Table 1: Notations used in the RSFair approach (base dataset; dictionary size; dictionary; K-SVD iteration count, set to 3 in this project).
The core algorithm of RSFair is presented in Algorithm 1, and the explanation of the notations can be found in Table 1.
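Since Algorithm 1 is not reproduced here, the following sketch shows one plausible reading of the representative-sampling step: learn a dictionary with K-SVD (OMP as its sparse coder), then map each learned atom back to its nearest real training record. The helper names and the nearest-record mapping are assumptions for illustration, not the authors' exact procedure:

```python
import numpy as np

def rsfair_sample(X_train, dict_size, sparsity=1, n_ksvd_iter=3):
    """Build a representative subset: learn K-SVD dictionary atoms, then
    replace each atom by the closest genuine record from the training set."""
    Y = X_train.T                                 # columns are records
    D, _ = ksvd(Y, n_atoms=dict_size, sparsity=sparsity, n_iter=n_ksvd_iter)
    chosen = set()
    for k in range(D.shape[1]):
        # atoms are unit-norm, so compare directions rather than magnitudes
        sims = (X_train @ D[:, k]) / (np.linalg.norm(X_train, axis=1) + 1e-12)
        chosen.add(int(np.argmax(np.abs(sims))))
    return X_train[sorted(chosen)]
```

Under these assumptions, the returned records form the subset fed to the global search phase in place of AEQUITAS's randomly generated points.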

Data Sets and Models
Two different data sets (Adult (Census), Credit) are used for the analysis, which have also been used in the THEMIS and AEQUITAS experiments. Those data sets have already been pre-processed and cleaned in those prior studies, and thus, we have used their versions to reproduce and discuss our findings.
We trained three models (Decision Tree, MLP, and Random Forest) on those data sets to observe the e ect of the representative sampling against AEQUITAS and random sampling.

Experiment Goals
We had two research questions that we wanted to answer at the end of our experiments:
• RQ1: How effective is RSFair in finding discriminatory inputs?
• RQ2: How useful are the generated test inputs to improve the fairness of the model?
For the first research question, we implement OMP and K-SVD instead of the existing random sampling method and compare the performance with the AEQUITAS and random sampling approaches.
For the second research question, we use the discriminative points that are found in the search phase and improve the fairness performance of the initial model.This process is repeated for AEQUITAS, random sampling, and RSFair to compare the results.

Comparison of Sampling Methods
Before we delve into a comparative analysis of the different methods, it is necessary first to examine certain metrics drawn from the Adult dataset and the different sampling methods. Figure 2 displays the essential statistics of the original dataset and the sampled subsets obtained through the various approaches. For the purpose of illustrating the impact of sampling methods, we have chosen four features, while the remaining features can be accessed in our GitHub repository.
Our base study, AEQUITAS [32], employs a method of generating random data points.The only known quantities for each feature in this method are the boundary values.From this predetermined range or interval, a random value is selected for each feature, culminating in the creation of a new, random data point that is then used in the global search process.One critical aspect to highlight about this process is that every potential value within the established boundaries has an equal chance of being chosen.As a result, the distribution of values for each feature is uniform.This uniformity, while it is theoretically fair in terms of equal opportunity for all values, does not necessarily facilitate the search process.The primary issue here is that real-world data often does not follow a uniform distribution.Instead, certain values or ranges of values may occur more frequently and hence, can be considered more 'representative' or 'typical'.By generating random data points without considering the actual distribution of the original data, AEQUITAS's approach may miss these typical points, which may be more useful for the global search process and for improving the fairness of the ML models.
Random sampling stands apart from the generation of random data points, primarily due to its reliance on actual data points that are randomly selected from the original dataset. However, the subset created by random sampling does not necessarily echo the same underlying dynamics as the original dataset. This is because each random sample might contain different proportions of the data points, and it could by chance overrepresent or underrepresent certain groups or trends in the data. In essence, although random sampling can produce a subset that is numerically close to the original dataset, it does not guarantee a representative subset that mirrors the main trends or dynamics of the original data. Moreover, it requires a substantial number of trials to converge towards the original dataset, making it a time-consuming and potentially inefficient process. This limitation is particularly crucial in the context of standalone fairness tools in machine learning. These tools require representative subsets that reflect the key dynamics of the data and do not have the luxury of performing numerous sampling trials. Thus, while random sampling does leverage real data points from the original dataset, its limitations in providing representative subsets and its requirement for numerous trials can make it less suitable for use in standalone fairness tools.
Representative sampling, by contrast, is a method that concentrates on capturing the broader essence or the core characteristics of the original dataset. It aims to create a subset that represents the primary dynamics of the data, instead of merely mirroring its numerical properties. This process preserves core metrics from the original dataset while effectively eliminating most of the uncommon data points or outliers. As depicted in Figure 2, the result of this methodical selection approach is a noticeably lower variance in the sampled subset compared to what is typically observed in random sampling or within the complete dataset. This reduced variance is a direct consequence of minimizing outlier influences and focusing on common or typical data points that capture the main trends in the data. Therefore, representative sampling leads to the creation of a more stable and representative subset of the data. This stability and representativeness result in more consistent outcomes, reducing the likelihood of unexpected results due to outlier influences. This property of representative sampling is of significant value in the context of fairness testing in ML models. In particular, for standalone fairness tools, which need representative and stable subsets for effective operation, our approach provides a robust and reliable methodology. By focusing on the core dynamics of the dataset, representative sampling enables these tools to address discriminatory behaviors more effectively, leading to the creation of more fair and trustworthy ML models.

Comparison of Approaches
To obtain more reliable results, it was necessary for us to carry out the experiments multiple times until we reached a point of convergence in the output. This iterative process is inherently time-consuming. To address this issue, we borrowed an approach from AEQUITAS. According to AEQUITAS, due to the law of large numbers, a reliable convergence point, which stabilizes the result of the search-based process, can be achieved after approximately 400 iterations. As can be seen from Figure 3, our findings have mirrored this observation, as we have also found a convergence point of around 400 iterations. Therefore, to ensure the robustness and consistency of our results, we used 400 iterations as the standard for all our experiments.
As highlighted by Udeshi et al. [32], AEQUITAS's fully-directed strategy considerably outperforms a purely random strategy, demonstrating an impressive performance enhancement by a factor of 20.4 in the generation of discriminatory inputs. This empirical finding emphasizes the significant advantages of employing a targeted approach as opposed to a random one in this particular scenario. Moreover, AEQUITAS's fully-directed method exhibits enhanced performance by as much as 56.7% in comparison to AEQUITAS's semi-directed approach. While the semi-directed approach may not match the performance of the fully-directed strategy, it still proves superior to the random approach by achieving up to 64.9% better performance. This data effectively underlines the merits of using a directed approach, even if semi-directed, in contrast to a random one. Consequently, this performance gradient clearly marks AEQUITAS's random strategy as the least potent approach, underscoring the benefits of incorporating more intricate and targeted strategies for identifying discriminatory inputs.
In the context of our research, we drew comparisons between our method and AEQUITAS's most effective version, which is the fully-directed approach. By referencing the statistics mentioned above, we are not only able to benchmark our approach against the purely random strategy, but also compare our results with the other versions of AEQUITAS. This enables us to have a broader understanding of our method's relative performance and its position in the landscape of discriminatory input identification techniques.
Beyond AEQUITAS's approach of generating random data points, another potential method worth considering is that of random sampling from the existing dataset.This proposition seems intuitively plausible for creating a subset for the global search phase because it presents the original, real-world data.In the light of this seemingly logical approach, we have decided to conduct a comparative analysis between random sampling and RSFair.
However, it is important to highlight some fundamental issues associated with the random sampling approach. One of the principal challenges is its inherent instability. This method can yield entirely different subsets from the principal dataset, especially when the size of the subset is relatively small. This can lead to a lack of consistency in the results, which can be problematic if the repetition of the experiment is not desired or feasible. Another potential downside of random sampling is that it can inadvertently include outliers in the selected subset. These outliers, though part of the overall dataset, are not particularly beneficial for fairness testing. This is because outliers typically do not reflect the common patterns or core dynamics of the data and can introduce noise or skew the results. Fairness testing is generally more effective and meaningful when it focuses on the common data points, which reflect the typical behavior or attributes of the larger population.
In other words, random sampling can produce subsets that, due to their smaller size and potential inclusion of outliers, might not be representative enough of the full dataset. Therefore, while the method seems initially attractive for its simplicity and direct use of real-world data, it may not always provide the most reliable or useful subsets for the task of improving the fairness performance of ML models.

RQ1: How effective is RSFair in finding discriminatory inputs?
In order to derive the specified metric, we utilized three tools: AEQUITAS, random sampling, and RSFair. These were assigned the task of producing roughly 1000 separate data points. The key objective of this task was to identify the extent of discriminatory inputs within these data points. This computational exercise was not a singular occurrence but was instead repeated a notable 400 times. This repeated process was implemented for every unique combination of dataset and classifier, for each strategy used in our study. After finishing this exhaustive computation, the results were averaged over multiple iterations for reporting and comparison purposes.
Table 2 shows the performance of all methods in the identification of discriminatory points during the global search phase. The columns indicate discriminatory input detection rates in percent. These rates are calculated by dividing the total number of discriminatory inputs detected by each approach (AEQUITAS, random sampling, RSFair) by the total number of generated inputs. For instance, the first row of Table 2 reveals that 0.12% of the points created by the global search step of AEQUITAS are discriminatory points, while that number increases to 2.77% with random sampling and 3.06% with RSFair. Generally speaking, it can be reasonably concluded that the performance of the global search significantly benefits from the sampling approaches. The generation of a more realistic dataset further empowers us to deliver a more accurate measure of fairness performance.
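Stated as a formula, the rate reported in each column of Table 2 is

$\text{detection rate} = \dfrac{\#\,\text{discriminatory inputs detected}}{\#\,\text{generated inputs}} \times 100\%,$

so, for example, the 3.06% RSFair figure corresponds to roughly 30 discriminatory inputs among the approximately 1000 generated points.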
An additional discovery revolves around the Decision Tree model trained on the Credit dataset. In this particular instance, AEQUITAS surpasses RSFair in terms of global search performance; however, RSFair successfully outperforms AEQUITAS during the local search phase. This indicates that, even when representative sampling uncovers fewer discriminatory points in the global search phase, the points it identifies tend to be surrounded by additional discriminatory points. In other words, they signify a larger region characterized by discrimination. This observation underscores the critical role that representative sampling can play in the detection and delineation of discriminatory regions within the dataset.
Although random sampling from the dataset tends to yield superior results compared to AEQUITAS, largely due to its use of original data, it nonetheless falls short when compared to the RSFair approach. To provide some perspective on their global search performances: while random sampling outperforms AEQUITAS by a margin of 22%, the performance improvement offered by RSFair is considerably greater. In fact, RSFair surpasses the random sampling technique by a substantial 54% and outperforms AEQUITAS by a remarkable 89%. When examining the performance comparison during the local search phase, it is evident from Table 3 that RSFair consistently outperforms AEQUITAS and random sampling across all tested combinations. The disparity in their performance during the global search phase, combined with the quality of the discriminatory points identified, results in a substantial performance gap between these methodologies. This significant difference further emphasizes the effectiveness of RSFair over the alternatives in this context, suggesting the potential superiority of RSFair in identifying discriminatory points.
According to the results, we can say that random sampling performs 51% better than AEQUITAS, while RSFair performs 17% better than random sampling and 76% better than AEQUITAS. In summary, it is apparent that any sampling technique, in this case random sampling, offers substantial performance enhancements over AEQUITAS. However, it is RSFair that truly stands out, delivering the most effective results in terms of detecting discriminatory inputs in ML models.

RQ2: How useful are the generated test inputs to improve the fairness of the model?

To answer our second research question, we generated approximately 1000 discriminatory points with the three methods. We then employed the revised versions of these points to retrain the principal model, which allowed us to monitor alterations in fairness performance. The retraining algorithm is also adopted from AEQUITAS, but with a significant modification in the calculation of fairness performance: we incorporate RSFair's representative sampling instead of AEQUITAS's random data point generation for performance assessment, to gain a more accurate understanding of the model's fairness when applied to a real-world input set.
Table 4 presents a detailed comparison of the retraining effectiveness of all three methods in reducing the discriminatory behavior of ML models on two datasets.
The effectiveness metric reported in Table 4 is computed as follows. For the first row in the table, a decision tree model is trained on the Adult dataset and exhibits a discrimination rate of 3.5%. We then apply AEQUITAS to identify discriminatory points of the model, correct the outputs of these discriminatory points, and retrain the model with the fixed points. We again measure the discrimination rate of the model, which is now 2.9%. This change in rate represents a notable improvement of 17.4%. The same calculation is applied for reporting the effectiveness of the other models before retraining, and after applying AEQUITAS, random sampling, and RSFair, respectively.
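Written out, the metric is

$\text{improvement} = \dfrac{d_{\text{before}} - d_{\text{after}}}{d_{\text{before}}} \times 100\%,$

and plugging in the example above gives $(3.5 - 2.9)/3.5 \approx 17.1\%$; the reported 17.4% presumably reflects the unrounded underlying rates.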
The findings presented in Table 4 provide several interesting insights. Firstly, they highlight that all approaches can effectively reduce the discriminatory percentage of the tested models, thereby improving their fairness. However, the degree of improvement varies depending on the approach used and the specific model-dataset pair. Furthermore, the table reveals a consistent pattern, the same pattern that we observed earlier: random sampling consistently outperforms AEQUITAS, while RSFair outperforms both approaches in terms of percentage improvement across all the models and datasets. This observation supports our research conclusion that the use of representative sampling in RSFair leads to better outcomes in terms of detecting and mitigating discriminatory behavior.
A thorough evaluation of the results reveals varying degrees of fairness improvement achieved by the three distinct methods. The AEQUITAS approach improves fairness by an average of 41%, while the random sampling technique achieves slightly better results, enhancing fairness by an average of 49%. However, the RSFair approach emerges as the most effective method, increasing fairness by 62% on average. To break down these results further, random sampling improves upon AEQUITAS's performance by a margin of 21%. However, RSFair demonstrates an even stronger performance, surpassing the random sampling method by 26% and outperforming AEQUITAS by a substantial 53%. The data also indicate variability in the degree of improvement brought about by the different approaches on different models. For instance, the improvement in fairness ranges from 17% to 82% with AEQUITAS, from 23% to 86% with random sampling, and from 46% to 90% with RSFair. This suggests that the effectiveness of these approaches may depend not only on the choice of method, but also on the inherent characteristics of the specific model and dataset.
In summary, our findings reinforce the importance of representative sampling in fairness testing. They also validate the utility of RSFair as a potent tool for improving the fairness performance of ML models, highlighting its superior performance over AEQUITAS and random sampling across a range of scenarios.

THREATS TO VALIDITY
The efficacy of RSFair, a technique for assessing fairness in ML models, relies heavily on several key factors:
Accessibility to training data and the model: RSFair's functionality hinges on its ability to generate a representative subset of the original training data. Hence, it necessitates access to this data. Moreover, RSFair requires access to the training process and the model itself to enable detection and retraining efforts.
Nature of input data: The current version of RSFair is designed to work with continuous real-valued inputs, a reflection of the structured data used in this study. Nevertheless, it is worth noting that RSFair could be adapted to accommodate categorical data with relative ease.
The volume of input data: In our research, we have demonstrated that RSFair can function with a limited quantity of data. However, our experiments reveal that it is advantageous to have a larger dataset at one's disposal, both to understand the data set more thoroughly and to create a more comprehensive representative subset from the primary dataset.
Accuracy of the ML model: Although it is closely connected to the input size, it is important to highlight the impact of the ML model's accuracy on RSFair's performance. As the accuracy of the model improves, RSFair's performance improves significantly compared to other methods. This is because, as the model becomes more accurate, it starts producing more realistic results, as opposed to random outputs. This realistic output generation forms the core foundation of RSFair.
In essence, while RSFair exhibits robust capabilities in fairness assessment, it is influenced by these factors. Acknowledging these threats to the validity of our findings can guide future enhancements and applications of RSFair or similar tools.

CONCLUSION
In conclusion, this research highlights the crucial role of fairness testing in ML models and underlines the importance of representative sampling in achieving more equitable outcomes. The transition from random to representative sampling improves the accuracy of fairness measurement and the ability to address discriminatory tendencies exhibited by these models.
In the course of this study, we made use of the Orthogonal Matching Pursuit (OMP) and K-SVD algorithms to facilitate representative sampling. By leveraging these techniques, we generated a representative subset of the main dataset, thereby ensuring that the global search phase of the prior tool, AEQUITAS, is grounded in more realistic and comprehensive data points.
Our observations reveal that employing representative sampling in AEQUITAS's global search phase outperforms both the original AEQUITAS and the random sampling approach. Upon incorporating representative sampling, we have noted a significant enhancement in the detection of discriminatory inputs and an improvement in the fairness performance of the evaluated ML model.
By adopting this advanced strategy, both researchers and practitioners would be better equipped to detect and address bias in ML models. This allows for a more profound understanding of the underlying dynamics of the dataset, facilitating the identification of discriminatory patterns that might have otherwise gone unnoticed.
As we look to the future, prospective developments in RSFair could strive to expand upon these results by incorporating other algorithms designed for representative sampling, thereby enhancing the robustness of the fairness testing process. Furthermore, the utilization of a wider variety of datasets and diverse ML models can provide a more comprehensive assessment of the generalizability and scalability of this approach, offering deeper insights into the future of fairness testing and representative sampling.
In addition to the approach suggested in our study, there are other representative sampling techniques that could be considered, such as stratified, cluster, or quota sampling. We also strongly encourage experimentation with techniques utilized in different areas of research, in a similar fashion to how we employed the OMP and K-SVD algorithms in our work. Moreover, we observe that there may be cases where merely identifying and diminishing the discriminatory behavior of a model does not suffice. To develop a more profound understanding of the issue at hand, it will be imperative to ascertain not just where these discriminatory tendencies manifest in the model, but also why they occur. Thus, pinpointing the roots of such biases and studying their origins would undoubtedly constitute a crucial aspect of future research in this field.
To ensure reproducibility and contribute to the advancement of research, we have made our tool openly accessible to the public: https://github.com/umutcankarakas/RSFair. Our code is written in Python and uses the same version as AEQUITAS, which is 2.7.15. All of the experiments were run on a Windows 11 machine with 16 GB RAM and a 4.7 GHz Intel Core i7 CPU.

Table 2 :
Global search performance.

Table 3 :
Local search performance.

Table 4 :
Retraining effectiveness on reducing discriminative behavior.