On the relation of causality- versus correlation-based feature selection on model fairness

As machine learning models are used increasingly in the educational domain, ensuring that they are fair and do not discriminate against certain groups or individuals is imperative. Although there are a few recent attempts to ensure fairness in these models, the majority of fairness literature tends to overlook the feature selection (FS) process despite its critical role as one of the foundational steps in the machine learning pipeline. Moreover, traditional FS methods identify features by examining the correlational relationships between predictive features and the target variable without seeking to uncover causal connections between them. To address these issues, we compare the impact of these two ways of FS (i.e., causality- versus correlation-based) on the performance and fairness of the resulting models for four openly available datasets: two educational ones and two benchmark datasets regularly used in the fairness literature. Our results show that causality-based FS generally leads to fairer models, while the models built after correlation-based FS manifest higher performance.


INTRODUCTION
Ensuring inclusive and equitable quality education and promoting lifelong learning opportunities for all is one of the key sustainable development goals [44]. However, access to superior educational resources remains skewed in favor of more privileged learners. Furthermore, global assessments highlight a concerning shortage of educators, leaving the remaining teachers increasingly strained and fatigued. The onset of the COVID-19 pandemic and subsequent school closures have further exacerbated these challenges [23,36,39]. Artificial intelligence (AI) in education can be part of the solution to overcome these problems: It can unburden teachers, enable learners and educators to access specialized materials well beyond textbooks in multiple formats that bridge time and space, and deliver quality education for all, personalized and at scale [38]. With this, the application of machine learning algorithms is necessary to extract beneficial information from large educational data sets with multiple modalities, support automatic decision-making, and provide appropriate content to the learners. For instance, personalized learning systems can provide instructions in mixed-ability learning groups, chatbots can provide students with detailed and timely feedback on their writing products, and automated assessments can free teachers from some repetitive work and give them more room to support their students [21,47].
However, despite the remarkable results achieved by current educational AI models, ensuring their fairness remains a significant challenge [5,29]. Fairness, in this context, pertains to treating individuals or groups equitably, without any bias or favoritism based on their inherent or acquired characteristics, particularly within decision-making processes [30,37]. As in other domains, educational applications and tools driven by machine learning algorithms carry the potential for ethical challenges [1]. Not only can bias in the real world "creep" into AI systems [37]; even if the underlying data itself is unbiased, the behavior of algorithms can exhibit bias based on certain design choices [30]. Moreover, unlike low-stakes applications, such as a Netflix model predicting movie preferences, deploying machine learning models in education often involves high-stakes decisions, such as determining a student's admission to a study program or eligibility for a scholarship [41]. A recent review on fairness in educational models points out that ethical challenges have been identified in various dimensions, encompassing student attributes like race, ethnicity, nationality, gender, native language, urbanicity, parental educational background, and socioeconomic status [5]. Thus, with the proliferation of AI and automated machine learning models in education, it has become crucial to prioritize the fairness of these models.
Feature selection (FS) is one of the most commonly applied preprocessing or data-transformation/-generation techniques in the machine learning pipeline before training a model [19]. It involves the task of pinpointing and choosing a subset of input features that hold the highest relevance to the target variable. Using it offers numerous advantages, including aiding in data visualization and comprehension, decreasing the need for extensive measurement and storage, reducing training and processing times, and mitigating the challenges of high-dimensional data to enhance prediction accuracy [18,28]. However, despite it being such an essential step in the machine learning pipeline, according to Galhotra et al. (2022) [16], the majority of the fairness literature neglects it. In addition, traditional FS methods identify features by examining the correlational relationships between predictive features and the class variable without seeking to uncover causal connections between them [48]. Thus, an intriguing topic to explore is how causality-based versus correlation-based FS relates to the performance and fairness of the resulting models.
Especially in the educational domain, recent machine learning fairness articles emphasize the importance of using causal algorithms as future work [9,29]. Typically, correlations merely indicate the co-occurrence of features without capturing their causal relationships with the class variable, but research indicates that incorporating causal features in FS for classification can offer two significant potential advantages: First, incorporating causal features can enhance the resilience and robustness of classification models as causal relationships indicate the fundamental mechanism behind the class variable, making them consistent across various settings or environments [4,34,40]. Second, causal features can potentially enhance the explanatory power of classification models [32]. Correlations only capture the simultaneous occurrence of features and the class variable. Consequently, the selected features often fail to provide a compelling explanation for predictions. For instance, a strong correlation between a pupil's height and their mathematics skills might be observed in one primary school. This observation might suggest that height is a significant predictive feature of a pupil's mathematics skills. However, it clearly is not a reasonable explanation for mathematical skills. In reality, factors like age serve as more plausible and understandable causes of mathematical skills, and such predictors will also be more robust if the model is applied in another school.
To address these issues, this article analyzes four openly available datasets (two specifically from the educational domain and two well-known benchmark datasets from the general fairness literature) and examines how causality- versus correlation-based FS relates to the performance and fairness of the resulting models. More specifically, we compare the traditional correlation-based filter FS technique with a specific causality-based filter FS algorithm, which came out on top in a recent evaluation of causality-based FS algorithms [48].

THEORETICAL FOUNDATIONS
In this section, we provide the theoretical foundations for the empirical experiments. First, we explain the main concepts of FS. Second, we give a short introduction to the discovery of the Markov blanket, which is needed for causality-based FS. Third, we summarize current fairness metrics that can be used to assess the fairness of machine learning models.

Feature selection
FS involves identifying and selecting a subset of input features that exhibit the highest relevance to the target variable. FS techniques currently in use can be divided into three main categories: wrappers, embedded methods, and filters [15,18,28]. Wrapper methods conduct an exhaustive search through potential combinations of features. They evaluate each subset by employing the target learning algorithm as a black box [25]. Wrapper FS approaches can be computationally demanding, as model training and cross-validation must be performed for each feature subset, and the results are tailored to a specific model. Embedded methods carry out FS intrinsically as part of the training process and are typically designed for specific learning algorithms. In comparison to wrappers, embedded methods can offer several efficiency advantages. They make optimal use of available data without the need to split the training data into separate sets for training and validation. They also reach a solution more swiftly by avoiding the need to retrain a predictor entirely for each variable subset under investigation [18]. Embedded methods, which have an integrated FS mechanism as part of the predictive model construction process, encompass techniques such as decision trees and ensemble machine learning methods, with random forests [6] and gradient boosting [12] being the most prominent examples.
Compared to these other two FS types that rely on specific predictive models, filter methods operate independently of predictive models. They share a similar search approach with wrappers, but instead of evaluating against a predictor, they use a basic filter as a preprocessing step. Because of this model independence, filter methods offer swift processing speeds and do not exhibit bias towards particular predictive models. As high-dimensional data become more prevalent, filter methods are garnering increased attention. Traditional filter FS methods rely on correlations (see Section 2 in [18] for an overview), whereas emerging and successful filter methods are based on causality [48].
Causality-based FS aims to pinpoint the Markov Blanket (MB) associated with a class variable, with the goal of constructing predictive models that are both more interpretable and robust [17,46]. The MB provides insight into the local causal relationship between the class variable and the features within it. As explained in more detail below (Section 2.2), because all other features are probabilistically independent of the class variable when conditioned on its MB, the MB of a class variable represents the theoretically optimal subset of features for classification [26,31].

Markov blanket
The concept of Markov blanket (MB) in a Bayesian network was developed by Pearl [31]. A Bayesian network is a visual tool that succinctly illustrates a combined probability distribution across a set of random variables using a directed acyclic graph adorned with conditional probability tables [31]. These tables detail the probability distribution of a node based on any instantiation of its parent nodes. As a result, the graph conveys qualitative insights about the random variables, such as conditional independence properties. Meanwhile, the associated probability distribution, which aligns with these properties, offers a numerical portrayal of how the variables are interrelated. The probability distribution and the graph of a Bayesian network are linked by the Markov condition, which asserts that a node is conditionally independent of its nondescendants when given knowledge of its parents.
Definition 1 (Faithfulness): A Bayesian network G and a joint distribution P are faithful to each other if the conditional independences present in P are exactly those entailed by G and the Markov condition [31].
The MB of a variable within a Bayesian network comprises its parents (direct causes), children (direct effects), and spouses (other parents of these children). Given a target variable T and with the faithfulness assumption (Definition 1) in place, the MB of T is unique, and it becomes straightforward to extract it from the associated Bayesian network within a given application domain [15,46]. More specifically, this means that by conditioning on the MB of a class variable T in a dataset, all the remaining features are conditionally independent of T. Thus, the MB of the class variable is theoretically optimal for FS [2,3,46,49,50].
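When the network structure is already known, reading off the MB is a matter of collecting parents, children, and spouses. The sketch below does this for a small hypothetical graph; the node names and edges are invented for illustration and do not come from any of the datasets.

```python
def markov_blanket(parents_of, target):
    """Return parents, children, and spouses of `target`.

    `parents_of` maps every node of a directed acyclic graph to the set of its parents.
    """
    parents = set(parents_of.get(target, ()))
    children = {node for node, pars in parents_of.items() if target in pars}
    # Spouses are the other parents of the target's children.
    spouses = {p for child in children for p in parents_of[child]} - {target}
    return parents | children | spouses

# Hypothetical toy graph around the class variable "pass".
dag = {
    "school": set(),
    "height": set(),                          # unrelated to everything else
    "study_time": {"school"},
    "prior_grade": set(),
    "income": set(),
    "pass": {"study_time", "prior_grade"},    # parents of the target
    "scholarship": {"pass", "income"},        # child of the target; "income" is a spouse
}

mb = markov_blanket(dag, "pass")
print(sorted(mb))
```

In this toy graph "school" influences the target only through "study_time", so it is outside the blanket even though it would typically be correlated with the target.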
Given that possessing complete knowledge of MB(T) is sufficient to ascertain the probability distribution of the class variable T, rendering the values of all other variables redundant, the process of inducing MB(T) can be classified as a causal FS filter procedure [15,46,48]. Nonetheless, this necessitates having the Bayesian network pre-established. Conventionally, we must initially learn the desired Bayesian network to ascertain the MB of a specific variable. Hereby, one distinguishes between the global and the local learning of the network. Global learning refers to learning the whole Bayesian network. Local learning refers to discovering only the local structure around the target variable T or any other specific variable of interest [2,3]. Generally, the process of structure learning for Bayesian networks is recognized as an NP-complete problem. Hence, several algorithms have been invented to deduce the MB without the prerequisite of having the entire Bayesian network pre-constructed, thereby significantly reducing the time and computing resources required.
In a recent review, several of these MB discovery algorithms were assessed in their capability to act as causality-based FS techniques [48]. The algorithm that came off best was the Iterative Parent-Child based search of MB (IPC-MB) algorithm by Fu and Desmarais [13]. The IPC-MB algorithm uses the following procedure to find the parent-child (PC) set of the target variable T: First, all features are the candidate PC of T. Second, conditional independence tests check each feature in the candidate PC of T level by level of the cardinality of the conditioning sets, starting with an empty set. Third, the local search is repeated given all found candidates, which not only recognizes false positives but also candidate spouses. The IPC-MB algorithm came off as the optimal choice in several comparisons of local MB discovery algorithms, excelling in terms of robustness, speed, data utilization, and information retrieval [13,14,48].
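The first two steps of this procedure can be sketched as a level-wise pruning loop. This is a simplified illustration rather than the IPC-MB implementation: the conditional independence test is passed in as a callback (here a hand-written oracle for a hypothetical toy structure), and the third step, which repeats the search to remove false positives and find spouses, is omitted.

```python
from itertools import combinations

def candidate_parents_children(features, target, is_independent, max_level=3):
    """Prune candidates with conditioning sets of growing size, starting from the empty set."""
    candidates = list(features)          # step 1: every feature is a candidate
    separating_sets = {}
    level = 0
    # A conditioning set of size `level` needs at least `level` other candidates.
    while level <= max_level and len(candidates) > level:
        for x in list(candidates):
            others = [f for f in candidates if f != x]
            for cond in combinations(others, level):
                if is_independent(x, target, set(cond)):
                    candidates.remove(x)             # step 2: drop x once it is separated
                    separating_sets[x] = set(cond)
                    break
        level += 1
    return candidates, separating_sets

# Hypothetical ground truth: A -> B -> T, C -> T, and D is unrelated to T.
def oracle(x, target, cond):
    if x == "D":
        return True            # independent of T without any conditioning
    if x == "A":
        return "B" in cond     # A is separated from T once B is known
    return False               # B and C are direct parents of T

pc, sepsets = candidate_parents_children(["A", "B", "C", "D"], "T", oracle)
print(pc, sepsets)
```

With real data the oracle would be replaced by a statistical test, and the number of tests grows quickly with the size of the conditioning sets, which is why such searches cost more than a correlation ranking.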
In this paper, we will compare this causality-based FS (i.e., IPC-MB) with a standard correlation-based FS with regard to the performance and fairness of the resulting models. As illuminated above, the causality-based FS algorithm should select the theoretically optimal features. Another advantage of learning the MB is that it also gives the number of optimal features automatically, while for other FS algorithms, one usually has to provide the number of features one wishes to select.

Fairness Metrics
In the history of constructing and implementing machine learning models in education, the main priority has frequently been to maximize the overall performance of these models. This is particularly evident in typical educational classification tasks, such as endeavors to identify as many at-risk students as possible. However, there has been a conspicuous shortage of attention given to guaranteeing the fairness of these models [29]. As machine learning models are used increasingly in the educational domain, ensuring that they are fair and do not discriminate against certain groups or individuals is imperative.
There is no universally accepted notion of fairness. Several different notions of fairness exist, and many metrics are used to measure these different notions [45]. As recent reviews have given excellent overviews of fairness notions (see [30] for a comprehensive overview of fairness concepts in machine learning in general, and [22] for one specifically tailored to the educational domain), we refrain from repeating those and only explain what is needed for understanding our experiments.
In our experiments, we concentrated on group fairness. Following Saxena et al. [37], we judged fairness as the absence of any favoritism towards a specific group in the context of the machine learning decision-making process. More specifically, we measured fairness between groups labeled as a and b, determined by the sensitive attribute group membership, as explained below in Section 3. Our evaluation was based on the nine quantitative metrics outlined in Table 1. The combination of these metrics allowed us to holistically assess the fairness of the machine learning models.
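To make the group comparison concrete, the sketch below computes several common group fairness quantities (differences and ratios of success rates, true and false positive rates, and accuracies) from binary predictions. The metric names, the direction of the differences (group a minus group b), and the toy labels are assumptions made for illustration; the experiments themselves relied on an existing fairness toolbox rather than on code like this.

```python
def group_rates(y_true, y_pred, group, value):
    """Success rate, TPR, FPR, and accuracy for the samples belonging to one group."""
    idx = [i for i, g in enumerate(group) if g == value]
    pos = [i for i in idx if y_true[i] == 1]
    neg = [i for i in idx if y_true[i] == 0]
    return {
        "success_rate": sum(y_pred[i] for i in idx) / len(idx),
        "tpr": sum(y_pred[i] for i in pos) / len(pos),
        "fpr": sum(y_pred[i] for i in neg) / len(neg),
        "accuracy": sum(y_pred[i] == y_true[i] for i in idx) / len(idx),
    }

def group_fairness(y_true, y_pred, group, a="a", b="b"):
    """Assumed convention: every difference is group a minus group b."""
    ra = group_rates(y_true, y_pred, group, a)
    rb = group_rates(y_true, y_pred, group, b)
    return {
        "statistical_parity": ra["success_rate"] - rb["success_rate"],   # fair at 0
        "disparate_impact": ra["success_rate"] / rb["success_rate"],     # fair at 1
        "tpr_difference": ra["tpr"] - rb["tpr"],                         # fair at 0
        "fpr_difference": ra["fpr"] - rb["fpr"],                         # fair at 0
        "average_odds_difference": 0.5 * ((ra["fpr"] - rb["fpr"]) + (ra["tpr"] - rb["tpr"])),
        "accuracy_difference": ra["accuracy"] - rb["accuracy"],          # fair at 0
    }

# Hypothetical predictions for four members of each group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

metrics = group_fairness(y_true, y_pred, group)
print(metrics)
```

In this toy case both groups are predicted with the same accuracy, yet group a receives positive predictions three times as often, which illustrates why a single metric is rarely enough.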

EXPERIMENTAL SETUP
This section explains the overall experimental setup. First, we describe the datasets used, including references to the original articles, dataset sizes, and the features used as sensitive attributes in the experiments. Second, we describe our general analysis and evaluation pipeline, including the model and hyperparameter optimization processes. All experiments were performed in Python version 3.12. Moreover, we used the pyCausalFS toolbox [48,49] to find the causal features in the datasets, and the holistic AI tool for assessing the fairness of the models.

Datasets
We used four openly available datasets: two educational ones, that is, the Open University Learning Analytics Dataset (OULAD) [27] and the Portuguese Secondary School Math Performance (here referred to as PSSMP) dataset [8], and two commonly used benchmark datasets in the fairness literature: one from the social/law area, that is, the Correctional Offender Management Profiling for Alternative Sanction (COMPAS) dataset [24], and one from the financial domain, that is, the German credit (here referred to as GERCRE) dataset [10].

Open university learning analytics dataset.
The OULAD [27] data originates from courses taught at the Open University in the United Kingdom and consists of five tables with information from 24,806 students, their interactions in the virtual learning environment, assessments, courses, and registrations. We used the binary information of whether a student passed a course as the target variable and the disability status (disability) of the student as the sensitive attribute. To ensure reproducibility of the results, we preprocessed the data following a public repository.

Table 1: Fairness metrics used in this study. DV refers to the desired value (i.e., the value that would mean fairness was achieved according to the metric).

Metric | DV | Formula | Interpretation
Statistical Parity | 0 | SR_a − SR_b (difference in success rates of groups a and b) | Fairness is achieved if the probability of a specific prediction is not dependent on sensitive group membership.

For the PSSMP dataset [8], we used pass/fail of the students' final math grade (feature G3) as a binary classification task and removed the first and second grades because they are highly correlated with the target, making the prediction task trivial if included. Moreover, we one-hot-encoded all categorical features in the dataset. As the sensitive attribute, we used the students' gender (sex).

Correctional offender management profiling for alternative sanctions dataset.
The COMPAS software assesses the likelihood of an individual committing another crime. Judges rely on COMPAS to determine whether to grant release to an offender or maintain their incarceration. A scrutiny of the software revealed a bias against African Americans: COMPAS tends to exhibit higher rates of incorrect positive predictions for African-American offenders compared to Caucasian offenders, falsely indicating a greater risk of re-offending [30]. Because of that, the COMPAS dataset [24] has become a well-known benchmark dataset in the fairness literature.
It contains data about 6,172 individuals. We designated the target class as the indication of whether a person commits a crime in the following two years or not (Two_yr_Recidivism). Additionally, we identified the sensitive attribute (race) as the information regarding whether this person is African-American.

German credit dataset.
The German credit dataset, here referred to as GERCRE, serves as a benchmark dataset frequently employed in the machine learning fairness literature (see, e.g., [20,33,43] for a selection of articles published after 2020). Recently, this fairness benchmark dataset faced criticism [11] due to the use of gender as the sensitive attribute in several machine learning articles (including those mentioned, i.e., [20,33,43]), despite the absence of a specific coding for gender in the data. Instead, the dataset includes only the combined feature of sex and marital status. Despite its age and recent scrutiny regarding its suitability for assessing fairness in machine learning models [11], we opted to use the GERCRE dataset for illustrative and comparative purposes. The dataset comprises information on 1,000 individuals. For the sensitive attribute, we selected the sex and marital status column, designating divorced/separated males as the sensitive group. Our target class was determined by creditworthiness, where one denotes credit-worthy, and zero signifies not credit-worthy.

Analysis pipeline and performance evaluation
Our goal was to compute the effect of two different ways of FS (causality- versus correlation-based) in comparison to no FS on the performance and fairness of the resulting machine learning models. In order to reduce the risk of getting results by chance and to increase stability and robustness, we implemented two nested cross-validation loops. The outer stratified five-fold cross-validation loops over each experimental dataset, always using four folds for training and one for testing. Within each division, the training set is used to (1) build a model (another five-fold cross-validation grid-search is employed to select the best hyperparameters, as explained in Section 3.2.1 below) using all features (i.e., without FS), (2) perform causality-based FS and build a model using the same model-building function as in (1) but by using only the k selected causal features, and (3) perform correlation-based FS to select exactly as many features k as the causality-based FS selected and build a model using the same model-building function as in (1) and (2) but by using only the k selected correlational features. Hereby, the IPC-MB FS algorithm (see Section 2.2) of the pyCausalFS toolbox is employed to get the causal features in (2), and the f_classif scoring function is employed to get the correlation-based features in (3). As explained in Section 2, the causality-based FS also readily returns the number of optimal features k, and hence, in our algorithmic pipeline, we use the k most important features from the correlation-based FS, which by default returns only an ordering of feature importances. After model building and FS on the training set, the model performance and fairness are evaluated on the selected features of the respective hold-out test set. To further decrease the risk of getting results by chance, the process of model building and performance and fairness assessment is repeated ten times for each kind of FS (i.e., 1 - without FS, 2 - causality-based FS, 3 - correlation-based FS). Finally, the performance and fairness results on the 50 different runs and test sets (i.e., ten repetitions for each five-fold split) are averaged for each kind of FS.
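The outer loop of such a pipeline might look roughly as follows. This is a sketch on synthetic data and not the study's code: `select_causal_features` is a hypothetical placeholder for the causality-based step (it simply returns three fixed columns), the inner grid search is left out for brevity, the number of repetitions is reduced to two, and only one performance metric is recorded.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold

def select_causal_features(X_train, y_train):
    """Hypothetical stand-in for the causality-based FS; returns fixed column indices."""
    return np.array([0, 1, 2])

# Synthetic data in place of the real datasets.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
scores = {"none": [], "causal": [], "correlation": []}

for repetition in range(2):  # the text describes ten repetitions
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=repetition)
    for train_idx, test_idx in outer.split(X, y):
        X_tr, X_te = X[train_idx], X[test_idx]
        y_tr, y_te = y[train_idx], y[test_idx]

        causal = select_causal_features(X_tr, y_tr)
        k = len(causal)                              # the causal step fixes k
        f_values, _ = f_classif(X_tr, y_tr)          # scores computed on training data only
        correlational = np.argsort(-f_values)[:k]    # keep the k best-scoring features

        subsets = {
            "none": np.arange(X.shape[1]),
            "causal": causal,
            "correlation": correlational,
        }
        for name, cols in subsets.items():
            model = RandomForestClassifier(n_estimators=25, random_state=repetition)
            model.fit(X_tr[:, cols], y_tr)
            y_hat = model.predict(X_te[:, cols])
            scores[name].append(balanced_accuracy_score(y_te, y_hat))

# Mean and standard deviation over all runs, per kind of FS.
summary = {name: (float(np.mean(v)), float(np.std(v))) for name, v in scores.items()}
print(summary)
```

Selecting features inside each training fold, rather than once on the full data, keeps information from the test fold out of the selection step.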

Models and hyperparameter optimization.
Initially, we tried several different classification model types (logistic regression, support vector machines, multilayer perceptron, random forest), but since random forest [6] consistently gave the best performance in the initial tests, we used only this model class for the final pipeline and evaluation to improve comparability and simplicity. To select the best hyperparameters, each training set of the outer cross-validation loop was split further into five folds. A random forest model with the current hyperparameters was trained on each fold, while the objective function of each step was to increase the performance on the hold-out datasets of each fold. For hyperparameter optimization of the random forest model, we implemented a grid-search over the max_depth, the max_features, the min_samples_leaf, and the min_samples_split of the trees in the forest. For classification performance evaluation, we used five common metrics: accuracy, balanced accuracy, precision, recall, and the F1-score.
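A grid search of this kind can be sketched with scikit-learn as shown below. The grid values, the scoring objective (balanced accuracy), the forest size, and the synthetic data are all assumptions made for illustration, since the text does not state them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for one outer training/test split.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Assumed search space over the four hyperparameters named in the text.
param_grid = {
    "max_depth": [3, None],
    "max_features": ["sqrt", None],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    cv=5,                          # inner five-fold cross-validation
    scoring="balanced_accuracy",   # assumed objective
)
search.fit(X_tr, y_tr)             # refits the best configuration on the full training set

y_hat = search.predict(X_te)
report = {
    "accuracy": accuracy_score(y_te, y_hat),
    "balanced_accuracy": balanced_accuracy_score(y_te, y_hat),
    "precision": precision_score(y_te, y_hat),
    "recall": recall_score(y_te, y_hat),
    "f1": f1_score(y_te, y_hat),
}
print(search.best_params_, report)
```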

RESULTS
Table 2 summarizes the performance results, and Table 3 summarizes the fairness results for the OULAD dataset. As described in Section 3.2, the tables report the averages (mean and standard deviation) over the 50 different runs and test sets (i.e., ten repetitions for each five-fold split). Tables 4 and 5 summarize these results for the PSSMP dataset, Tables 6 and 7 for the COMPAS, and Tables 8 and 9 for the GERCRE dataset, respectively.
A first observation is that FS generally decreased both model performance and fairness: the models built without FS show the best results in most cases. However, there are a few notable exceptions to this general observation: For the PSSMP dataset, correlation-based FS yielded the best performance according to all metrics (one possible reason is that the dataset may contain adverse features that the correlation-based FS correctly did not select), and for the OULAD dataset, the causality-based FS often gave better results in terms of model fairness (indicating that the causality-based FS correctly did not select features that could be linked to the sensitive attribute).
For the two FS types, we bolded for each row the FS that gave the best results. As shown in Tables 2-9, causality-based FS typically performed better when measuring fairness, and correlation-based FS performed better when measuring classification performance. Although there were a few exceptions, generally, the models built after correlation-based FS outperformed those built after causality-based FS in terms of accuracy, balanced accuracy, precision, recall, and F1-score. Moreover, on average, the models built after causality-based FS more often had fairness metric values closer to the desired value of the respective metric (see Table 1 for an overview of all employed fairness metrics and their respective desired values).
Correlation-based FS is a technique that focuses on finding and selecting the most relevant features from a dataset. Causality-based FS aims to identify the MB of a class variable to build more interpretable and robust predictive models. Thus, using the causal features for similar data points in another environment or at a later time (for example, by updating the already quite old GERCRE dataset with current information) would probably lead to better results, although the performance of the correlation-based selected features is better for the existing data.
Another result worth pointing out, because it may affect the use of causality-based FS algorithms in real-world applications, is that the causality-based FS took considerably longer than the correlation-based FS. For example, even for the relatively small PSSMP dataset, the computation of the causal features with the pyCausalFS toolbox took on average 0.266 seconds (266 milliseconds) per training set fold, while the computation of the correlational features on the same folds took only 15.625 milliseconds on average, a roughly 17-fold difference. Thus, causality-based FS is not advisable if computing resources are an issue.

DISCUSSION
Machine learning is now being applied in a diverse array of decision-making contexts, many of which carry significant consequences for both individuals and society at large. While this technology holds the promise of mitigating undesirable elements of human decision-making, there is a valid apprehension that biases present in the data and inaccuracies in the model can result in decisions that unfairly disadvantage groups with a history of discrimination. Consequently, the research community has begun to explore methods to guarantee that the models we train do not render decisions that exhibit unfairness concerning sensitive attributes [7].
Until now, the fairness literature has largely overlooked the FS step in the machine learning pipeline [16], and several articles on fairness of AI in education have pointed out the need for more causal algorithms [9,29]. Thus, the goal of this article was to assess the impact of correlational versus causal FS on the resulting machine learning models. The theoretical superiority of causality-based FS has been discussed in several works [31,48,49], but to our knowledge, no direct comparisons to classical correlation-based FS have been performed. Our results showed that, mostly, the causality-based FS led to fairer models than the correlation-based FS. However, there is a trade-off, as the performance of the resulting models was usually better after correlation-based FS. Causal FS aims to identify the MB of a class variable to build more interpretable and robust predictive models. Correlation-based FS is a technique that focuses on finding and selecting the most relevant features from a dataset. Since the correlation between two variables is a less stringent criterion compared to independence, it is logical to question why there is not much work on causal algorithms and machine learning [40], and why fairness algorithms and standards are typically framed in terms of correlations. One pragmatic justification is that, as discussed in Section 4, computing correlations is significantly more straightforward than estimating independence. While correlation is a descriptive statistic demanding relatively few assumptions for calculation, establishing independence necessitates the application of inferential statistics, which can generally be quite complex and computationally expensive [22,42].

Limitations and future research
Our study opens avenues for further exploration and refinement. Firstly, our analysis was anchored in the utilization of the prevalent correlation-based filter FS, and the best-performing causality-based filter FS. Exploring additional FS techniques and types, such as wrappers, could prove worthwhile. Another aspect deserving attention is the sensitivity of our outcomes to the choices embedded in different algorithms. While our approach involved a meticulous grid search, coupled with the aggregation of averages across multiple iterations for result stability, the impact of diverse hyperparameter search spaces warrants investigation.
Furthermore, we focused on specific sensitive attributes, namely disability, gender, race, and marital status, aiming to encompass various factors where individuals may face discrimination. However, it is essential to note that even if an algorithm is deemed fair regarding one attribute, this does not necessarily extend to others. Future work should also investigate the effect of FS on additional sensitive attributes. Finally, an intriguing avenue for future research lies in delving into the comparative effects of causality- versus correlation-based FS on the interpretability and quality of subsequent AI models and explanations. Quality metrics such as explanation robustness and fidelity [35] could serve as valuable benchmarks in evaluating these effects.
Table 1 (continued):

Metric | DV | Formula | Interpretation
Disparate Impact | 1 | SR_a / SR_b | Very similar to statistical parity but computes the ratio instead, meaning fairness is achieved if this metric equals 1.
… | … | … (SR_a − SR_b) | Z-test statistic for the difference in success rates. Fairness is achieved if the computed value is between -2 and 2, indicating no statistically significant difference in success rates.
… | 0 | TPR_a − TPR_b | Difference in true positive rates for group a and group b, considered fair if it achieves 0, range between -1 and 1.
False Positive Rate Difference | 0 | FPR_a − FPR_b | Difference in false positive rates between group a and group b, considered fair if it achieves 0, range between -1 and 1.
… | 0 | 1/2 * (FPR_a − FPR_b + TPR_a − TPR_b) | Difference in average odds between group a and group b, considered fair if it achieves 0, range between -1 and 1.
… | 0 | ACC_a − ACC_b | Difference in the accuracy of predictions for group a and group b, considered fair if it achieves 0, range between -1 and 1.

Portuguese secondary school math performance dataset.
The Portuguese secondary school mathematics performance (PSSMP) dataset [8] consists of 649 students from two Portuguese secondary schools.

Table 2: Performance metrics for the OULAD test sets for the features selected on the training sets (mean and standard deviation over five-fold cross-validation and ten repetitions). The best average FS (causal- or correlation-based) is bolded for each metric.

Table 3: Fairness metrics for the OULAD test sets for the features selected on the training sets (mean and standard deviation over five-fold cross-validation and ten repetitions). The best average FS (causal- or correlation-based) is bolded for each metric.

Table 4: Performance metrics for the PSSMP test sets for the features selected on the training sets (mean and standard deviation over five-fold cross-validation and ten repetitions). The best average FS (causal- or correlation-based) is bolded for each metric.

Table 5: Fairness metrics for the PSSMP test sets for the features selected on the training sets (mean and standard deviation over five-fold cross-validation and ten repetitions). The best average FS (causal- or correlation-based) is bolded for each metric.

Table 6: Performance metrics for the COMPAS test sets for the features selected on the training sets (mean and standard deviation over five-fold cross-validation and ten repetitions). The best average FS (causal- or correlation-based) is bolded for each metric.

Table 7: Fairness metrics for the COMPAS test sets for the features selected on the training sets (mean and standard deviation over five-fold cross-validation and ten repetitions). The best average FS (causal- or correlation-based) is bolded for each metric.

Table 8: Performance metrics for the GERCRE test sets for the features selected on the training sets (mean and standard deviation over five-fold cross-validation and ten repetitions). The best average FS (causal- or correlation-based) is bolded for each metric.

Table 9: Fairness metrics for the GERCRE test sets for the features selected on the training sets (mean and standard deviation over five-fold cross-validation and ten repetitions). The best average FS (causal- or correlation-based) is bolded for each metric.