Benchmarking Classifiers for Loan Default Prediction using Archetypal Analysis

The prediction of loan default is a critical process for the successful development of financial institutions. To effectively manage credit risk, numerous machine learning models have been employed to distinguish creditworthy from high-risk applicants. However, determining an optimal model remains a challenge. To address this, we explore an alternative approach to model benchmarking. The main concept is a pipeline that constructs different classifiers for loan default prediction and compares their performance across several evaluation metrics. To achieve this goal, we deploy an approach based on a multivariate statistical method known as Archetypal Analysis (AA). The proposed methodology is applied to four datasets with diverse structural characteristics. The findings demonstrate that advanced classifiers such as Random Forests (RF) and Artificial Neural Networks (ANN), combined with oversampling, simple parameter tuning, and feature selection, consistently outperform traditional classifiers across most evaluation criteria. In conclusion, the results showcase the ability of AA to intuitively identify the best and worst models for each unique scenario.


INTRODUCTION
Loan default prediction involves the use of statistical methods to distinguish creditworthy from risky borrowers [1]. In the past, a loan applicant's creditworthiness was evaluated by employees. However, this process was characterized by subjectivity and inconsistency [2]. Consequently, financial institutions gradually turned to advanced techniques, leading to an abundance of techniques, classifiers, and evaluation criteria [2].
It is widely acknowledged that no single approach can fully meet the needs of every financial institution [3], since each machine learning model has its own advantages and limitations. To uncover the best practices for loan default prediction, numerous studies have conducted model benchmarking [4]. Despite the contribution of these studies, no effective guidelines have yet been proposed.
The limitations of these studies arise from oversights in their experimental design. Firstly, it is common to overlook the structural differences between financial institutions. Moreover, the plethora of proposed techniques, classifiers and evaluation metrics adds complexity to selecting an optimal model. Lastly, a critical issue lies in the widespread use of aggregation methods for model evaluation.
Based on that, in this study we examine a commonly observed phenomenon, wherein competing models exhibit varying performance. As a result, the research question we pose revolves around the concept of model benchmarking. During this process, it is of primary interest to find a set of models that generally perform well across all criteria, using a trustworthy benchmarking procedure. Yet, valuable insights can also be gained from models that showcase poor performance [5].
To address the research question, we consider a multivariate statistical method known as Archetypal Analysis (AA) [6], which has the potential to benchmark models that present varying degrees of efficiency across different evaluation metrics. To the best of our knowledge, AA has not yet been applied in the field of loan default prediction for benchmarking classification models. For the effective application of AA, we follow a streamlined process where, initially, multiple loan prediction datasets with different characteristics are selected. Then, we employ well-known machine learning techniques and classifiers, while a wide range of evaluation metrics is used to assess their performance. By completing these steps and subsequently applying AA, our objective is to illustrate an alternative method of selecting the optimal model in each specific case, based on the dataset characteristics.
Hence, the primary contribution of our paper, in relation to recent literature on loan prediction, is that we offer a novel way of benchmarking classification models in this domain, using a multivariate statistical methodology. The advantage of the proposed methodology is that it relies on multiple evaluation criteria and can find the optimal (or worst) classifier on one or more metrics, thus providing a more comprehensive way of model comparison. In that sense, our paper differs from other known methodologies on loan prediction and data mining such as KDD, CRISP-DM or SEMMA, as we do not create a classification model and analyze its evaluation but rather propose an efficient way of selecting targeted models based on pure statistical outputs.
The remainder of this study is organized as follows. The next chapter offers an analytical view on the datasets, classifiers, and benchmarking work on loan default prediction. In Chapter 3, the applied methodology is described in detail. The results obtained from the experiments are presented in Chapter 4, while Chapter 5 provides conclusions and suggestions for potential future directions.

RELATED WORK

Datasets for Loan Prediction
Availability is a common challenge regarding the datasets employed for loan default prediction. This issue stems from data privacy laws that impose restrictions on data sharing [7]. As a result, public datasets have had lower usage rates, with the majority of research conducted on the so-called Australian and German datasets [8]. Additionally, most studies rely on a single dataset for their analysis, while the respective average was approximately two datasets [9]. However, single dataset usage hampers the ability to generalize the findings.

Classifiers for Loan Prediction
A plethora of different classification models has been used for loan default prediction, with one of the most widely known being Linear Discriminant Analysis (LDA). LDA remains a reliable technique for predicting loan defaults [10]. Logistic Regression (LR) has also emerged as a popular alternative [11]. In particular, LR has been used as a multivariate model for credit risk assessment [12] and continues to be considered one of the fundamental models used by financial institutions [13], due to its ease of implementation and good performance [14]. Artificial Neural Network (ANN) classifiers are also considered among the most popular classifiers in loan default prediction [10]. Over the years, several ANN models have been proposed [15]. Among them, feed-forward networks are the most commonly used due to their comprehensibility and

Benchmarking
Over the years, there has been a decrease in model benchmarking, which indicates that researchers seek to propose new classifiers [8]. However, given the abundance of available methods, financial institutions are unable to choose the optimal model based on their needs. Although this difficulty has been thoroughly examined [2,5,25,26], identifying an optimal model is hard to achieve and depends on multiple factors [17].
To address these gaps, researchers in various fields have used the Archetypal Analysis (AA) algorithm [6]. For instance, AA has been proposed in economics [18] and software development [5]. Moreover, recent studies that estimate the effort required for software development highlight the benefits of AA as an effective benchmarking method [19].

METHODOLOGY
The proposed methodology is divided into five steps, as shown in Figure 1: (i) data collection, (ii) data preprocessing, (iii) classification, (iv) evaluation and (v) benchmarking with AA.
For the execution of the experiments, the Python programming language was used in the Jupyter Notebooks environment. Additionally, for the implementation of the AA algorithm, the R programming language was used in the RStudio environment, with the primary library being archetypes [20]. The experiments were conducted on a laptop with an AMD Ryzen i7 processor operating at 2.30 GHz and 8 GB of memory.

Data Collection
In the first step, we collected four datasets with different structural characteristics, as described in Table 1. The datasets were collected from Kaggle [24], a popular platform for sharing publicly available datasets. Therefore, all selected datasets are public, as access to private financial institution datasets was not feasible. A brief description of the selected datasets is provided below.
The Loan Prediction dataset pertains to a company's attempt to automate the loan approval process for applicants through online applications. The company provides a subset of the data for use. On the other hand, the Credit Risk Classification dataset focuses on banking products and consists of two datasets containing both customer transactions and demographic information. Additionally, the Credit Card Approval Prediction dataset involves the approval

Data Preprocessing
Data preprocessing is particularly important for improving the performance of the models. The following techniques were applied to missing values: 1) median imputation for quantitative features, 2) mode imputation for categorical features and 3) deletion of features with more than 30% missing values. During feature engineering, variables that contained categories or classes were merged or split when deemed useful. Label encoding and one-hot encoding were used to transform categorical variables. Next, the datasets were split into a training set (70%) and a test set (30%). After the split, Standard Scaling was performed. The primary objective in loan default prediction is to accurately predict the minority class, as misclassifying the minority class often leads to greater negative consequences [21]. In this study, four different class imbalance handling techniques are employed, depending on the examined model. The default option was to ignore class imbalance; otherwise, it was managed with oversampling (SMOTE or ADASYN), undersampling (Near Miss) or a hybrid technique (SMOTE-ENN). Lastly, in the feature selection stage, the models could have no feature selection, or feature selection using the feature importance method of the RF classifier, with a 5% significance threshold [22].
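The order of the numeric preprocessing steps described above can be illustrated with a minimal, standard-library-only sketch. It is not the code used in the study: the toy data, the function names and the random seed are assumptions, and categorical handling, resampling and feature selection are omitted. The sketch estimates the scaling statistics on the training split only, which is one reasonable reading of "after the split".

```python
import random
import statistics

def drop_sparse_columns(rows, max_missing=0.30):
    """Drop columns whose share of missing values (None) exceeds max_missing."""
    n_rows, n_cols = len(rows), len(rows[0])
    keep = [j for j in range(n_cols)
            if sum(row[j] is None for row in rows) / n_rows <= max_missing]
    return [[row[j] for j in keep] for row in rows], keep

def impute_median(rows):
    """Replace missing values in each numeric column with the column median."""
    n_cols = len(rows[0])
    medians = [statistics.median([row[j] for row in rows if row[j] is not None])
               for j in range(n_cols)]
    return [[medians[j] if value is None else value
             for j, value in enumerate(row)] for row in rows]

def split_70_30(rows, seed=0):
    """Shuffle the rows and return a 70% training set and a 30% test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(0.30 * len(rows))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test

def fit_standard_scaler(train):
    """Estimate per-column mean and standard deviation on the training set."""
    columns = list(zip(*train))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # guard against 0
    return means, stds

def apply_scaler(rows, means, stds):
    """Standardize rows with statistics estimated elsewhere (the training set)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical toy data: 10 applicants, 3 numeric features, None = missing.
raw = [
    [1.0, 10.0, None], [2.0, 20.0, 5.0], [None, 30.0, None], [4.0, 40.0, 7.0],
    [5.0, 50.0, None], [6.0, 60.0, 1.0], [7.0, 70.0, None], [8.0, 80.0, 2.0],
    [9.0, 90.0, 3.0], [10.0, 100.0, 4.0],
]

filtered, kept = drop_sparse_columns(raw)   # third column is 40% missing
imputed = impute_median(filtered)
train, test = split_70_30(imputed)
means, stds = fit_standard_scaler(train)    # fitted on the training data only
train_scaled = apply_scaler(train, means, stds)
test_scaled = apply_scaler(test, means, stds)
```

Fitting the scaler on the training split and reusing its statistics for the test split keeps information from the test set out of the training procedure.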

Classifiers and Evaluation Criteria
The constructed models were based on four classifiers, namely LDA, LR, RF and ANN, with all selected classifiers used either with or without parameter tuning. Due to the extensive number of experiments, we apply the Random Search CV algorithm to all classifiers. In the case of ANN, a wider set of parameters was constructed, and the Random Search algorithm from the keras_tuner library was used.
Regarding evaluation, an inspection of recent literature reveals that accuracy is among the most common evaluation metrics [2,9]. However, accuracy can lead to misleading conclusions because classifiers tend to prioritize predicting the majority class. The Area Under the Curve (AUC) metric is equally important, particularly for imbalanced datasets [18], and it is one of the most popular metrics [8]. Additional effective metrics are Precision, F-measure and G-mean [8], while less commonly used metrics are the Brier Score, Kolmogorov-Smirnov Statistic, H-measure, and Gini index [8]. Hence, after careful consideration, we based the model evaluation on 8 different metrics.
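To make the point about accuracy concrete, the following sketch computes a few of the threshold-based metrics from a confusion matrix. The counts are hypothetical and the function name is an assumption; the formulas are the standard definitions, with G-mean taken as the geometric mean of recall and specificity.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics; the positive class is 'default'."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
    }

# Hypothetical test set: 900 non-defaulters and 100 defaulters.
# A classifier that always predicts the majority class ("no default"):
majority_only = binary_metrics(tp=0, fp=0, tn=900, fn=100)

# A classifier that actually detects most defaulters:
balanced = binary_metrics(tp=70, fp=180, tn=720, fn=30)
```

In this toy case the majority-only classifier has the higher accuracy (0.90 versus 0.79) although its recall and G-mean are zero, which is why accuracy alone is not used for benchmarking.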
We also make use of the macro scores (F1M, RCM, PRM, GMM), which represent the average of each metric over the two classes (loan, no loan), resulting in a total of 12 evaluation metrics. As shown in Table 2, multiple combinations are created, resulting in 80 models per dataset and a total of 320 classification models that are subsequently benchmarked with the AA algorithm.
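The count of 80 models per dataset is consistent with a full factorial design over the options described in the methodology. The sketch below enumerates such a grid under that assumption: 4 classifiers, 2 tuning options, 5 imbalance options and 2 feature selection options. The abbreviations df, tn, nih, ad, nm, nfs and rfi follow the model names used in the results, with df and tn read as default and tuned parameters; the codes "sm" (SMOTE) and "se" (SMOTE-ENN) are hypothetical placeholders.

```python
from itertools import product

classifiers = ["LDA", "LR", "RF", "ANN"]
tuning = ["df", "tn"]                       # default vs. tuned parameters
imbalance = ["nih", "sm", "ad", "nm", "se"]  # "sm" and "se" are hypothetical codes
feature_selection = ["nfs", "rfi"]          # none vs. RF feature importance

# One name per combination, e.g. "ANN-tn_ad_nfs".
model_names = [
    f"{clf}-{tun}_{imb}_{fs}"
    for clf, tun, imb, fs in product(classifiers, tuning, imbalance,
                                     feature_selection)
]

models_per_dataset = len(model_names)
total_models = models_per_dataset * 4       # four datasets
```

Under these assumptions the grid yields 80 names per dataset and 320 models overall, matching the totals reported in the text.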

Archetypal Analysis (AA)
We base the application of AA on the representation of each constructed classifier. A detailed description of the algorithm can be found in the Supplementary Material of the results repository [23].
Under the assumption that each model is trained to predict observations of a dataset and is evaluated with a set of evaluation metrics, we represent the models as shown in Table 3. Hence, each model can be expressed as a multidimensional vector M = {Score1, Score2, ..., Scorem}, with the scores accounting for the model's performance across the defined evaluation metrics. In a multidimensional space, these vectors represent points, and multiple points in a multidimensional space form a geometrical shape.
The AA algorithm is then applied to this set of points to find the convex hull that surrounds the multidimensional vectors. The points of the convex hull are called archetypes and are located on the boundary of the point cloud. In the case of our study, the archetypes are constructed models with specific performance across the evaluation metrics, and they may have efficient or poor performance in one or more evaluation metrics. Finally, the output of the AA algorithm is a matrix in which each of the remaining models that are not archetypes is assigned weights that portray how similar (or close) the model is to each of the detected archetypes. In Table 4, we illustrate the output of AA.
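For orientation, the optimization problem usually associated with AA can be written as follows. This is the standard formulation from the AA literature rather than a transcription of the Supplementary Material, and the symbols X, Z, A, B, n and k are notation assumed here: X is the n-by-m matrix of model scores, Z holds the k archetypes, and A holds the weights assigned to each model.

```latex
\[
\min_{A,\,B}\ \mathrm{RSS} = \lVert X - A Z \rVert_F^2,
\qquad Z = B X,
\]
\[
\text{subject to}\quad
\alpha_{ij} \ge 0,\ \sum_{j=1}^{k} \alpha_{ij} = 1 \quad (i = 1,\dots,n),
\qquad
\beta_{ji} \ge 0,\ \sum_{i=1}^{n} \beta_{ji} = 1 \quad (j = 1,\dots,k).
\]
```

The constraints on A make every model a convex combination of the archetypes, and the constraints on B make every archetype a convex combination of the observed models, which is why the archetypes lie on or near the boundary of the data.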
The weights represent the a-coefficients of the AA algorithm and, as they sum up to 1 for each model, they act as indicators of the resemblance of each model to each archetype. For example, in a solution with three archetypes and model coefficients 0.9, 0.08 and 0.02, we can deduce that the model is 90% similar to archetype 1, 8% similar to archetype 2 and 2% similar to archetype 3, and thus is closest to archetype 1 in terms of performance, whether that performance is poor or efficient.
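The interpretation of the a-coefficients can be sketched in a few lines. The coefficients 0.9, 0.08 and 0.02 come from the example above; the archetype score vectors and the function names are hypothetical and serve only to show that a model's scores are approximated by a weighted mixture of the archetypes.

```python
def closest_archetype(alphas):
    """Return (index, weight) of the archetype with the largest a-coefficient."""
    index = max(range(len(alphas)), key=lambda j: alphas[j])
    return index, alphas[index]

def reconstruct(alphas, archetypes):
    """Approximate a model's scores as a convex combination of archetypes."""
    n_metrics = len(archetypes[0])
    return [sum(a * z[m] for a, z in zip(alphas, archetypes))
            for m in range(n_metrics)]

alphas = [0.90, 0.08, 0.02]   # a-coefficients of one model (from the text)

# Hypothetical archetype scores on two metrics, e.g. (accuracy, AUC).
archetypes = [
    [0.80, 0.75],   # archetype 1
    [0.60, 0.55],   # archetype 2
    [0.40, 0.50],   # archetype 3
]

index, weight = closest_archetype(alphas)     # archetype 1, weight 0.90
approx_scores = reconstruct(alphas, archetypes)
```

Because the weights are non-negative and sum to 1, the reconstructed scores always fall between the smallest and largest archetype score on each metric.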
The benchmarking process of this study is based on this type of interpretation. It provides a multidimensional model evaluation that pinpoints not only models that excel in their predictions but also models that perform poorly, enabling financial institutions to monitor the prediction process and choose the best models.

RESULTS
In this section, we present the results of applying the AA algorithm to the constructed classifiers. The presented models are named To simplify the presentation of the proposed methodology, the analysis in the following two sections focuses solely on the Loan Prediction dataset. The entirety of the results for all datasets can be found in the results repository [23], along with supplementary material, plots and insights. The presentation of results is based on evaluating the produced archetypes, finding relations between models and archetypes, and briefly analyzing the results of the archetypes in terms of model performance. We should emphasize that the archetypes detected by the AA algorithm do not necessarily represent models with optimal performance, but rather models that present interesting characteristics and act as "extreme" or "noteworthy" cases. Of course, the AA algorithm is also capable of detecting models with optimal performance on multiple evaluation metrics.

Evaluating Archetypes
In Table 5, the models identified by AA as archetypes for the Loan Prediction dataset are presented along with their performance across the selected evaluation metrics. We should point out that the AA algorithm automatically detects the optimal number of archetypes to be extracted, so as to minimize error and optimize the performance of the algorithm.
As observed, the six detected archetypes differ in their characteristics, in their performance on each evaluation metric and in their overall performance. More specifically, AR5 (ANN-tn_ad_nfs) and AR6 (ANN-tn_nih_nfs) are particularly ineffective in most evaluation metrics, as they represent ANN classifiers without feature selection and, in the case of AR6, without imbalance handling. Hence, these models can be considered baseline classifiers that do not adapt to the characteristics of this specific dataset and perform poorly, based on the proposed benchmarking process. AR3 (ANN-df_nm_nfs) is moderately effective in most evaluation metrics, representing a model that performs undersampling with NM but does not use feature selection. Finally, AR1 (ANN-df_ad_rfi), AR2 (RF-tn_nm_nfs) and AR4 (LDA-df_nih_rfi) exhibit high effectiveness in most criteria, with AR4 and AR1 being the optimal models. This indicates that selecting the most appropriate features with the RF feature importance method is crucial for the performance of a model, while RF retains its position as a classification model with consistently acceptable performance. AR1 appears to be the most effective in most of the evaluation criteria compared to the other archetypes. However, each evaluation metric is weighed differently based on the needs of each financial institution. Thus, a model can be considered optimal if it consists of a desirable mixture of archetypes.

Relations between Remaining Models and Archetypes
A further output of the analysis is the set of a-coefficients generated by the AA algorithm, which can be used to explore the similarity of the constructed models to the archetypes. To explore this similarity, the common practice is to set a threshold. When the a-coefficient of a specific model exceeds this threshold, the model is considered a neighbor of the archetype, meaning that its overall performance closely resembles that of the archetype.
The selection of such a threshold can be arbitrary, but we define that an a-coefficient above 70% indicates a strong relationship between the model and the archetype [5]. Models that do not exceed this threshold do not appear in the results, as they cannot be associated with any archetype. Additionally, models that exhibit identical performance across evaluation metrics are merged, thus reducing the number of models presented. In Table 6, the indicative performance of some models that passed the 70% threshold is showcased, while Table 7 displays the a-coefficients of the models with respect to the archetypes, pinpointing the archetype that is closest to each respective model. The full tables can be found in the Supplementary Material [23].
Based on Table 6, the benchmarking process indicates a well-rounded matching between models and archetypes, as for the majority of the models the archetype to which they are closest can be easily identified. This proves that AA can indeed be a potent For example, models ANN-tn_ad_nfs and ANN-tn_nih_nfs are identical to AR5 and AR6, respectively, which were the worst performing archetypes overall. On the other hand, the ANN-df_nm_nfs model closely resembles AR3 and can be considered a moderate model, with moderate ACC, PR and SP. The models RF-tn_nih_rfi and ANN-df_ad_rfi are 90% and 100% similar to AR1, respectively, and are highly effective in all metrics. Furthermore, models like LDA-tn_nih_rfi and LR-tn_nih_rfi closely resemble AR4, which is highly effective in all metrics except RC. Lastly, it can be observed that model LDA-df_nih_nfs is 80.7% similar to AR4 and 17.3% similar to AR1, making it a combination of archetypes that are particularly effective, which is an encouraging sign for a model that is adaptable and conforms to the characteristics of the specific dataset.
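The thresholding step can be sketched as follows. The 70% threshold and the coefficients 1.0, 0.90, 0.807 and 0.173 are taken from the text; how the remaining weight of each model is spread over the other archetypes is not reported here, so those values, the "hypothetical-mixed" entry and the function name are illustrative assumptions.

```python
THRESHOLD = 0.70  # a-coefficient above which a model is a neighbor of an archetype

def assign_neighbors(coefficients, threshold=THRESHOLD):
    """Map each model to its dominant archetype if the weight exceeds threshold."""
    assigned = {}
    for model, alphas in coefficients.items():
        archetype, weight = max(alphas.items(), key=lambda item: item[1])
        if weight > threshold:
            assigned[model] = (archetype, weight)
    return assigned

coefficients = {
    "ANN-df_ad_rfi": {"AR1": 1.0},
    # The placement of the remaining 0.10 is an assumption.
    "RF-tn_nih_rfi": {"AR1": 0.90, "AR4": 0.10},
    # The placement of the remaining 0.02 is an assumption.
    "LDA-df_nih_nfs": {"AR4": 0.807, "AR1": 0.173, "AR2": 0.02},
    # Invented model with no dominant archetype, dropped by the threshold.
    "hypothetical-mixed": {"AR1": 0.45, "AR4": 0.40, "AR2": 0.15},
}

neighbors = assign_neighbors(coefficients)
```

With these inputs, the first two models are assigned to AR1, LDA-df_nih_nfs is assigned to AR4, and the mixed model is left out because none of its coefficients exceeds 0.70.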
In a broader context, with the help of a-coefficients, we can easily discern the best and worst models, based on their evaluation in one or more metrics. However, it is important to note that most models are not optimal in all evaluation metrics. For instance, models similar to AR4 have a very low RC but perform well in the remaining metrics. This is made abundantly clear when examining the ROC-AUC curves of the archetypes, presented in Figure 2, where different performances per archetype are detected. Hence, the financial institution of this dataset should carefully consider whether this model suits its purposes or whether it would be better to select a model belonging to a different archetype that may present better RC performance (e.g., AR1).

CONCLUSIONS AND FUTURE DIRECTIONS
The aim of this study is to present a benchmarking method that detects classifiers with varying performance. Generally, results may vary significantly, with no gold standard, as the AA solution can differ based on the utilized dataset, with both effective and ineffective models used in drawing conclusions. Nevertheless, the proposed solution is a valid way of determining the performance of multiple models. Despite the challenges, the application of AA provides rich results. For example, results indicate that the ANN classifier with complex parameter tuning often emerges as a poor model. This leads us to the conclusion that the ANN algorithm is quite challenging to tune adequately, requiring considerable time and expertise.
Additionally, RF and ANN showcase better performance compared to the traditional LDA and LR classifiers, though LDA and LR have a solid performance even without parameter tuning and class imbalance treatment. Regarding class imbalance, SMOTE and ADASYN are particularly effective, while Near Miss and SMOTE-ENN performed poorly in most model combinations. Additionally, the ANN classifier appears to perform ineffectively in combinations where oversampling techniques are not used. Moreover, feature selection generally leads to more effective results than selecting all variables.
Regarding metrics, ANN models with complex parameter tuning, or models with the NM technique, usually have a high Brier Score. Moreover, models such as RF with tuning, oversampling and feature selection, and ANN with simple parameter tuning and oversampling, excel in the majority of the evaluation metrics.
One limitation of our work is the selection of only public datasets, as the use of private datasets was not possible due to restricted access. However, the AA benchmarking process can potentially be used with any dataset, as long as the proper preprocessing is employed. In addition, no outlier handling was performed, which can influence AA results. Finally, AA needs to be complemented by other methods for a better interpretation of the results. As this work serves as an introductory study in loan default prediction using AA, future work directions include the employment of private datasets and the use of additional methods and visualizations that support AA. In addition, we plan to further investigate outlier treatment and its influence on AA benchmarking, as well as to experiment with different techniques and classifiers.

Table 2: Combination of techniques and classifiers per dataset

Table 4: AA Output

Table 7: Indicative a-coefficients of models to archetypes