Individualized Discrimination Model for Breast Cancer

Breast cancer is one of the key research topics today. Most researchers use the Global Burden of Disease comparison tool to characterize the incidence of breast cancer, but little research addresses determining malignancy quickly from clinical data. Therefore, this study examines a relevant dataset collected on Kaggle and filters its variables. The four significant clinical variables remaining after screening were put into three learning models in the R language to observe each model's accuracy in predicting the malignancy of breast cancer. The first model is logistic regression, the second is random forest, and the last is K-Nearest Neighbors. This research finds that the logistic regression and K-Nearest Neighbors models show equally good results: both predict the malignancy of breast cancer with an accuracy of 91.23%. Accordingly, based on the results of this experiment, it is recommended that doctors use these two models for early prediction in breast cancer patients. These two models can also help health insurance companies assess patients' disease risk; a personalized insurance plan will give customers more peace of mind.


INTRODUCTION
Breast cancer has a global impact. According to data compiled by the International Agency for Research on Cancer (IARC) in GLOBOCAN 2020, female breast cancer incidence has reached approximately 2.3 million cases [1], affecting 5% of women worldwide, and its prevalence is even higher, at 12.5%, in high-income countries [2]. Although the scientific community has conducted much research, the pathogenesis and risk factors of breast cancer are still not fully elucidated [2]. Breast cancer incidence is on a continuous upward trend, persistently maintaining its position as the primary contributor to the cancer-related disease burden among women [2]. At the same time, breast cancer incidence is highly correlated with the human development index [3]; in other words, less developed regions tend to have higher breast cancer incidence and mortality [3]. Breast cancer is thus a heavy burden for people in underdeveloped countries [3]. Furthermore, although women tend to come to mind when breast cancer is mentioned, men are not free of risk for the disease. According to the Male Breast Cancer 2021 study, although male breast cancer is generally considered a rare disease, its incidence increases yearly [4]. Moreover, male breast cancer is typically diagnosed at an advanced stage [5]. The incidence of breast cancer is clearly on the rise in both men and women. Breast cancer accounted for approximately 11.7% of the estimated 19.3 million new cancer cases worldwide in 2020 [1] and has become the leading malignancy worldwide [1]. Therefore, both men and women should increase their awareness of breast cancer.
So far, many studies on breast cancer have used the Global Burden of Disease (GBD) comparative tool to describe the incidence, mortality, disability-adjusted life years, and other aspects of breast cancer [6]. The benefits of using the GBD comparative tool include effectively assessing modifiable risk factors contributing to cancer, such as dietary habits and environmental factors [7]. These modifiable risk factors have been researched extensively and inform scientific prevention strategies [8]. For non-modifiable risk factors, such as genes, protein-truncating variants in ATM, BRCA1, BRCA2, CHEK2, and PALB2 have been found to be associated with breast cancer risk [9]. In addition, many models have been used to predict breast cancer mortality, such as the smoothed Lee-Carter (SLC) model and the functional demographic model (FDM) [10]. Research has shown that it is necessary to compare multiple models simultaneously to obtain relatively accurate predictions of breast cancer mortality [10]. However, there are currently insufficient models for building personalized discrimination models based on the individual factors critical to describing the nature of breast cancer. Therefore, this study aims to create a personalized discrimination model using relevant data from breast cancer clinics. Such a model can assist doctors or biomedical organizations in determining more quickly and conveniently, from specific clinical data, the probability that a cancer is benign or malignant.

Data Information
This study is based on a Kaggle dataset on Wisconsin breast cancer. The dataset includes measurements of eight dimensions clinically related to breast cancer. These eight variables are filtered by correlation to obtain four main factors associated with diagnosis (benign or malignant). Perimeter, smoothness, concave points, and symmetry are used as independent variables in three machine learning models commonly used in the R language. The correlation between each variable and the nature of the cancer can then be observed, and the models are used to predict that nature. Finally, the prediction accuracy of the three models is compared to determine which model is more suitable for predicting the chance of malignancy.

Three models
The three machine learning models used in this research are logistic regression, random forest, and K-Nearest Neighbors (KNN). Firstly, logistic regression is a statistical method used for binary classification tasks, where the target variable can take only two possible values (such as yes or no, 0 or 1) [11]. The model assumes a linear relationship between the features and the log odds of the target variable [11]. The advantages of logistic regression lie in its ability to handle large amounts of data and its high interpretability. Additionally, logistic regression does not assume specific distributions for the features, which is another strength [11]. Logistic regression models have been widely applied in cancer research, for example to explore individual factors influencing cancer patients' survival rates [12]. However, the model has limitations: its application is constrained when the relationship between features and target is not linear, or when dealing with high-dimensional data with complex interactions.
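As an illustration of the logistic regression approach described above, the following sketch fits such a model in Python with scikit-learn (this paper's own experiments use R); the Wisconsin dataset ships with scikit-learn, so the example is self-contained. The four mean features match those selected later in this study; the train/test split and random seed are illustrative choices, not the paper's.

```python
# Illustrative sketch: logistic regression on four mean features of the
# Wisconsin breast cancer dataset, predicting malignancy.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
features = ["mean perimeter", "mean smoothness", "mean concave points", "mean symmetry"]
idx = [list(data.feature_names).index(f) for f in features]
X = data.data[:, idx]
y = (data.target == 0).astype(int)  # recode so 1 = malignant, 0 = benign

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Standardizing first helps the solver converge; the fitted coefficients are
# the log-odds effects the paper's output table reports.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
print(f"test accuracy: {model.score(X_te, y_te):.3f}")
```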
The second model applied is random forest, an ensemble learning method based on decision tree classifiers [13]. During training, it constructs multiple decision trees and combines their predictions to make the final decision [13]. One advantage of a random forest is that each tree is trained on a random subset of the data and features, reducing the risk of overfitting and improving generalization [13]. Random forests have been widely applied in tumor diagnosis and medical image analysis; they can classify tumors based on features extracted from medical images and clinical data. For instance, one study used a random forest model based on radiomic features from MRI to grade rectal cancer tumors [14].
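A minimal scikit-learn sketch of the random forest idea follows (again in Python rather than the paper's R). The 500-tree setting mirrors the experiment reported later; the split and seed are illustrative assumptions.

```python
# Illustrative sketch: a random forest on the Wisconsin data, aggregating
# many decision trees into one prediction.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# n_estimators = number of trees; max_features = variables considered at each
# split (the role played by "mtry" in R's randomForest package).
rf = RandomForestClassifier(n_estimators=500, max_features=2, random_state=0)
rf.fit(X_tr, y_tr)
print(f"test accuracy: {rf.score(X_te, y_te):.3f}")
```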
Finally, the KNN model is a non-parametric learning algorithm for classification and regression tasks [15]. One advantage of KNN is its non-parametric nature: it makes no assumptions about the underlying data distribution [15]. Unlike logistic regression, KNN does not assume a linear relationship between the features and the target variable's log odds [15], so it can capture complex nonlinear relationships between features and the target variable [15]. This flexibility is also considered an advantage of the model. KNN has likewise been applied to cancer diagnosis, such as the discrimination of skin cancer [16]. However, the need to search for the k nearest neighbors of each new data point at test time makes KNN computationally less efficient than other algorithms, especially for large-scale medical datasets [15].
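The KNN classifier can be sketched the same way (Python/scikit-learn stand-in for the R model). Because KNN is distance-based, features are standardized first; k = 5 anticipates the value selected by cross-validation later in the paper, and the split is an illustrative assumption.

```python
# Illustrative sketch: k-nearest neighbors classification on the Wisconsin
# data. Standardization matters because KNN compares raw feature distances.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)
print(f"test accuracy: {knn.score(X_te, y_te):.3f}")
```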

METHODS
This research utilized a dataset on breast cancer diagnosis in Wisconsin, USA, obtained from the Kaggle website and published by a user named Utkarsh Singh. It contains 569 samples with 31 variables: 30 quantitative measurements related to clinical features of breast cancer and one qualitative variable indicating malignancy. The 30 quantitative measurements comprise ten clinical measurements, each computed in three different ways (mean, standard error, and worst). This research divided the original dataset into three groups, by mean, standard error, and worst, for preliminary variable selection, and generated histograms using the plotting functions of the R programming language. The histograms show that, although the numerical results for the ten variables differ across quantification methods, the data distributions and trends are very similar. In other words, the same variable may take different values under different quantification methods while conveying the same meaning. Therefore, this study uses the ten mean-based variables from the original dataset together with the qualitative variable indicating malignancy (diagnosis). The following subsections describe how the study proceeds with data feature description, further variable selection, and model fitting on this dataset.
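The 10 × 3 structure of the quantitative columns can be verified programmatically. The sketch below uses scikit-learn's copy of the same Wisconsin dataset, where the mean-based columns carry the prefix "mean " (the Kaggle CSV instead uses suffixes such as `radius_mean`); the prefix filter is therefore specific to this copy of the data.

```python
# Sketch of the variable-selection step: isolating the ten "mean"
# measurements from the 30 quantitative columns.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
mean_cols = [name for name in data.feature_names if name.startswith("mean ")]
print(len(data.feature_names), len(mean_cols))  # 30 quantitative columns, 10 means
print(mean_cols)
```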

Data Feature
Here is the result for the preliminary dataset. The variables are diagnosis, radius mean, texture mean, perimeter mean, area mean, smoothness mean, compactness mean, concavity mean, concave points mean, symmetry mean, and fractal dimension mean. Using the R language, it is observed that there are no missing data in this dataset, which comprises 357 benign and 212 malignant cases (as shown in Table 1). In terms of composition, the malignant and benign tumor data are relatively balanced. A relatively balanced dataset is beneficial for building a learning model in R to predict whether breast cancer is malignant, as it yields more accurate predictions. In addition, ten box plots (as shown in Figure 1) visualize the data, with diagnosis depicted on the X-axis ("B" indicating benign and "M" indicating malignant). Table 2 offers a more quantitative summary of these ten variables.

Data Selection
Figure 1's box plots present the distribution of each variable. Texture mean has the highest number of outliers among the ten variables, so the variable "texture mean" is removed from the dataset. Additionally, fractal dimension is a peculiar variable: there is no significant difference in its measurements between malignant and benign cases, whether in range, minimum value, or maximum value. Thus, fractal dimension provides little help in determining the nature of breast cancer.
Next, this study analyzes the variables further to better select the data related to diagnosis (B and M). First, a heatmap allows visual observation of the correlations between variables; a stronger correlation between two variables indicates greater redundancy between them.

Fitting Models
This research applies the refined dataset to three learning models in R, with the objective of determining which model better predicts whether breast cancer is malignant or benign. The first model is logistic regression, the second is the random forest model, and the last is K-Nearest Neighbors (KNN). After fitting these models, a confusion matrix is generated for each to evaluate its prediction performance.
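The evaluation step just described can be sketched as follows (Python/scikit-learn stand-in for R's `confusionMatrix`); the model choice and split here are illustrative assumptions.

```python
# Sketch of the evaluation step: a confusion matrix comparing predicted and
# actual diagnoses for one fitted model.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)  # rows = actual class, columns = predicted class
```

The off-diagonal cells of `cm` count the mispredicted samples, which is how the misclassification counts reported later in the paper are read off.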

RESULTS AND DISCUSSION
This experiment aims to evaluate the predictive performance of three R language models using selected clinical variables from the Wisconsin breast cancer diagnosis dataset found on Kaggle. The study focuses on determining which model better predicts the malignancy of breast cancer based on the provided clinical data, namely perimeter mean, smoothness mean, concave points mean, and symmetry mean. The following four sections describe the methods used and compare their respective predictive outcomes.

Logistic Regression Model
After fitting the data to a logistic regression model, the output table describes the relationship between the predictor variables (X) and the dependent variable (breast cancer diagnosis) as the log odds of the positive class (malignant). As shown in Table 4, the subsequent analysis focuses on each clinical variable's result. First is "perimeter mean." In this model, for every one-unit increase in "perimeter mean," the log odds of predicting a sample as malignant breast cancer increase by 0.116 units. The coefficient's standard error is 0.024, indicating a highly reliable estimate, and the corresponding z-value of 4.767 shows that the coefficient is highly significant (p-value of 1.87e-06). This analysis indicates that the "perimeter mean" variable plays a vital role in predicting malignant breast cancer.
The second variable is "smoothness mean," with an estimate of -0.203 and a standard error of 22.385. The relatively large standard error suggests low reliability for this estimate, and the z-value of -0.009 indicates that "smoothness mean" may not be statistically significant in predicting breast cancer malignancy. The third variable is "concave points mean." Its estimate is positive and its p-value is 3.26e-05, indicating that higher values of "concave points mean" are associated with an increased probability of malignancy. The last variable is "symmetry mean." Although its estimate is 3.791, its p-value is not significant, suggesting that "symmetry mean" may not be an important predictive factor.
To evaluate the performance of the developed logistic regression model, 5-fold cross-validation was employed. According to the cross-validation results shown in Figure 3, the model achieved a high accuracy of 91.23%, a commendable performance. The confusion matrix of the logistic regression model elucidates the quantitative relationship between the predicted and actual values: the high accuracy indicates that the model correctly classified a substantial proportion of cases in the test set. The Kappa statistic is 0.8134, marking a substantial level of agreement between the predicted and actual values.
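The 5-fold cross-validation and Kappa computation above can be sketched as follows (Python/scikit-learn stand-in for R's caret workflow; the exact figures will differ from the paper's since the pipeline details are illustrative assumptions).

```python
# Sketch: 5-fold cross-validated accuracy plus Cohen's kappa for a logistic
# regression model on the Wisconsin data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression())

acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
pred = cross_val_predict(clf, X, y, cv=5)  # out-of-fold predictions
print(f"mean CV accuracy: {acc.mean():.4f}")
print(f"kappa: {cohen_kappa_score(y, pred):.4f}")
```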

Random Forest Model
The random forest algorithm is extensively applied to classification and regression tasks in machine learning. It constructs a collection of decision trees by randomly selecting subsets of the training set, then combines the outcomes of these trees to predict the target class of a given test instance. In this experiment, the random forest model predicts whether the outcome of breast cancer, based on the shared clinical variables, is benign or malignant. The advantage of the random forest lies in its ability to dampen the influence of noise by aggregating the outputs of multiple decision trees rather than relying on a single tree, an aggregation that leads to more robust results.
To ensure optimal performance, the effectiveness of a random forest depends on its hyperparameters, two key ones being the number of trees and the number of variables considered at each split. In this study, cross-validation was conducted to determine the ideal number of variables to evaluate at each split while keeping the number of trees fixed at 500. Through this process, the optimal number of variables per split was found to be 2.
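The tuning procedure above can be sketched as a cross-validated grid search (Python/scikit-learn stand-in for caret's `mtry` tuning; the candidate grid is an illustrative assumption, and the selected value need not match the paper's).

```python
# Sketch: fix the forest at 500 trees and cross-validate the number of
# variables considered per split ("mtry" in R, max_features here).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=0),
    param_grid={"max_features": [2, 4, 6, 8]},
    cv=5,
)
grid.fit(X, y)
print("best max_features:", grid.best_params_["max_features"])
print(f"best CV accuracy: {grid.best_score_:.4f}")
```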
The random forest model applied in this experiment achieved an accuracy of 86.84%. The confusion matrix in Figure 4 provides a more intuitive visualization, revealing that only a small number of samples were mispredicted: 8 malignant tumor samples were predicted as benign, and 7 benign tumor samples were predicted as malignant.

K-Nearest Neighbors Model
K-Nearest Neighbors (KNN) is another machine learning algorithm for classification and regression. The KNN model predicts the category of unknown samples based on feature similarity. In this experiment, KNN uses the values of variables such as perimeter, smoothness, concave points, and symmetry to find the known samples most similar to an unknown sample, then predicts the malignancy of breast cancer from the categories of those samples. The final predictions of the KNN model vary with the value of k; after cross-validation, the optimal k for this study was determined to be 5. The prediction results demonstrate that KNN is well suited to this experiment's data: the accuracy is as high as 91.23%, with a Kappa value of 0.8134. As shown in Figure 5, the confusion matrix indicates that only ten samples were mispredicted.

Comparison
After prediction with the three learning models, the accuracy of each model is obtained. As shown in Table 5, the accuracy of logistic regression is 0.9123, that of random forest is 0.8684, and that of K-Nearest Neighbors is 0.9123. Overall, all three models perform well in predicting the probability of breast cancer malignancy on the current dataset; comparatively, however, logistic regression and K-Nearest Neighbors demonstrate the stronger performance.
The results of this study are similar to those of a recent study investigating malignant breast cancer classification in Wisconsin. That research reported a predictive accuracy of 95% for logistic regression, which outperformed other machine learning models [17]. In contrast, this experiment achieves equally excellent predictive accuracy (91.23%) for the logistic regression and KNN models and does not show a significant advantage of logistic regression over KNN. Furthermore, there are substantial differences in data selection between this experiment and previous ones. To mitigate overfitting, correlation analysis was conducted on multiple variables; this analysis revealed high correlations among several independent variables in the dataset, resulting in reduced utilization of clinical data related to breast cancer in this study.

CONCLUSION
This study finds that the logistic regression and K-Nearest Neighbors models in R can predict breast cancer malignancy well from the required clinical data: both models achieve an accuracy of 91.23% in this experiment. However, this study has limitations regarding the data's geographic origin and the specific clinical measurements used. First, the dataset used for modelling consists of breast cancer diagnosis statistics from the state of Wisconsin in the United States, and diagnostic data from a single region is insufficient for studying breast cancer in general. Whether the clinical data trends of malignant breast cancer remain the same across geographical areas has yet to be investigated; for example, Asian and European populations differ in body size and growth environment, so whether these two models are suitable for predictive analysis of breast cancer in Asia requires further research. Secondly, this study used only the mean-based clinical measurements, and it is essential to investigate whether different measurement methods could lead to different experimental results.


Table 1: Number of samples

Table 2: Summary of data on quantitative variables

Thus, digitizing the correlation between diagnosis and the other variables is the most effective data selection method. As shown in Table 3, the perimeter mean exhibits a stronger correlation with diagnosis than the radius mean and the area mean.

Table 3: Correlation with diagnosis

The standard error column represents the reliability of the estimated values: a smaller standard error indicates a more reliable estimate. The intercept's standard error is approximately 3.8, meaning that the intercept estimate is relatively reliable.

Table 4: Logistic regression model output table

Table 5: Comparison table of the three models