Lung Squamous Cell Carcinoma (LUSC) Survival Analysis using TCGA Database

Lung squamous cell carcinoma (LUSC) have a high death rate compared to other cancer type, and the treatment of LUSC remained unsuccessful until now. Therefore, enhancing prognosis and investigating factors contributing to LUSC patient death is necessary. This study used the TCGA-LUSC dataset published by the GDC to separately and unitedly analyzed the clinical and genetic factors that show promising influences in predicting LUSC patient death. Univariate and multivariate analyses were conducted when analyzing clinical and mRNA-seq datasets separately. In the univariate analysis part, the log-rank test and univariate Cox regression were used. Random Forest model, LASSO model, and Multivariate Cox regression model were used in the multivariate analysis section. After factors were filtered separately for the clinical and mRNA-seq dataset, these factors were combined to fit a nomogram prediction model to investigate further the importance of each factor in predicting death by LUSC with respect to time. This study found that clinical factors, including pathological stage, diagnosis age, tissue and organ of origin, and prior malignancy history showed statistical significance in the clinical data analysis part. Moreover, GCGR, PLAAT1, DPPA2, HORMAD1, and FLRT3 showed statistical significance in the mRNA-seq data analysis part, with GCGR, PLAAT1, and HORMAD1 had negative correlations with the cancer survival time, while DPPA2 and FLRT3 had positive correlations. Combining these nine factors and fitting them into the nomogram model, this study found that the gene GCGR shows the highest relevance with patient death by LUSC, indicating that further study can be done in investigating the detailed relationship between GCGR and LUSC.


INTRODUCTION
Lung cancer has the second-highest patient population globally, with a high mortality rate [1].Male patients are more likely to develop lung cancer due to factors like smoking, age, and race.In the United States, over 400,000 Americans died from lung cancer, with an 80% mortality rate.Although smoking is an essential factor, other factors like age, places of living, and race also contribute to lung cancer [2][3][4].Genetic variables like microRNA (miRNA) have been shown to significantly influence lung squamous cell carcinoma (LUSC) diagnosis, suggesting that examining RNA's role in lung cancer could lead to potential treatments to address the high death rate problem [5,6].
With this high mortality rate, studies of lung cancer were never stopped.However, the mortality rate was only reduced by a few percent [7,8].Many studies introduced how one specific factor like gender, race, and smoking history contribute to the diagnosis of lung cancer [4,5,9].These univariate analyses show that many factors are good predictors.However, lung cancer's causes and death rate are complicated and cannot be solved by a single factor.
Current studies on lung cancer often have small or unrepresentative datasets, making metadata analysis crucial.The Cancer Genome Atlas Program (TCGA) by Nation Cancer Institute offers 33 different types of cancer, including LUSC and LUAD.LUSC has more diverse genes and pathways, and current treatment for LUSC has limited clinical progress [10][11][12][13].Analyzing TCGA-LUSC datasets and combining potential factors can help better understand LUSC's influential factors.Multivariate survival analysis can rank potential factors, identifying the first most influential factor and guiding further research [14,15].This exploratory analysis is essential due to the high mortality rate and limited understanding of lung cancer and can optimize screening and prognosis for LUSC lung cancer subtypes.
The purpose of this study is to investigate clinical and genetic variables that impact lung cancer survival probability over time utilizing statistical survival analysis methodologies.Survival analysis is a statistical field used to study time-related events [16].It is a suitable method of studying cancer, but it is not limited to applying it to cancer studies.Other fields related to time can also apply this concept.Cancer survival analysis uses techniques that other survival analyses might not use since cancer studies often include missing data in datasets [16].This study investigated clinical variables using statistical analysis methods such as the Log-rank test and the Cox proportional hazard model (Cox regression).The Logrank test, Cox regression, Least Absolute Shrinkage and Selection Operator (LASSO) model, and Random Forest were employed to examine the genetic components.In the final stage of the analysis, the researchers utilized clinically and genetically significant factors that had been determined in earlier steps.These factors were employed in the construction of a nomogram prediction model, along with the use of receiver operating characteristic (ROC) curves.The purpose of this model and the ROC curves was to estimate the probability, denoted as S(t), of an individual's survival from the time of their lung cancer diagnosis to a specific future time point, denoted as t.Many methods were applied since each one displayed different aspects of lung cancer.Thus, applying multiple methods is necessary for displaying a comprehensive survival analysis and is easier for results' cross-validation.

Data Sources and Description
The TCGA-LUSC data was used in this study.The TCGA data can be downloaded online or using R packages.The TCGA-LUSC dataset had 60660 observations and 553 variables, which were acquired by the National Cancer Institute and the National Human Genome Research Institute from a variety of organizations and hospitals.The TCGA-LUSC dataset encompassed comprehensive clinical and genomic information pertaining to lung cancer, with a special focus on the lung squamous cell carcinoma subtype.The variable overall_survival was right censored, and the unit was day.

Overall Workflow
All data analysis was conducted in RStudio.First, the TCGA-LUSC dataset used for the survival analysis was downloaded to a local computer using the TCGAbiolinks package in R. Figure 1 shows the general workflow of processing the TCGA-LUS dataset.The TCGA-LUSC dataset was divided into two datasets: a clinical dataset with clinical variables and a gene dataset with different genes.They both went through univariate analysis and multivariate analysis.After clinical and genetic factors were sifted with factors that showed statistical significance in univariate and multivariate analyses, these factors were combined and then used in the nomogram prediction model.Subsequently, a ROC curve was employed to evaluate the precision of the predictive model.

Clinical Data.
The clinical lung cancer data was extracted from the TCGA-LUSC dataset using the SummarizedExperiment R package.Then, a log-rank test or a univariate Cox regression (in the case of a continuous variable) was employed to get the p-values for each clinical component exhibiting fewer than 50% missing data.Clinical factors with p-values less than 0.05 were retained for further multivariate analysis.Furthermore, multivariate Cox regression was used to obtain factors that show statical significance in influencing LUSC patients' survival probability.Each clinical factor's hazard rates (HR) were recorded after conducting the multivariate Cox regression.Finally, assumption tests for Cox regression were conducted to check the appropriateness of using the Cox regression on clinical data.

Genetic Data.
Figure 2 shows the detailed workflow of processing mRNA data.Three different types of gene differential expression analyses were conducted using packages DESeq2, limma, and edgeR.The mRNA was categorized into three distinct groups using all three methods: normal genes, up-regulated genes, and down-regulated genes.In subsequent stages, only the genes that exhibited upregulation or downregulation were retained.The gene expression profile was obtained using two screening criteria: a p-value below 0.05 and a multiple difference above 2. Afterwards, a cross-validation procedure was used to get the set of genes that exhibit reciprocal influence across all three techniques.Subsequently, a univariate Cox regression model was employed to fit each individual gene.Genes exhibiting p-values below the threshold of 0.05 were selected for subsequent study employing three distinct models, namely Random Forest, LASSO Model, and multivariate Cox Regression.The initial step was the use of the Random Forest model to rank the genes identified by differential expression (DE) analysis.This ranking was based on the genes' relative value in predicting the chance of mortality in individuals with lung squamous cell carcinoma, a specific subtype of lung cancer.Subsequently, a LASSO model was employed to assess the contribution of the top 20% of genes, namely the first 100 genes (n = 502), in terms of their ranking.The multivariate Cox Regression model was ultimately utilized to ascertain the genes that have a substantial impact on individuals with lung cancer.The genes that exhibited statistical significance following this procedure were integrated with clinically important variables that also shown statistical significance to undergo additional analysis.Figure 3 shows the HR and p-values for these four factors.Two variables, tissue_or_organ_of_origin and ajcc_pathologic_stage, had one subcategory that had large p-values.The lower lobe of the lung as the reference category of the variable tissue_or_or-gan_of_origin, the middle lobe of the lung had a p-value of less than 0.001; other locations had a p-value of 0.942; the upper lobe of the lung had a p-value of 0.018.Only one subcategory of the variable tissue_or_organ_of_origin had large p-values.Since this variable represented the two locations with only a few samples combined, the loss of statistical significance was likely due to combining these two locations and the small sample size.Thus, the variable tissue_or_organ_of_origin is a vital factor influencing patients' survival probability.Furthermore, in comparison to the lower lobe of the lung, the middle lobe of the lung exhibited a hazard ratio (HR) of 3.9.This finding suggests that when the tumor origin in lung cancer patients shifted from the lower lobe to the middle lobe of the lung, the risk of mortality due to lung cancer increased by a factor of 3.9.The hazard ratio (HR) of the upper lobe of the lung was found to be 1.5.This suggests that there was a 1.5-fold increase in the risk of mortality due to lung cancer when the tumor origin shifted from the lower lobe to the upper lobe of the lung in the patient with lung cancer.One more variable that exhibited a subcategory with a significant p-value was ajcc_pathologic_stage.
The reference category for the variable ajcc_pathologic_stage is pathological Stage I. Stage II exhibited a p-value of 0.402, whereas Stage III had a p-value of 0.005.The Stage II subcategory of the variable tissue_or_organ_of_origin had relatively large p-values.This variable was kept because there was an increasing trend of HR and a decreasing trend of p-values as the pathological stage increased, indicating that the possibility of patients experiencing death increased as the pathological stage increased.The positive HR of 1.7 for Stage III+ suggests that when the lung cancer patient's pathological stage changes from stage I to Stage III or more, the hazard of death due to lung cancer increases by a factor of 1.7.
The variables that showed statistical significance were age_at_index and prior_malignancy.The variable age_at_index had a p-value of 0.014 and an HR of 1, indicating that age_at_index showed statistical significance in predicting death due to lung cancer.However, for each one-year increase in the LUSC patients' diagnosis age, there was no change in the hazard of death from lung cancer.The variable prior_malignancy had a p-value of 0.046 and an HR of 1.5, indicating that compared to patients with no prior malignancy, for patients with prior malignancy, the hazard of death due to lung cancer increases by a factor of 1.5.
In Figure 3, the variable age_at_index had a p-value of 0.014 and an HR of 1.The mismatch in these two values may have resulted from the lack of consideration of the slightly non-linearity, as shown in Figure 6.Moreover, this may suggest that the precision of the estimate might be increased as a result of the inadequate sample size.For example, 72 people were diagnosed with lung cancer when they were 63.However, only one person had lung cancer when he or she was 39 years old.Ignoring this mismatch and focusing on the p-values of the multivariate Cox regression model for clinical data, all four variables left: prior_malignancy, age_at_index, ajcc_pathologic_stage, and tissue_or_organ_of_origin.These four clinical variables are the results of the clinical data analysis part. 2. The p-value column shows that all five variables and the global test had p-values greater than 0.05, suggesting that there was no statistically significant association between the residuals and time, so confirming the adherence to the Cox regression assumption that covariates do not exhibit a connection with time and that the risks remain proportionate.

Assumption Tests for Cox Regression. The Cox regression assumption test result for all four clinical variables included in the multivariate Cox regression analysis is presented in Table
Figure 4 is a visualization of the proportional hazard test for the multivariate Cox regression model of clinical data using the Schoenfeld residual test.All four solid line were horizontal, indicating that the PH assumption was met for all selected variables.The black solid lines are the smoothing spline fit of each plot, and the dashes black lines represent the range of 2 standard errors.Figures 5 and 6 have been assigned to assess the potential impact of influencing data or outliers on the results of the Cox regression analysis.Figure 5 presents the predicted changes in the regression coefficients when each observation is sequentially removed.Figure 5 illustrates that while there are variations in the size of residuals, with age_at_index exhibiting a narrower range of +/-0.002compared to other tissue_or_organ_of_origin locations which had a wider range of +/-0.2, none of the individual observations exert a significant influence when comparing the magnitudes of the largest residuals values to the regression coefficient.Furthermore, the scatter plots demonstrate an even distribution of points across all plots.Additionally, Figure 6 presents the graphical representation of the deviation residuals.Deviance residuals may serve as indicators of outliers inside a statistical model.The residuals depicted in Figure 6 exhibit a near-symmetrical distribution centered around zero, suggesting the absence of any potential outliers that may have exerted an impact on the outcomes.
Figure 7 tested the linearity assumption for the continuous variable age_at index.Figure 7 shows that the age_at index had a linear form (a slight nonlinearity is there), suggesting that age_at_index is a proper fit in the multi-cox regression model.All assumptions for the Cox regression were met, suggesting that the multivariate Cox regression for clinical data is appropriate; therefore, the data generated in the clinical data section was eligible to use in further analysis.
Figure 4 to 7 and Table 2 shows assumption tests for the multivariate Cox regression for the clinical dataset.In Figure 5, all variables had residuals evenly distributed above and below 0, indicating that the test results for all four variables are accurate.To conclude, all assumptions were met for these four variables.Hence, conducting the Cox regression test was appropriate.The reason for this is presumably due to the substantial size of the TCGA-LUSC mRNA dataset, which necessitates the utilization of the limma package, known for its compatibility with big datasets.

mRNA Univariate Analysis.
During the univariate analysis, the Log-rank test and univariate Cox regression were employed.Three different types of measures of the quantity of transcript expression in RNA-seq data (count, tpm, and fpkm) were used to fit these two models.DESeq2 package was used to analyze all three datasets.P-value of 0.01 was used to sort the effective genes.Figure 9 shows that there were 527 genes left after cross-validate the result from all three datasets, and these 527 genes were used in the multivariate survival analysis afterward.10 shows the top 20 genes after fitting the Random Forest model.TLX2, H2BC9, and DPP6 were the three genes that showed the highest mean decrease Gini score.To be more specific, TLX2 had the relatively largest mean Gini gain, indicating that by the Random Forest model and among all these 527 genes, TLX2 plays the most crucial role in influencing the death probability the LUSC patients.The higher the mean decrease Gini value, the higher the weight in which this gene plays a role in predicting the death probability of LUSC patients.The top 100 genes were collected after fitting the Random Forest model since there were 527 genes left in the previous step, and the first 100 genes were the top 20% of genes.These 100 genes were used in the LASSO model applied afterward.There was the need to fit in more than one model since each model had its advantage and disadvantage.One advantage of RF is that human biological systems are complex; hence, non-linear relationships between genes and LUSC are likely involved; random forest can evaluate non-linear relationships and is suitable for fitting high-dimensional datasets.However, RF can be computationally intensive, and there is no guarantee that the result from RF will be better than other simpler methods like linear regression and support vector machines.Therefore, there was a necessity to use other models.
The 100 genes sorted from Random Forest were then fitted in the LASSO model.Using the min lambda and 1se lambda values computed using the glmnet( ) function, 10 and 6 genes were informative in predicting the death probability of LUSC patients with time.Figure 11 shows the AUC results using the min lambda values (blue) and 1se (red) values.The gene list sorted using lambda min was then used in the multivariate Cox regression model since it had a higher AUC value.Both AUC values were less than 0.7, indicating that there's poor discrimination between values predicted by the LASSO model and random guesses.This poor discrimination was probably due to the non-linear relationship between genes and the probability of death by LUSC, or the underfitting of the LASSO model since among the top 20% of genes selected by the RF model, only 10 out of 100 genes left after using the LASSO model.Thus, an underfitting might occur, leading to the exclusion of many effective genes and hence a low AUC score.On the other hand, unlike the RF model, which might be influenced by overfitting, the LASSO model avoids this.
Figure 12 shows the Cox regression result.There were 5 out of 10 genes showed statistical significance in the ten genes sorted by the LASSO model using the lambda min value.The gene names were GCGR, PLAAT1, DPPA2, HORMAD1, and FLRT3.GCGR, PLAAT1, and HORMAD1 had hazard ratios (HR) that were less than 1, indicating that these genes had negative correlations with the survival time: the higher expression levels of these genes are associated with a lower risk of death by LUSC.Using GCGR as an illustrative scenario, an estimated HR value of 0.75 indicates that a one-unit increase in GCGR expression is linked to a 25% reduction in the likelihood of the event (namely, death by LUSC) occurring at any given time, while keeping all other factors in the model constant.The genes DPPA2 and FLRT3 exhibited hazard ratios (HRs) beyond 1, suggesting that these genes had positive correlations with the LUSC patient survival time: the increased expression levels of these genes are associated with an elevated likelihood of mortality due to LUSC.The hazard ratio (HR) of DPPA2 was calculated to be 1.08, indicating that a one-unit increase in the expression of DPPA2 is linked to an 8% increase in the hazard of the occurrence, while assuming that all other factors in the model remain same.These five genes ranked in the top 15% of the genes in the RF model; their ranks were 12, 23, 34, 54, and 79, respectively.Therefore, although the LASSO model showed poor accuracy, the five genes selected in the Cox regression had statistical meaning in contributing to the death of LUSC patients.

Combined (Clinical and Genetics) Data
Figure 13 shows the Nomogram prediction result.The gene GCGR had the most potent effect among all the clinical and genetic data.GCCR, the glucose receptor, shows promising effects in affecting liver cancer; however, this study shows that this receptor involves with glucagon signaling also has a statistically significant negative correlation with the LUSC patients' survival times.Additional research is required to explore the intricate association between the glucagon receptor (GCGR) and lung squamous cell carcinoma (LUSC).One counterintuitive result in Figure 13 is that as the years of experiencing LUSC increase, the survival probability of the patient increases as well.This might be because treatment against  Figure 14 shows the ROC curve for the nomogram prediction model in Figure 13.The AUC value for two years is above 0.7, indicating that the model predicts better than random guesses.However, the AUC values for 4 and 6 years could have been better.This is most likely due to the small sample sizes of these two time frames.The sample size of less than two years survival time was 327.The four-year sample size was 52, and the six-year sample size was 40.The small sample size and high variance in variables might be the inaccuracy reason for the prediction model.

Limitations
One assumption for Cox regression is that the censored data is non-informative.However, this assumption might be violated in the TCGA-LUSC since some patients might quit if their situations worsened.While Cox regression works better for continuous variables, most clinical variables were categorical, leading to a lower accuracy of the model results [16].Moreover, models in this study generated purely statistical results; the lack of consideration of clinical meaningfulness might exclude some actual affective data (e.g., clinical variables related to smoking were removed because of the large number of missing values in the datasets).

CONCLUSION
This study used the TCGA-LUSC dataset to perform univariate and multivariate data analysis regards to LUSC lung cancer subtype.Upon the result of this study, the negative correlation between GCGR and LUSC has the potential to be further investigated since it affects the LUSC death probability with respective to time the most among the clinical and mRNA that showed statistical significance in this study.Moreover, future studies can focus on the patient that survived more than two years after being diagnosed with LUSC, specifically investigating why they survived longer than those who died in 2 years after being diagnosed since patients who survived longer had higher survival probability in the nomogram prediction in this study.

Figure 3 :
Figure 3: Forest plot Cox Proportional Hazards Model that showed HR and multivariate cox regression p-values for factors that were statistically significant during the univariate clinical data analysis.

Figure 4 :
Figure 4: Visualization of proportional hazard test for Cox regression using the Schoenfeld residual test.

Figure 5 :
Figure 5: Testing for outliers in clinical variables during the multivariate Cox regression test.

Figure 6 :
Figure 6: Testing for outliers in clinical variables during the multivariate Cox regression test.
Figure 8 shows the DE analysis results.The three volcano plots, Figure 8(a), Figure 8(b), and Figure 8(c), were results generated from R packages DE-Seq2, limma, and edgeR, respectively.The top left corner is the volcano plot displaying the result using the DESeq2 package.The top right corner displays the result of the DE analysis using the limma package.The bottom left corner shows the volcano plot generated from the edgeR DE analysis.The Venn diagram shows the cross-validation results of applying the three differential expression methods.2214 genes were either up or down regulated in all three methods, and these genes were used to conduct survival analysis afterward.The use of the limma function yielded the most distinct outcome in comparison to the other two ways, as seen by the Venn diagram.

3. 2 . 3
Three Method used for the mRNA Multivariate Analysis.Figure

Figure 8 :
Figure 8: Visualizations of three DE analysis and the cross validation result these three methods.

Figure 9 :
Figure 9: Cross validation result of the univariate analysis using datasets that contains three different measures of RNA-Seq.

Figure 10 :
Figure 10: Example outcome of fitting 527 genes into the Random Forest model.

Figure 11 :
Figure11: LASSO model AUC results.The AUC value in blue was derived from using minimum lambda value.The AUC value in red was derived from using the 1se lambda value.

Figure 12 :
Figure 12: The result of multivariate cox regression model.

Figure 13 :
Figure 13: Nomogram prediction model result.GCGR shows the highest contribution in affecting the LUSC patient death.

Figure 14 :
Figure 14: ROC curves for the nomogram prediction model in Figure 13.The AUC value of 4 and 6 years were below 0.7.

Table 1 :
Selected clinical variables that had p-values less than 0.05 Factors that showed statistical significance when performing clinical and genetic data analysis were combined into one dataset to further investigate their united effects on LUSC using the nomogram prediction model.Both logistic and Cox nomogram prediction models were used.Then, an ROC curve was produced to assess the accuracy of the Cox nomogram model.
ajcc_pathologic_stagePathologic staging is a comprehensive approach that integrates the findings Obtained from both clinical staging, which involves physical examination and imaging tests (as described in the Clinical stage), and surgical outcomes.0.0170 * tissue_or_organ_of_origin The word utilized to denote the anatomical location from which the patient's malignant condition originates, as delineated by the International Classification of Diseases for Oncology (ICD-O) established by the World Health Organization (WHO).0.0002 *** age_at_index The age of the patient, expressed in years, as determined on the reference or anchor date employed during the process of date obfuscation.0.0356 * prior_malignancy The binary indicator employed to characterize the patient's medical background pertaining to previous cancer diagnosis.0.0415 * a The definitions were collected from NIH GDC website (https://gdc.cancer.gov/about-data/gdc-data-processing/clinical-data-standardization).b p-values less than 0.05 were marked as '*.' p-values less than 0.01 were marked as '**.' p-values less than 0.001 were marked as '***.' 2.2.3 Combined Data.log-rank test (or univariate Cox regression), indicating that these variables were statistically significant in their impact on the survival probability of lung cancer patients in the clinical TCGA-LUSC dataset.3.1.2Multivariate Cox Regression Result.Figure 3 is the Forest plot for multivariate Cox regression for the clinical factors that were statistically significant during the univariate log-rank test.

Table 2 :
Assumption test results for clinical multivariate Cox regression