Investigating the Role of Gene Body Methylation and CpG Methylation in Cancer Subtyping

DNA methylation plays a critical role in emergence and advancement of many cancer types. However, studies that use DNA methylation data primarily focus on DNA methylation of CpG sites. The role of gene body (GB) methylation in cancer remains rather unclear. Particularly, integrating methylation data with gene expression, in order to improve the classification of disease subtypes, remains under-explored. In this study, we use SCClust (Sparse Canonical Correlation analysis with Clustering), a method that combines high-dimensional omics data using sparse canonical correlation analysis (sCCA), to gene expression and DNA methylation data for two cancer genomics datasets from The Cancer Genome Atlas (TCGA) to distinguish between underlying subtypes. We evaluate the identified subtypes using Kaplan-Meier plots and hazard ratio analysis on colon cancer and kidney cancer data. Our findings demonstrate that integrating both types of DNA methylation data with gene expression results in similar subtypes and outperforms existing methods. The underlying information revealed as a result of data integration is different depending on what type of methylation data is integrated with gene expression. The significance of this study lies in the advancements in understanding the epigenetics of cancer as well as categorization of cancer data that incorporates epigenome with transcriptome.


INTRODUCTION
In precision oncology, classification of patients into similar groups can be achieved by molecular subtyping i.e. clustering of cancer into homogeneous groups that harbor similar clinical characteristics.
The majority of efforts in molecular subtyping use transcriptomic data or one specific type of data rather than integrating multiple datasets in order to elucidate novel information [24].Tumorigenesis and tumor malignancy are strongly linked with epigenetic changes [27].Particularly, DNA methylation is an epigenetic change that takes place early on during tumor formation.There are two types of DNA methylation: CpG methylation refers to methylation of CpG islands in the gene promoter region, and gene body methylation refers to methylation that is widespread not only in promoters but also in gene bodies.Abundance of DNA Methylation on CpG islands is associated with many disease phenotypes, including cancer [38].It plays an important role in controlling gene expression in cancer cells.Although CpG methylation is known to silence genes, it has not yet been investigated in depth whether there is a correlation between gene body methylation and gene expression data [40].DNA methylation in the bodies of genes is yet to be addressed comprehensively but it is likely involved in important mechanisms such as of elongation of transcription, differential promoter usage as well as alternative splicing [41].It is established that gene body methylation is important in controlling transcription [8], however, there are no studies that perform molecular subtyping using gene-body methylation and gene-expression data.The majority of molecular subtyping efforts use clustering.However, clustering methods do not perform well on DNA methylation data due to the measurements comprising largely of zero values,despite being recorded on a continuous scale [13].However, we showed that when DNA methylation measurements are statistically integrated with gene expression data, the result is improved subtyping in terms of survival [7].Therefore, the efforts to utilize DNA methylation for patient subtyping primarily comprise of methods that integrate DNA methylation with other types of genomic data.
Recently, DNA methylation patterns across the gene body have garnered attention towards their role in transcriptional regulation and efficiency with only a few studies going beyond CpG methylation and investigate gene body methylation's regulation of gene expression.Gene body methylation was found to be a robust positive predictor of gene expression in human small cell lung cancer [17].Hypermethylation of gene bodies lead to high expression of cancer promoting genes in lymphoma cells [10].Gene body hypermethylation leads to aggressive behaviour of testicular cancer [2].Stefansson et al. [28] identified two novel sybtypes of luminal breast cancer using gene body methylation data.
The above-mentioned studies illustrate the the importance of gene-body methylation in tumorgenesis.However, the function and In this study, we attempt to fill the identified gap by proposing the use of our previously developed method SCClust (Sparse Canonical Correlation analysis with Clustering), in order to combine gene expression with CpG methylation and gene body methylation data, individually, for two TCGA cancer datasets, while mapping them to a lower-dimensional space (Figure 1).This is followed by clustering the results of the integrated data, followed by clustering evaluation using Kaplan-Meier plots.To our knowledge, there have not been any studies that perform sCCA on these data in cancer genomics.Elucidation of subtypes using integrated DNA methylation and gene expression data also remains unexplored.

MATERIALS AND METHODS 2.1 Data
We downloaded gene expression and DNA methylation profiles for 147 Colon Adenocarcinoma (COAD) and 183 Kidney Renal Clear Cell Carcinoma (KIRC) patients from the TCGA.CpG methylation data comprised methylated sites for each subject and gene body methylation data comprised methylated gene bodies for each subject.Gene expression data consisted of gene expression levels for each subject in the dataset.A summary of the datasets is provided in Table 1.The downloaded datasets are high-dimensional.For example, for kidney cancer, the gene expression dimensionality is 20,501, the CpG methylation dimensionality is 24,780 and the gene-body methylation dimensionality is 13,738 for 183 subjects, thus necessitating dimensionality reduction for analysis and aggregation.

Dimensionality reduction and clustering using sCCA
Canonical Correlation Analysis (CCA) [6] is a technique in statistics to linearly combine two random variables in a manner that maximizes the correlation among them.The sCCA variant of CCA solves the problem of multicollinearity by using sparse loadings in the CCA algorithm.sCCA methods have a common objective function: (1) , (2) −  ( (1) ,  (2) ) +    1 ( (1) ) +    2 ( (2) ) Here the penalty functions for the vectprs  (1) and  (2) are denoted by    1 ( (1) ) and    2 ( (2) ), respectively.This is an optimization problem where the solution is found through iteration [22].sCClust uses an sCCA implementation by csala et al. [3], which maximizes the correlation and uses the Elastic Net (ENet) regression model.The R package sRDA was used to carry out sCCA using sCClust with the modification that the sCCA function would accept two sparsity arguments instead of one.After jointly reducing the dimension of the DNA methylation and gene expression data, we identified cancer subtypes by clustering the reduced data with the k-means algorithm [11].

Survival measures and comparison counterpart
Survival analysis is used to calculate the percentage of data units that survive over time.A graphical method for displaying survival data is the Kaplan-Meier (KM) plot which signifies the set of subjects surviving over time displayed as a step function where each step shows occurrence of an event.In order to show the disparity between identified clusters, we need to measure separation between KM plots.One commonly used measure to achieve that is the minimum hazard ratio   [20].It is measured as the differences between clusters or in this case, cancer subtypes, in Cox's proportional hazards model.The assumption of proportional hazards needs to be satisfied in order to use hazard ratios in the survival analysis.Global Schoenfeld Test (GST) is a commonly used test for this purpose [5].Using GST, a probability greater than the significance level lead to the proportional hazard assumption being satisfied.To further support our analysis, we employ a second criterion for assessing survival curves called SEP statistic [23], which is used to measure the average difference between hazard rates.If the coefficient estimate is denoted by β , where  is the subtype in Cox's proportional hazards model and the number of patients is   such that  =1   β = 0 and  =1   = , then the SEP statistic can be defined as: exp(−  =1 (  /)| β |).In order to see if sCClust was the right choice for this analysis, we performed comparison with Similarity Network Fusion (SNF) [31], a state-ofthe-art method that builds similarity networks for all data sources and integrates them non-linearly.We implemented SNF using the R package CancerSubtypes [37].

RESULTS AND DISCUSSION
We performed a case study on two TCGA cancer datasets: Colon adenocarcinoma (COAD) is a type of cancer that begins in the colon or the large intestine.While it typically impacts older people, it can happen at any age.COAD is a major cause of cancer mortality worldwide [25].Tumor heterogeneity makes early diagnosis and effective treatment of colon adenocarcinoma difficult, thus it is important to study its epigenatic changes such as DNA methylation, as they can influence tumor heterogeneity [25].Kidney renal clear cell carcinoma, or KIRC, is the most prevalent kidney cancer and its molecular pathogenesis is not completely understood [15].Like colon cancer, it is reported as a heterogeneous disease that can benefit from molecular characterization for effective treatment.

Significance of Subtypes identified by integrating CpG methylation and gene expression
We applied sCCA from SCClust to gene expression and CpG methylation datasets for colon cancer in order to project the data into a lower dimensional space.We tried different combinations of the sparsity parameters and canonical components in order to maximize   .Once lower-dimensional canonical scores were obtained, we clustered them using the -means algorithm [11].The optimal number of clusters was determined by using elbow plots [16] for aggregated scores.The number of clusters obtained for SCClust CpG methylation and gene expression subtypes for colon cancer was four.This number of clusters agreed with the well-established subtypes of colon cancer [12].The process was repeated for kidney cancer.Following cluster analysis, KM plots were made for each  For COAD SCClust CpG methylation and gene expression, the   and SEP criteria are shown in Table 2.We obtained   and  values of 1.4227 and 1.4824, respectively, which gives better separation compared to SNF.For KIRC, the number of clusters was three.This number of clusters agreed with the well-established subtypes of KIRC [19].We obtained   value of 1.60707 and  value of 1.40176, thus indicating better performance than SNF.The analysis was valid for all the subtypes identified (GST p-values > 0.05).
We examined the canonical weights of the datasets to discover the genes selected by SCClust.Here is a brief analysis of the top genes for COAD: GNB5 (G Protein Subunit Beta 5) is a protein coding gene.It was selected by SCClust with the highest weight.Over-expression of this gene contributes to cetuximab resistance in colon cancer [26].The MLH1 (mutL homolog 1) gene that provides instructions for making a protein that plays an essential role in repairing DNA.It is also a biomarker for colon cancer [33].Our analysis selected EIF1AD (eukaryotic translation initiation factor 1A domain containing).Eukaryotic translation initiation factors are correlated with tumor progression in colon cancer [39].P2RY14 (Purinergic Receptor P2Y14) is also linked to metastasis in colon cancer [14].The selection of these genes by showed that genes underlying CpG methylation and gene expression subtypes are biologically relevant to colon cancer.For KIRC, RUNX 1 (RUNX Family Transcription Factor 1) was selected with the highest weight.This gene is a driver in KIRC [34].FCHO1 (FCH And Mu Domain Containing Endocytic Adaptor 1) is a protein coding gene, and is associated with the activation of several well-known pathways in kidney cancer [1], similar to STK17a (Serine/threonine kinase 17a) which is also selected by sCCA [9].Further, the expression THRA (Thyroid Hormone Receptor Alpha) correlates with kidney cancer patients' survival rate [18].

Significance of Subtypes identified by integrating gene body methylation and gene expression
We repeated the steps for integrating gene expression with gene body methylation data.We obtained four clusters for the aggregated scores for COAD and three for KIRC.The KM plots are shown in figures 2b and 2d.We obtained   and  values of 1.5267 and 1.5611, respectively (Table 2) for the colon dataset.For kidney cancer, we obtained   value of 1.8426 and  value of 1.4954, again, improvement from SNF results.In all model fits we obtained GST p-values > 0.05, rendering the analysis valid for all subtypes i.e. log rank  −  ≤ 0.05.Examination of canonical weights for gene expression dataset revealed that genes selected by sCCA were biologically relevant.For colon cancer, the top genes selected are: CAMK2N2 (Calcium/Calmodulin Dependent Protein Kinase II Inhibitor 2) is a protein coding gene was selected by sCCA with top weight and aberrant expression of this gene has been reported in colon cancer [30].ASPHD2 (Aspartate Beta-Hydroxylase Domain Containing 2) is also a protein coding gene and silencing correlates with improved prognosis in colon cancer [29].For kidney cancer, some of the genes selected are: The EVI2A (Ecotropic Viral Integration Site 2A) is associated with a low survival rate [32] in kidney cancer.This gene was selected by our method with the highest weight.OLFML1 (Olfactomedin Like 1) is a protein coding gene which is a prognostic biomarker in kidney cancer [35].ZFPM2-AS1 (ZFPM2 Antisense RNA 1) accelerates migration and proliferation of KIRC cells [21].Moreover, ADCY3 (adenylate cyclase 3) is used in prognostic models for kidney cancer [36].As seen, for both cancer types, the genes selected by both dataset integration methods were not identical, but there is an overlap in subtyping as demonstrated below.This means that the data fused with gene expression determines what genes are selected, hence the information that SCClust reveals is different for both subtyping results.

Subtype Comparison for the two datasets
Once we identified both types of subtypes using SCClust i.e.SCClust CpG methylation plus gene expression subtypes (CpG+GE) and SCClust gene-body methylation plus gene expression subtypes (GB+GE), we performed a pairwise comparison between the two.Each colon cancer CpG+GE subtype was significantly associated with colon cancer GB+GE subtype (Figure 3a).The results were alike for kidney cancer (Figure : 3b).In the heatmaps we see the pink squares along the diagonal indicates similarity between CpG+GE versus GB+GE common subtypes whereas blue and lavender off the diagonal indicate disparity between different groups.

CONCLUSION
Although DNA methylation is a key regulator of gene expression, the comprehensive methylation landscape of metastatic cancer has not been defined comprehensively.Due to the evident and critical role that DNA methylation plays as a transcription regulator, it is mandatory that relationships among these data are investigated statistically.Owing to the high dimensionality of omics data, the crux of the problem is to jointly reduce the dimension of datasets whilst preserving their complex correlation structure.In this study, we employed SCClust to integrate TCGA gene expression with DNA methylation of CpG and gene bodies, individually.Using a case study with two cancer datasets, we illustrated the performance in terms of survival analysis.All identified subtypes are statistically significant ( −  ≤ 0.05).Hazard analysis and comparison with similarity network fusion (SNF) showed better performance of SCClust.Analysis of canonical vectors for both subtyping results showed that different genes were selected for each type of subtyping.Also, despite this difference, the genes selected were biologically pertinent to the cancer dataset they were gauged from indicating that the underlying information these different sybtypes revealed is different.Another key finding is that the identified subtypes using SCClust to combine gene expression with different types of methylation data stayed consistent.The study exhibits improved molecular classification by integrating two types of omics data rather than one as demonstrated previously [31] and this is the reason for not comparing the results with traditional subtyping methods that use one type of omic data.In the future, we plan on expanding our work to other types of genomics data for instance mi-croRNA (miRNA) to elucidate it's relationship with gene expression in different types of cancer.

Figure 1 :
Figure 1: An overview of the methodology: (a) Sparse canonical correlation analysis is performed on integrated gene expression with either CpG methylation or gene body methylation in order to get canonical scores (b) K-means clustering is performed (c) validation is carried out using Kaplan-Meier plots.(d) both subtyping results are compared obtained cluster.Efficacy was measured by the separation in the KM curves for each subtype.Figures (2a and 2c) shows KM curves for CpG methylation integrated with gene expression data for colon cancer and kidney cancer respectively.The different colors illustrate different subtypes of cancer.The greater the distance between KM curves, the better the subtyping[4].All identified subtypes were statistically significant with log-rank p-values ≤ 0.05.

Figure 2 :
Figure 2: Survival probability as a function of time for different clusters/subtypes in colon and kidney cancer.The corresponding log-rank survival p-values are stated.

Figure 3 :
Figure 3: A heatmap showing the correlation between subtypes identified by SCClust CpG methylation + Gene expression versus Gene body methylation + Gene expression.The light colored blocks in the diagonal indicate similarity between subtypes in each combination.

Table 1 :
Datasets used in this study: Gene expression and DNA methylation datasets corresponding to kidney and colon cancer

Table 2 :
(Top) Log-rank p-value, minimum hazard ratio and SEP value for COAD and KIRC for Gene expression and CpG methylation integrated data using SCClust and SNF and (bottom) for Gene expression and gene body methylation integrated data