Integrative analysis of somatic mutations and transcriptomic data to functionally stratify breast cancer patients
© The Author(s). 2016
- Published: 22 August 2016
Somatic mutations can be used as potential biomarkers for subtyping cancer patients and predicting their outcomes. However, cancer patients often carry many somatic mutations that do not always concentrate on specific genomic loci, suggesting that the mutations may affect common pathways or gene interaction networks rather than common genes. The challenge is thus to identify the functional relationships among the mutations using multi-modal data. We developed a novel approach for integrating patient somatic mutation, transcriptome and clinical data to mine underlying functional gene groups that can be used to stratify cancer patients into groups with different clinical outcomes. Specifically, we use the distance correlation metric to mine the correlations between expression profiles of mutated genes from different patients.
With this approach, we were able to cluster patients based on the functional relationships between the affected genes using their expression profiles, and to visualize the results using multi-dimensional scaling. Interestingly, we identified a stable subgroup of breast cancer patients that is highly enriched with ER-negative and triple-negative subtypes, and the somatically mutated genes they harbor were capable of acting as potential biomarkers to predict patient survival in several different breast cancer datasets, especially in ER-negative cohorts, which have lacked reliable biomarkers.
Our method provides a novel and promising approach for integrating genotyping and gene expression data in patient stratification in complex diseases.
- Distance correlation
- Breast cancer patient stratification
- Functional analysis of somatic mutation
- Integrative analysis
The initiation, development, and metastasis of cancers are complicated processes involving multi-cell, multi-tissue interactions and communications. Most cancers are heterogeneous among patients, which leads to different clinical outcomes such as survival time and response to treatment. With recent rapid advances in next generation sequencing (NGS) technologies and in the computing capacity for processing and storing large data, more and more human cancer genomes have been systematically characterized, giving researchers great opportunities to carry out integrative analyses that identify potential molecular markers for stratifying patients into subtypes with different predicted clinical outcomes [1]. Currently, The Cancer Genome Atlas (TCGA) project harbors comprehensive data, ranging from genomic sequences, genetic variants, and transcriptomic and proteomic data to clinical data, for multiple types of human cancer tissues as well as normal tissues. It is a great resource for integrating data from different levels and mining the hidden interactions among them, which will shed light on cancer subtyping and prognosis as well as cancer initiation and development [2–4].
In the TCGA database, we often observe patients with many somatic mutations that can significantly alter the structure or function of the proteins encoded by the genes in which they reside (we refer to such an affected gene as a significantly mutated gene, or SMG). SMGs result from splice-site-change, nonsense, non-stop or frame-shift mutations. The prevalence of SMGs in almost all cancer types leads us to postulate that they may be used as signatures for subtyping and outcome prediction, or as a starting point for elucidating the tumorigenesis process. However, there is a major challenge in using SMGs for cancer patient stratification: the overlaps between the SMGs from different patients are usually small, and the lists usually do not converge on common pathways [1, 5]. For instance, the breast cancer (BRCA) project in TCGA has identified three commonly mutated genes, TP53, GATA3, and PIK3CA, but every patient has a much larger number of somatic mutations, which cannot be easily summarized and compared even at the pathway level [1]. Therefore, it is of great interest to identify the potential relationships between the mutated genes from different patients.
In this paper, instead of working directly on the gene lists, we propose to examine the functional relationships of the SMGs between different patients based on functional genomics data. One such functional measurement is the gene expression profile obtained from microarray or RNA-seq experiments, which has already been curated in TCGA. Specifically, given two sets of SMGs from two patients, we develop a method to establish the relationship between them based on the expression profiles of the two gene lists.
Given a list of genes with their expression profiles measured in a cohort of patients, one way to characterize their roles is to examine how these genes separate the patients. In other words, we can establish a “patient network” using the difference in the expression levels of the genes as the distance metric. Then, given two gene lists, we can compare the patient networks established by each of the lists. Their similarity provides pivotal information on how similar the roles of the two gene lists are among the patients.
Mathematically, such similarity between patient networks can be computed using a recently developed metric called distance correlation [6]. Therefore, in this paper, we develop a workflow for establishing the functional similarity among SMGs from different patients based on distance correlation. Our goal is to reveal the as yet unknown links between different SMGs, which indicate their functional relationships in the context of the human gene interaction network, and to use these relationships to stratify patients with different subtypes. While we demonstrate our approach using a breast cancer study, our method provides a novel and promising approach for integrating genotype and gene expression data in patient stratification in complex diseases.
The key component in this workflow is computing the distance correlation between a pair of gene lists (in this case, the expression profiles of two SMG lists from two patients). The intuition behind distance correlation is as follows: a gene list can be used to cluster the patient cohort of a heterogeneous disease, generating a clustering result. Two different gene lists will generate two results, and the results may be similar if the two gene lists play similar functional roles in the disease phenotype. The distance correlation measures the similarity of the two results.
In our case, we used the gene expression data (RNA-seq) of the entire cohort to compute the distance correlation. In theory, any gene expression dataset from a cohort with similar disease diversity can be used. More generally, any type of data that captures sufficiently deep functional relationships among genes, even data from healthy individuals, can be used.
After obtaining the distance correlation matrix for every pair of SMG lists, which represents the functional relationship between any two sets of SMGs in the context of breast cancer gene expression, we use this matrix to cluster the entire breast cancer cohort. The results should reveal groups of patients who share a common underlying perturbation resulting from seemingly different SMG lists.
Level-3 somatic mutation data derived from WES and level-3 RNA-seq data for breast cancer patients were downloaded from The Cancer Genome Atlas (TCGA, http://www.cancergenome.nih.gov) data portal in July 2013. Among all 876 patients available at the time of download, 445 had matching SMG and RNA-seq data. The data from these patients were chosen for further analysis. Level-3 RNA-seq data for 83 normal breast samples were also obtained from TCGA.
Somatic mutations derived from WES of the TCGA breast cancer patients were screened for significantly mutated genes (SMGs). An SMG was defined as a gene with a frame-shift indel, splice-site change, non-stop mutation, or nonsense mutation. Missense, silent, RNA and in-frame indel mutations were not included. For a specific group of patients, the number of SMGs refers to the size of the union of the SMGs in that group of patients. The SMGs of all patients analyzed in this study are listed in Additional file 1: Table S2.
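The screening rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the variant class labels are assumptions modeled on the `Variant_Classification` values used in TCGA MAF files, and the patient records are made up.

```python
# Assumed labels, modeled on TCGA MAF "Variant_Classification" values.
SMG_CLASSES = {
    "Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site",
    "Nonstop_Mutation", "Nonsense_Mutation",
}

def smg_by_patient(mutations):
    """Map each patient ID to the set of genes carrying an SMG-class mutation.

    `mutations` is an iterable of (patient_id, gene, variant_class) tuples.
    """
    smgs = {}
    for patient, gene, variant_class in mutations:
        if variant_class in SMG_CLASSES:
            smgs.setdefault(patient, set()).add(gene)
    return smgs

def group_smg_count(smgs, patients):
    """Number of SMGs of a patient group: size of the union of their SMG sets."""
    union = set()
    for patient in patients:
        union |= smgs.get(patient, set())
    return len(union)

# Made-up records for illustration only.
records = [
    ("P1", "TP53", "Nonsense_Mutation"),
    ("P1", "GATA3", "Frame_Shift_Ins"),
    ("P1", "TTN", "Missense_Mutation"),  # excluded: missense
    ("P2", "TP53", "Splice_Site"),
    ("P2", "MLL3", "Silent"),            # excluded: silent
]
smgs = smg_by_patient(records)
n_group = group_smg_count(smgs, ["P1", "P2"])
```

Here `n_group` counts TP53 once even though both patients carry it, matching the union-based definition in the text.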
Computing distance correlation
Distance correlation is a recently developed metric with two advantages [6]. First, it can be used to calculate the “correlation” between two matrices instead of just two vectors. Essentially, it measures how similarly two “feature sets” separate the same set of samples. Second, unlike Pearson correlation, which is based on a linear model, it can detect nonlinear relationships. These properties make it a good candidate for comparing the relationships between two gene lists.
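As an illustration of the metric, the following is a minimal NumPy sketch of the sample distance correlation in the Székely et al. formulation (double-centered pairwise distance matrices). It is not the authors' implementation. The function names are hypothetical, and in the setting of this paper each input would be a patients-by-genes expression matrix for one SMG list, with both matrices sharing the same patient rows.

```python
import numpy as np

def _centered_distances(x):
    """Pairwise Euclidean distances between rows of x, double-centered."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a vector as a single-feature matrix
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())

def distance_correlation(x, y):
    """Sample distance correlation between two feature sets on the same samples."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()                 # squared distance covariance
    dvar2 = (a * a).mean() * (b * b).mean()  # product of squared distance variances
    if dvar2 <= 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar2)))

# A purely nonlinear relationship: Pearson correlation vanishes, dCor does not.
x = np.linspace(-1.0, 1.0, 51)
y = x ** 2
pearson = np.corrcoef(x, y)[0, 1]
dcor_xy = distance_correlation(x, y)
```

In this toy case `pearson` is zero up to rounding while `dcor_xy` is clearly positive, which is the nonlinear sensitivity described above. The pairwise-distance step uses O(n²) memory, so it suits cohorts of a few hundred patients.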
Multidimensional scaling and clustering
To visualize the distribution of the patients under the proximity measurements defined by the distance correlation matrix, we applied multidimensional scaling (MDS) to embed the data points (each point represents a patient) in 3D space. Specifically, we used the Matlab function cmdscale() with its default settings. The distance correlation matrix $D_{\mathrm{dCor}}$ was first transformed to a dissimilarity matrix (using $1 - D_{\mathrm{dCor}}$) before MDS. K-means clustering was performed on the patients using the same dissimilarity matrix. It was carried out using the Matlab k-means function with the default squared Euclidean distance, 50 replicates, and K = 3 or 5.
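The two steps can be sketched in Python as a stand-in for the Matlab calls. This is an assumption-laden illustration rather than the authors' pipeline: `classical_mds` implements textbook classical MDS (the technique behind cmdscale), `kmeans` is a plain Lloyd's algorithm with random restarts that treats each row of the dissimilarity matrix as a feature vector, and the 6-patient matrix is made up.

```python
import numpy as np

def classical_mds(dissim, n_components=3):
    """Classical (Torgerson) MDS of a symmetric dissimilarity matrix."""
    d = np.asarray(dissim, dtype=float)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    b = -0.5 * j @ (d ** 2) @ j                # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(b)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[order], 0.0, None)   # drop negative eigenvalues
    return eigvecs[:, order] * np.sqrt(lam)

def kmeans(points, k, n_replicates=50, n_iter=100, seed=0):
    """Lloyd's k-means (squared Euclidean); keeps the lowest-cost replicate."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    best_labels, best_cost = None, np.inf
    for _ in range(n_replicates):
        centers = pts[rng.choice(len(pts), size=k, replace=False)]
        for _ in range(n_iter):
            d2 = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
            labels = d2.argmin(axis=1)
            new_centers = np.array([
                pts[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
                for c in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d2 = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        cost = d2[np.arange(len(pts)), labels].sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels

# Made-up distance correlation matrix: two groups of three patients.
dcor = np.full((6, 6), 0.1)
dcor[:3, :3] = 0.9
dcor[3:, 3:] = 0.9
np.fill_diagonal(dcor, 1.0)
dissim = 1.0 - dcor

coords = classical_mds(dissim, n_components=3)  # 3D embedding for plotting
labels = kmeans(dissim, k=2)                     # cluster on dissimilarity rows
```

Clustering on the rows of the dissimilarity matrix follows the description in the text. Clustering on `coords` instead would be a different design choice and need not give identical groups.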
Jaccard index computing
The Jaccard index between two SMG lists is defined as
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where A and B are the two groups of SMGs from any pair of patients in the TCGA BRCA cohort, A∩B is the set of genes shared by the two SMG groups, and A∪B is the union of the two groups.
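For concreteness, the Jaccard index of two gene sets can be computed as below. This is a minimal sketch: the function name and gene lists are illustrative, and returning 0 for two empty lists is an assumption made here, since the text does not cover that case.

```python
def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of gene symbols."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty SMG lists
    return len(a & b) / len(union)

# Made-up SMG lists for two patients.
patient_a = {"TP53", "GATA3"}
patient_b = {"TP53", "PIK3CA"}
j_ab = jaccard_index(patient_a, patient_b)  # one shared gene out of three
```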
For validation, the NCBI GEO breast cancer dataset GSE1456 (containing 318 patients of mixed types) [7] and the Netherlands Kanker Instituut (NKI) NKI-295 dataset (containing 295 patients of mixed types) [8] were used. These microarray datasets (and their specific subtypes) contain the gene expression data and matching survival times (in years) that are needed for survival analysis. The log-rank test was performed to determine the significance of the difference in survival time between two patient groups, and Kaplan-Meier curves were plotted.
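The two survival tools named above can be sketched with NumPy alone. The text does not say which software the authors used, so this is only an illustration of the standard Kaplan-Meier estimator and two-group log-rank test. The function names and toy survival times are hypothetical, and `events` marks observed deaths (True) versus censored follow-up (False).

```python
import math
import numpy as np

def kaplan_meier(times, events):
    """Return (event_times, survival_probabilities) for right-censored data."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    surv, out_t, out_s = 1.0, [], []
    for t in np.unique(times[events]):
        at_risk = np.sum(times >= t)
        deaths = np.sum((times == t) & events)
        surv *= 1.0 - deaths / at_risk
        out_t.append(t)
        out_s.append(surv)
    return np.array(out_t), np.array(out_s)

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value), 1 d.f."""
    t = np.concatenate([times_a, times_b]).astype(float)
    e = np.concatenate([events_a, events_b]).astype(bool)
    in_a = np.concatenate([np.ones(len(times_a)), np.zeros(len(times_b))]).astype(bool)
    obs_minus_exp, var = 0.0, 0.0
    for u in np.unique(t[e]):
        n = np.sum(t >= u)                   # at risk, both groups
        n_a = np.sum((t >= u) & in_a)        # at risk, group A
        d = np.sum((t == u) & e)             # deaths at u, both groups
        d_a = np.sum((t == u) & e & in_a)    # deaths at u, group A
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if var == 0:
        return 0.0, 1.0
    stat = obs_minus_exp ** 2 / var
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square tail, 1 d.f.
    return stat, p_value

# Toy survival times in years; every event is observed.
times_a, events_a = [1, 2, 3, 4, 5], [True] * 5
times_b, events_b = [6, 7, 8, 9, 10], [True] * 5
km_t, km_s = kaplan_meier(times_a, events_a)
stat, p_value = logrank_test(times_a, events_a, times_b, events_b)
```

Plotting `km_s` against `km_t` as a step function for each group gives the Kaplan-Meier curves. For real analyses, an established survival package would be the safer choice.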
Pathway analysis and gene query in TCGA database
Ingenuity Pathway Analysis (IPA) was used to analyze enriched biological functions and pathways in the identified SMGs. The prevalence of the SMGs in other cancer types in the TCGA database was obtained using the cBioPortal online tools (http://www.cbioportal.org) [9].
When the patients were clustered using the K-means clustering algorithm, we observed a distinctive group of patients, highlighted by the red circle in Fig. 2. The number of clusters was chosen by checking the silhouette values and plots for different choices of K. The silhouette value peaks at K = 5 (data not shown), but this group is stable even when the number of clusters changes (e.g., K = 3 vs. 5). In addition, we inspected the silhouette plots and found that the clusters are better separated when K = 3. Thus we use K = 3 for most of the remaining analysis.
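The silhouette check can be illustrated with the short sketch below. It computes per-patient silhouette values directly from a precomputed dissimilarity matrix, which is an assumption made for this example: the text does not state which distance the silhouette values were based on. The function name and toy matrix are hypothetical, and at least two clusters are required.

```python
import numpy as np

def silhouette_values(dissim, labels):
    """Silhouette value of each sample from a precomputed dissimilarity matrix."""
    d = np.asarray(dissim, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: silhouette is defined as 0
        # Mean dissimilarity to own cluster (d[i, i] = 0 is excluded by n_own - 1).
        a = d[i, own].sum() / (n_own - 1)
        # Smallest mean dissimilarity to any other cluster.
        b = min(d[i, labels == c].mean() for c in clusters if c != labels[i])
        if max(a, b) > 0:
            s[i] = (b - a) / max(a, b)
    return s

# Toy dissimilarities: two tight groups of three patients.
dissim = np.full((6, 6), 0.9)
dissim[:3, :3] = 0.1
dissim[3:, 3:] = 0.1
np.fill_diagonal(dissim, 0.0)

good = silhouette_values(dissim, [0, 0, 0, 1, 1, 1]).mean()
bad = silhouette_values(dissim, [0, 1, 0, 1, 0, 1]).mean()
```

Comparing the mean silhouette across candidate values of K, as with `good` versus `bad` here, is one way to judge how well separated the clusters are.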
Statistical tests on the patient subtypes enriched in each group from the K = 3 clustering results. No statistical test was performed for HER2 (and TN) status because more than 25% of patients lack HER2 status. Column header: χ² adj-P value. Row labels: Total (with significant mutation and matching RNA-seq); Total in TCGA.
With the recent rapid development of next-generation sequencing technology and computing capacity, huge amounts of data in different modalities for cancer specimens have accumulated at remarkable speed in public databases. Integrating and mining these data has therefore become a major challenge in bioinformatics. In this work, we developed a novel approach to integrate genomic, transcriptomic and clinical data of cancer patients, specifically to compare the somatic mutations of patients based on their functional relationships in the context of gene expression profiles, thus tackling the challenge of the low overlap of mutated genes among cancer patients. By introducing the distance correlation metric to directly measure the relationship between two sets of genes affected by somatic mutations, we can not only cluster the patients into groups with different clinical subtypes, but also visualize the clusters and identify group-specific mutations. Distance correlation frees us from comparing only gene pairs and lets us compare gene lists directly. It captures not only the linear relationship between the two lists, as Pearson correlation does, but also non-linear relationships, and therefore covers a broader range of biological interactions.
In summary, a common challenge in studying complex diseases such as cancers is the lack of common genetic mutations among the patients. Besides pursuing commonly affected pathways, we provide a complementary approach for integrating genotype data with transcriptome data to study the relationships between genetic mutations at the functional level. While our main goal is to explore the functional relationships of mutated gene groups, the identified genes may also serve as potential biomarkers for different subtypes of cancers. Currently, due to the limitations of the data, we focus on the protein-coding genes from the WES experiments.
In the near future, we plan to apply the same workflow to other cancer datasets in TCGA to further test the effectiveness of this method and to identify diseases in which such functional relationships can lead to meaningful stratification of the patients. With the cost of whole genome sequencing decreasing dramatically, more somatic mutations in non-coding and regulatory regions are expected to become available, and the approach will need to be extended to accommodate such mutations.
BRCA, breast cancer; ER, estrogen receptor; GEO, gene expression omnibus; HER2, human epidermal growth factor receptor 2; IPA, ingenuity pathway analysis; MDS, multi-dimensional scaling; NCBI, National Center for Biotechnology Information; NGS, next generation sequencing; PR, progesterone receptor; RNA-seq, ribonucleic acid sequencing; SMG, significantly mutated gene; TCGA, the cancer genome atlas; TNBC, triple-negative breast cancer; WES, whole exome sequencing
We thank Ohio Supercomputer Center for computing support. This work was partially funded by NCI U01 CA188547 grant. ZA was funded by NLM fellowship.
The publication costs for this article were funded by the corresponding author.
This article has been published as part of BMC Genomics Volume 17 Supplement 7, 2016: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2015: genomics. The full contents of the supplement are available online at http://bmcgenomics.biomedcentral.com/articles/supplements/volume-17-supplement-7.
Availability of data and materials
All datasets used in this study are publicly available from the websites described in the Methods section.
KH conceived of the study. JZ, ZA and JP collected the data. JZ performed the computational coding and conducted the data analysis. JZ and KH drafted the manuscript. JP participated in the design of the method. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61–70.
- Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D. Methods of integrating data to uncover genotype–phenotype interactions. Nat Rev Genet. 2015;16(2):85–97.
- Kristensen VN, Lingjærde OC, Russnes HG, Vollan HKM, Frigessi A, Børresen-Dale A-L. Principles and methods of integrative genomic analyses in cancer. Nat Rev Cancer. 2014;14(5):299–313.
- Wang C, Machiraju R, Huang K. Breast cancer patient stratification using a molecular regularized consensus clustering method. Methods. 2014;67(3):304–12.
- Jia P, Zheng S, Long J, Zheng W, Zhao Z. dmGWAS: dense module searching for genome-wide association studies in protein-protein interaction networks. Bioinformatics. 2011;27(1):95–102.
- Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. Ann Stat. 2007;35(6):2769–94.
- Pawitan Y, Bjöhle J, Amler L, Borg A-L, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, Liu ET, Miller L, Nordgren H, Ploner A, Sandelin K, Shaw PM, Smeds J, Skoog L, Wedrén S, Bergh J. Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res. 2005;7(6):R953–64.
- van ’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415(6871):530–6.
- Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, Sun Y, Jacobsen A, Sinha R, Larsson E, Cerami E, Sander C, Schultz N. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal. 2013;6(269):11.
- Putnik M, Zhao C, Gustafsson J-Å, Dahlman-Wright K. Global identification of genes regulated by estrogen signaling and demethylation in MCF-7 breast cancer cells. Biochem Biophys Res Commun. 2012;426(1):26–32.
- Banerjee S, Bacanamwo M. DNA methyltransferase inhibition induces mouse embryonic stem cell differentiation into endothelial cells. Exp Cell Res. 2010;316(2):172–80.
- Brenton JD, Carey LA, Ahmed AA, Caldas C. Molecular classification and molecular forecasting of breast cancer: ready for clinical application? J Clin Oncol. 2005;23(29):7350–60.
- Anders CK, Carey LA. Biology, metastatic patterns, and treatment of patients with triple-negative breast cancer. Clin Breast Cancer. 2009;Suppl 2:S73–81.
- Wahba HA, El-Hadaad HA. Current approaches in treatment of triple-negative breast cancer. Cancer Biol Med. 2015;12(2):106–16.
- Srinivasan RS, Nesbit JB, Marrero L, Erfurth F, LaRussa VF, Hemenway CS. The synthetic peptide PFWT disrupts AF4-AF9 protein complexes and induces apoptosis in t(4;11) leukemia cells. Leukemia. 2004;18(8):1364–72.
- Montesano R, Sarközi R, Schramek H. Bone morphogenetic protein-4 strongly potentiates growth factor-induced proliferation of mammary epithelial cells. Biochem Biophys Res Commun. 2008;374(1):164–8.
- Ignat M, Teletin M, Tisserand J, Khetchoumian K, Dennefeld C, Chambon P, Losson R, Mark M. Arterial calcifications and increased expression of vitamin D receptor targets in mice lacking TIF1alpha. Proc Natl Acad Sci U S A. 2008;105(7):2598–603.