Network-based stratification analysis of 13 major cancer types using mutations in panels of cancer genes
© Zhong et al.; licensee BioMed Central Ltd. 2015
Published: 11 June 2015
Cancers are complex diseases with heterogeneous genetic causes and clinical outcomes. It is critical to classify patients into subtypes and associate the subtypes with clinical outcomes for better prognosis and treatment. Large-scale studies have comprehensively identified somatic mutations across multiple tumor types, providing rich datasets for classifying patients based on genomic mutations. One challenge associated with this task is that mutations are rarely shared across patients. Network-based stratification (NBS) approaches have been proposed to overcome this challenge and used to classify tumors based on exome-level mutations. In routine research and clinical applications, however, usually only a small panel of pre-selected genes is screened for mutations. It is unknown whether such small panels are effective in classifying patients into clinically meaningful subtypes.
In this study, we applied NBS to 13 major cancer types with exome-level mutation data and compared the classification based on the full exome data with those focusing only on small sets of genes. Specifically, we investigated three panels, FoundationOne (240 genes), PanCan (127 genes) and TruSeq (48 genes). We showed that small panels not only are effective in clustering tumors but also often outperform full exome data for most cancer types. We further associated subtypes with clinical data and identified 5 tumor types (CRC-Colorectal carcinoma, HNSC-Head and neck squamous cell carcinoma, KIRC-Kidney renal clear cell carcinoma, LUAD-Lung adenocarcinoma and UCEC-Uterine corpus endometrial carcinoma) whose subtypes are significantly associated with overall survival, all based on small panels.
Our analyses indicate that effective patient subtyping can be carried out using mutations detected in smaller gene panels, probably due to the enrichment of clinically important genes in such panels.
Cancers are complex diseases with highly heterogeneous causes and clinical outcomes. At the molecular level histologically and clinically similar patients often exhibit drastically distinct genomic aberrations. Large-scale studies such as The Cancer Genome Atlas (TCGA) have comprehensively cataloged multi-layered genomic aberrations in multiple cancers [1, 2]. Identification of clinically meaningful subtypes of tumors based on molecular patterns is critical to provide insights into the biological mechanisms of tumor progression and to guide better treatment and prognosis. Most previous studies of tumor classification utilized messenger RNA (mRNA) expression [3–5], which often yielded subtypes that are not highly predictive of clinical outcomes [1, 6]. On the other hand, somatic mutations, which often disrupt the function of mutated genes, provide insights not only to the mechanisms of tumorigenesis and progression but also the candidates for targeted therapy [7–9]. Therefore classification of patients based on somatic mutations may provide more effective clinical guidance. However, mutations are rarely shared across patients [10, 11] so that the similarity between tumors cannot be directly measured based on mutated genes. Network-based stratification (NBS) was recently proposed to overcome this challenge by leveraging information provided in protein-protein interaction networks (PPI) . Briefly, NBS uses label propagation on PPI to assign higher values to non-mutated genes that are closer to genes (in PPI) that harbor mutations. This guilt-by-association principle governed by genetic networks has many applications for biological discovery utilizing prior knowledge [13, 14]. For somatic mutations in genes, this principle fits well with the underlying biology: driver genes are often interacting directly or indirectly in common pathways and mutations in different genes in the same pathway are likely to cause genetically similar tumors [11, 15–18]. NBS has been applied on several cancers using exome-level mutation data and showed improved association of subtypes with clinical outcomes than using mRNA data . In general NBS provides a unified framework to further investigate tumor subtyping by integrating somatic mutations with biological networks.
Exome-level mutations were used in previous NBS analyses. In routine research and clinical application, instead of exome sequencing, a viable cost-effective alternative is to screen mutations in a panel of pre-selected cancer genes [19, 20]. It is unknown whether such small panels are effective in classifying patients into clinically meaningful subtypes. Although exome sequencing provides comprehensive characterization of coding mutations, it is likely that a large portion of mutations are passengers, as it was estimated that few mutations in a patient are drivers (e.g. ranging from 2 to 8) [16, 21]. Such passengers, if included in the analysis, may obscure clinically and biologically important mutations. On the other hand, gene panels are usually designed to include known driver genes or genes involved in important pathways, resulting in highly enriched signal vs. noise. We reasoned that a panel of cancer genes, although likely that some important genes were not included, may provide adequate mutation information for effective subtyping. It is an important issue to investigate since ultimately it is only practical to routinely screen a panel of genes in clinical settings.
In this study, we set out to evaluate the effectiveness of various gene panels on classifying tumors into clinically meaningful subtypes. Specifically, we collected three panels for evaluation, representing different numbers of genes: FoundationOne, PanCan and TruSeq. Briefly, FoundationOne panel includes 240 genes; PanCan was derived from 12 cancer types analyzed by TCGA and includes 127 genes ; TruSeq was developed by Illumina and includes 48 genes. We applied the NBS approach to 13 major cancer types with a total of ~4000 solid tumor samples profiled by exome-sequencing, using the full exome-level mutation data (termed "Full" dataset) as well as the three cancer panels.
Materials & methods
Cancer gene panels and mutation data
Sample sizes of 13 tumor types before and after filtering to ensure sufficient mutations per sample and total samples.
(# mutations ≥ 6)
(# mutations ≥ 1)
(# mutations ≥ 1)
(# mutations ≥1)
Gene interaction network
HumanNet  v.1 (http://www.functionalnet.org/humannet/download.html) with the top 10 percent most confident edges were used as the gene-to-gene interaction network. Mutations were mapped to genes on this network for label propagation (see below). This gene interaction network was also transformed into a k-nearest-neighbors graph by connecting vertices (i.e. genes) v i and v j if v i is among the k-nearest neighbors of v j or if v j is among the k-nearest neighbors of v i . The result is a symmetric connectivity matrix, which was used to derive the graph Laplacian in network-regularized NMF (see below).
We set as previously described . The propagation function was run iteratively until Ft converges (|Ft+1 -Ft| < 0.001). Following the propagation, quantile normalization was applied to Ft to ensure each patient follows the same distribution for the smoothed mutation profile. We use F to denote the final normalized and smoothed mutation matrix.
where denotes the matrix Frobenius norm, W is an m by K matrix and H is a K by n matrix, with entries in both W and H non-negative. W is a collection of basis vectors or "metagenes", and H contains the loading of the basis. The value K controls the dimension reduction, and we used K = 3,4,5,6 in this study. L is the graph Laplacian of a k-nearest-neighbor network. We chose k = 11 as previously described . λ is the regularization parameter and the value was set as 200, which is on the same scale as previously described . The iterative algorithm proposed in Cai et al  was used to find solutions W and H. The iteration was run until the objective function converges (|Ft+1 -Ft | < 0.1).
In order to achieve robust clustering, we used consensus clustering  to generate final clustering of patients. Specifically, we ran network-regulated NMF using a random sample without replacement of 80% patients to construct a clustering, and repeated this process 500 times. The collection of 500 clustering results was used to construct the similarity matrix, which records the frequency with which each pair of patients was observed to share the same membership among all replicates. Hierarchical clustering with average linkage was generated based on the similarity matrix using the R "NMF" package. We used cophenetic correlation coefficient (ccc) to assess the dispersion of the consensus matrix as previously described  also using the R "NMF" package.
Survival analysis was performed using the R "survival" package. Kaplan-Meier survival curves were plotted for each NBS subtype and log-rank tests were performed to test the association of subtypes with survival. Fisher's exact tests were used to test the association of subtypes with tumor grade or stage.
We examined the clustering patterns of 13 cancer types for each of the gene sets (the three panels plus the Full dataset) by applying the NBS approach. The clustering outcomes (K = 3, 4, 5, 6) for each of the 13 cancer types were displayed in Supplemental Fig. S1-S13 (Additional file). Overall, we observed different clustering patterns across four sets of genes for a given cancer type and a rank K. For most cases, the smallest panel (TrueSeq) produced the clearest stratification while for the Full data the stratifications are less clear (e.g., ESO, GBM, KIRC, MM). In a few cases similar patterns were observed for different gene panels (e.g., CRC, HNSC). FoundationOne and panCan panels often produced clusters of unbalanced sizes. Among the three panels, FoundationOne and PanCan tend to produce similar patterns (e.g., KIRC, HNSC, GBM, LUAD, MM). For a few cancers (e.g., LUSC, OV) all three panels generated similar patterns. For a fixed panel, the clustering outcome appears insensitive to K (e.g., BLCA, KIRC, HNSC, LUSC, MEL and MM). The goodness of the cluster separation was assessed using the cophenetic correlation coefficient (ccc) and for clusters that exhibit clear patterns the ccc values are over 0.99. Consistent with the visual impression, the TrueSeq panel in general gave the clearest separation (median of ccc = 0.988), followed by PanCan (median ccc = 0.982), FoundationOne (median ccc = 0.972) and the Full set (median cccc = 0.627).
NBS subtypes associated with clinical data
Significant associations between NBS subtypes (clusters) and survival in 5 tumor types using 3 gene panels
We then investigated the association between clusters and tumor grade/stage. Of the cancer types (HNSC, KIRC, OV, UCEC) with grade or stage information available, we found no association between NBS subtypes and tumor grade/stage except for UCEC. As shown in Figure 2, the subtypes of UCEC tumors with relatively poor survival (red, pink) were also enriched for high grade (p = 0.03, Fisher exact test).
For some tumor types, e.g. HNSC, KIRC and CRC, different K values generated nested subtypes (Fig. S15-S17 in Additional file). For example, for HNSC the yellow cluster for K = 3 was further divided into yellow and black clusters for K = 4, and the black cluster for K = 4 was further divided into black and pink clusters for K = 5, while all other clusters remain the same (Fig. S15 in Additional file). For larger K values, NBS classified patients into finer subgroups with distinct survival curves, and as a result the significance of association of survival with subtypes is consistent across different K's (Fig. S15 in Additional file). Similar nested patterns were observed in KIRC and CRC (Fig. S16,S17 in Additional file).
To investigate the effectiveness of traditional clustering approaches on mutation data, we applied hierarchical clustering on the mutation data without network smoothing of the 5 tumors that showed significant associations with survival. All of the tumors except UCEC failed to produce clear clustering, and essentially all samples were clustering into a single group (data not shown). Although UCEC showed separable subtypes, such subtypes are not associated with survival, indicating that clustering without NBS failed to capture clinically relevant pathways that are encoded in protein interaction networks.
Our results were consistent with the previous work in which NBS approach was applied to three cancers: OV, UCEC and LUAD . Using the full exome-level functional mutation data, we observed classification of UCEC samples into 3 clusters with no significant difference in survival (log-rank test p-value = 0.63) (Fig. S14 in Additional file), similar to the result reported previously . We also observed significant association between OV subtypes and survival (log-rank p = 0.02) (Fig. S14 in Additional file), although the significance levels are not the same, probably due to different samples used in the two studies (e.g ~1/3 samples are different due to differences in downloaded data).
Exome and whole genome sequencing have identified somatic mutations across multiple tumors; however most mutations are deemed passengers. Comprehensive analyses have pinpointed important genes and pathways that drive tumor progression. This enables the classification of patients into genetically distinct subtypes so that biologically driven prognostication and therapy selection become feasible in clinic. Network-based approaches integrate mutations and protein interaction networks to achieve knowledge-guided subtyping and have shown promises in linking subtypes with clinical data. In this study we showed that using a panel of important genes can achieve superior classification than using the full exome-level mutations, and often has better predictive performance of clinical outcome as well. Due to extensive tumor heterogeneity, however, it is infeasible to design a single panel to fit all cancers. The performance of a gene panel will likely depend on specific tumor types. We investigated three panels with different contents and numbers of genes in an effort to span a range of scenarios. For optimal performance it may warrant a focused investigation if a custom gene panel is to be designed for a specific tumor type.
NBS approaches leverage prior knowledge of protein interaction networks to overcome the challenge of the sparsity of somatic mutations. It is unclear, however, how to construct an optimal network used in NBS. It is expected that biological networks in different organs or tissues are different, and some genes in the network may not even be expressed in certain tissues. Other complicating factors in the NBS procedure involve various tuning parameters, such as the α in the network propagation step, the γ in the graph-regularization of the NMF. Together with the influence of the genes in panels, it is unclear how to obtain robust and consistent subtyping. For example, some tumor types have clear clustering of patients but the subtypes are not associated with clinical outcomes. For such cases it is also possible that clinical relevant tumor subtypes, if present, are driven by other mechanisms, such as methylation or copy number aberration, instead of somatic mutations. It is generally unknown a priori what strategy is optimal in tumor classification for clinic use, and more integrative analyses across large cohorts may provide further guidance towards this goal.
The motivation behind NBS is to deal with the sparsity of mutation data. For mutations that are frequently observed in multiple patients, it may be tempting to apply clustering methods on the sparse mutation data without network smoothing. Our results showed that even in subtypes that are enriched for specific mutations, e.g. UCEC, it is crucial to leverage prior knowledge encoded in gene networks to classify patients into clinically relevant subtypes. Gene interaction networks used in the NBS approach are not merely to deal with the sparsity of mutations but to provide meaningful biological knowledge to have effective subtyping. It may be true in general that NBS is beneficial for subtyping for other somatic mutation data and effectiveness of NBS in other scenarios needs further exploration.
We applied NBS to 13 major cancer types using both exome-level mutation data and panels of genes for classification and showed that small panels are often more effective in clustering tumors than full exome data. In addition, subtypes discovered by panels of genes in 5 tumor types (CRC, HNSC, KIRC, LUAD and UCEC) are significantly associated with patient survival, while subtypes based on full exome mutations are less predictive of clinical data. Screening of panels of genes combined with the NBS analysis strategy has potential for clinical use and the performance may be further improved by selection of clinically important genes and proper gene interaction networks.
The publication costs for this article were funded by the last author.
This article has been published as part of BMC Genomics Volume 16 Supplement 7, 2015: Selected articles from The International Conference on Intelligent Biology and Medicine (ICIBM) 2014: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S7.
- Cancer Genome Atlas Research Network: Integrated genomic analyses of ovarian carcinoma. Nature. 2011, 474 (7353): 609-615. 10.1038/nature10166.View ArticleGoogle Scholar
- Cancer Genome Atlas Research Network, Kandoth C, Schultz N, Cherniack AD, Akbani R, Liu Y, et al: Integrated genomic characterization of endometrial carcinoma. Nature. 2013, 497 (7447): 67-73. 10.1038/nature12113.View ArticleGoogle Scholar
- Konstantinopoulos PA, Spentzos D, Cannistra SA: Gene-expression profiling in epithelial ovarian cancer. Nat Clin Pract Oncol. 2008, 5 (10): 577-587. 10.1038/ncponc1178.View ArticlePubMedGoogle Scholar
- Konstantinopoulos PA, Spentzos D, Karlan BY, Taniguchi T, Fountzilas E, et al: Gene expression profile of BRCAness that correlates with responsiveness to chemotherapy and with outcome in patients with epithelial ovarian cancer. J Clin Oncol. 2010, 28 (22): 3555-3561. 10.1200/JCO.2009.27.5719.PubMed CentralView ArticlePubMedGoogle Scholar
- Reis-Filho JS, Pusztai L: Gene expression profiling in breast cancer: classification, prognostication, and prediction. Lancet. 2011, 378 (9805): 1812-1823. 10.1016/S0140-6736(11)61539-0.View ArticlePubMedGoogle Scholar
- Cancer Genome Atlas Network: Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012, 487 (7407): 330-337. 10.1038/nature11252.View ArticleGoogle Scholar
- Chmielecki J, Pietanza MC, Aftab D, Shen R, Zhao Z, et al: EGFR-mutant lung adenocarcinomas treated first-line with the novel EGFR inhibitor, XL647, can subsequently retain moderate sensitivity to erlotinib. J Thorac Oncol. 2012, 7 (2): 434-442. 10.1097/JTO.0b013e31823c5aee.PubMed CentralView ArticlePubMedGoogle Scholar
- Olivier M, Taniere P: Somatic mutations in cancer prognosis and prediction: lessons from TP53 and EGFR genes. Curr Opin Oncol. 2011, 23 (1): 88-92. 10.1097/CCO.0b013e3283412dfa.View ArticlePubMedGoogle Scholar
- Pirazzoli V, Nebhan C, Song X, Wurtz A, Walther Z, Cai G, et al: Acquired resistance of EGFR-mutant lung adenocarcinomas to afatinib plus cetuximab is associated with activation of mTORC1. Cell Rep. 2014, 7 (4): 999-1008. 10.1016/j.celrep.2014.04.014.PubMed CentralView ArticlePubMedGoogle Scholar
- Lawrence MS, Stojanov P, Mermel CH, Robinson JT, Garraway LA, et al: Discovery and saturation analysis of cancer genes across 21 tumour types. Nature. 2014, 505 (7484): 495-501. 10.1038/nature12912.PubMed CentralView ArticlePubMedGoogle Scholar
- Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko P, et al: Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013, 499 (7457): 214-218. 10.1038/nature12213.PubMed CentralView ArticlePubMedGoogle Scholar
- Hofree M, Shen JP, Carter H, Gross A, Ideker T: Network-based stratification of tumor mutations. Nat Methods. 2013, 10 (11): 1108-1115. 10.1038/nmeth.2651.View ArticlePubMedGoogle Scholar
- Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007, 3: 140-PubMed CentralView ArticlePubMedGoogle Scholar
- Ideker T, Sharan R: Protein networks in disease. Genome Res. 2008, 18 (4): 644-652. 10.1101/gr.071852.107.PubMed CentralView ArticlePubMedGoogle Scholar
- Ciriello G, Cerami E, Sander C, Schultz N: Mutual exclusivity analysis identifies oncogenic network modules. Genome Res. 2012, 22 (2): 398-406. 10.1101/gr.125567.111.PubMed CentralView ArticlePubMedGoogle Scholar
- Kandoth C, McLellan MD, Vandin F, Ye K, Niu B, Lu C, et al: Mutational landscape and significance across 12 major cancer types. Nature. 2013, 502 (7471): 333-339. 10.1038/nature12634.PubMed CentralView ArticlePubMedGoogle Scholar
- Hanahan D, Weinberg RA: Hallmarks of cancer: the next generation. Cell. 2011, 144 (5): 646-674. 10.1016/j.cell.2011.02.013.View ArticlePubMedGoogle Scholar
- Kreeger PK, Lauffenburger DA: Cancer systems biology: a network modeling perspective. Carcinogenesis. 2010, 31 (1): 2-8. 10.1093/carcin/bgp261.PubMed CentralView ArticlePubMedGoogle Scholar
- Balko JM, Giltnane JM, Wang K, Schwarz LJ, Young CD, Cook RS, et al: Molecular profiling of the residual disease of triple-negative breast cancers after neoadjuvant chemotherapy identifies actionable therapeutic targets. Cancer Discov. 2014, 4 (2): 232-245. 10.1158/2159-8290.CD-13-0286.PubMed CentralView ArticlePubMedGoogle Scholar
- Frampton GM, Fichtenholtz A, Otto GA, Wang K, Downing SR, He J, et al: Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nat Biotechnol. 2013, 31 (11): 1023-1031. 10.1038/nbt.2696.View ArticlePubMedGoogle Scholar
- Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Kinzler KW, et al: Cancer genome landscapes. Science. 2013, 339 (6127): 1546-1558. 10.1126/science.1235122.PubMed CentralView ArticlePubMedGoogle Scholar
- Hoadley KA, Yau C, Wolf DM, Cherniack AD, Tamborero D, Ng S, et al: Multiplatform Analysis of 12 Cancer Types Reveals Molecular Classification within and across Tissues of Origin. Cell. 2014, 158 (4): 929-944. 10.1016/j.cell.2014.06.049.PubMed CentralView ArticlePubMedGoogle Scholar
- Lee I, Blom UM, Wang PI, Shim JE, Marcotte EM: Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res. 2011, 21 (7): 1109-1121. 10.1101/gr.118992.110.PubMed CentralView ArticlePubMedGoogle Scholar
- Lee DD, Seung HS: Learning the parts of objects by non-negative matrix factorization. Nature. 1999, 401 (6755): 788-791. 10.1038/44565.View ArticlePubMedGoogle Scholar
- Cai DH X, Wu X, Han J: Non-negative Matrix Factorization on Manifold. IEEE. 2008, 63-72.Google Scholar
- Stefano Monti PT, Mill Mesirov, Todd Golub: Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning. 2003, 52: 91-118. 10.1023/A:1023949509487.View ArticleGoogle Scholar
- Brunet JP, Tamayo P, Golub TR, Mesirov JP: Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA. 2004, 101 (12): 4164-4169. 10.1073/pnas.0308531101.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.