Reducing confounding and suppression effects in TCGA data: an integrated analysis of chemotherapy response in ovarian cancer
© Hsu et al.; licensee BioMed Central Ltd. 2012
Published: 26 October 2012
Skip to main content
© Hsu et al.; licensee BioMed Central Ltd. 2012
Published: 26 October 2012
Despite initial response in adjuvant chemotherapy, ovarian cancer patients treated with the combination of paclitaxel and carboplatin frequently suffer from recurrence after few cycles of treatment, and the underlying mechanisms causing the chemoresistance remain unclear. Recently, The Cancer Genome Atlas (TCGA) research network concluded an ovarian cancer study and released the dataset to the public. The TCGA dataset possesses large sample size, comprehensive molecular profiles, and clinical outcome information; however, because of the unknown molecular subtypes in ovarian cancer and the great diversity of adjuvant treatments TCGA patients went through, studying chemotherapeutic response using the TCGA data is difficult. Additionally, factors such as sample batches, patient ages, and tumor stages further confound or suppress the identification of relevant genes, and thus the biological functions and disease mechanisms.
To address these issues, herein we propose an analysis procedure designed to reduce suppression effect by focusing on a specific chemotherapeutic treatment, and to remove confounding effects such as batch effect, patient's age, and tumor stages. The proposed procedure starts with a batch effect adjustment, followed by a rigorous sample selection process. Then, the gene expression, copy number, and methylation profiles from the TCGA ovarian cancer dataset are analyzed using a semi-supervised clustering method combined with a novel scoring function. As a result, two molecular classifications, one with poor copy number profiles and one with poor methylation profiles, enriched with unfavorable scores are identified. Compared with the samples enriched with favorable scores, these two classifications exhibit poor progression-free survival (PFS) and might be associated with poor chemotherapy response specifically to the combination of paclitaxel and carboplatin. Significant genes and biological processes are detected subsequently using classical statistical approaches and enrichment analysis.
The proposed procedure for the reduction of confounding and suppression effects and the semi-supervised clustering method are essential steps to identify genes associated with the chemotherapeutic response.
Ovarian cancer is prevalent in women  and is associated with a high mortality rate as it is usually diagnosed at an advanced stage . A standard treatment of advanced ovarian cancer involves surgical resection followed by cycles of adjuvant chemotherapy, typically a combination of taxane-based regimens and platinum-based cytotoxic agents . The combination of paclitaxel and carboplatin is one of the most common first-line treatments of ovarian cancer [4, 5]. The mechanism of action (MOA) of paclitaxel is to stabilize microtubules and as a result it induces mitotic arrest and apoptosis , and the MOA of carboplatin is to bind with DNA and form intra-strand crosslinks so as to inhibit DNA replication and transcription, and eventually activate the p53-dependent apoptosis . In most patients, the initial responses to the combination of paclitaxel and carboplatin are good; however, subsequent relapses frequently occur . Unraveling the underlying mechanisms causing chemoresistance is crucial for personalized therapy and the improvement of patients' long-term survival.
Microarrays have been used to study genes and molecular functions associated with chemoresistance. For example, Jazaeri et al. (2005) detected differentially expressed genes among primary chemosensitive, primary chemoresistant, and postchemotherapy tumors using cDNA-based microarrays . Additionally, Hartmann et al. (2005) applied a supervised learning algorithm and selected 14 genes to predict the relapsed outcome of ovarian cancer patients after platinum-paclitaxel chemotherapy . Etemadmoghadam et al. (2009) further considered chromosomal aberrations and proposed that DNA copy number alterations (CNAs) at genes such as CCNE1 and NCOA3 are associated with chemoresistance . While many studies had proposed genes or pathways associated with chemotherapeutic response, most of these studies suffered from limited number of patients and patient diversity, as well as other confounding factors to a certain extent, particularly when the samples were derived from patients with different treatment plans. Since these factors such as tumor stage, subtype, and different chemotherapies may change clinical outcome significantly, reliable results could be difficult to achieve if these confounding effects were not adequately addressed during statistical analysis.
In this regard, the Cancer Genome Atlas (TCGA) data need to be carefully assessed for eligibility to a chemotherapy study. Recently, the TCGA Research Network concluded an ovarian cancer study with thousands of microarray data including mRNA expression, DNA copy number, miRNA, SNP, and CpG methylation data from more than 500 ovarian tumor samples . While a large number of samples provide ample opportunities to carry out sophisticated survival analysis, caution should be taken: patient ages, tumor stages and treatment cycles may confound the survival outcome, while various therapeutic compounds, their combination and sample processing batches may suppress the detection without proper handling. As an example, among more than 500 patients, treatments include avastin, bevacizumab, carboplatin, cisplatin, cytoxan, docetaxel, doxoribicin, etoposide, gemcitabine, navelbine, paclitaxel, and others. In addition, these samples were processed in 13 batches.
Herein, a procedure for reducing the confounding and suppression effects is proposed, in which, factors such as experimental batches, clinical treatment, patient ages, tumor stages, and molecular classifications are carefully considered and dealt with. Beginning with a batch effect correction, we chose eligible samples through a rigorous sample selection process. In this paper, we will focus only on patients with paclitaxel and carboplatin treatment in order to remove possible confounding factors due to better drug or treatment combination when examining the survival outcome, and in the meantime, to maximize the ability of discriminating tumor subtypes. After the selection, 85 ovarian cancer samples treated only with the combination of paclitaxel and carboplatin were selected for training, and another independent 83 samples treated mainly with the combination of paclitaxel and carboplatin but including some other drugs were applied for testing. Then, gene expression, copy number, and methylation data were analyzed in a novel semi-supervised clustering method. By performing a series of statistical hypothesis testing and clustering tasks, two molecular classifications with poor progression-free survival (PFS) were identified. Comparing these classifications to other samples with good PFS, genes significantly associated with chemotherapeutic response were detected, and enriched biological processes were further examined using a gene ontology enrichment analysis method.
In this paper, the proposed procedure and the semi-supervised clustering method are detailed with flow-charts and mathematical explanations in Methods. In Results, the clustering results and the subsequent differences in chemotherapeutic response are compared via Kaplan-Meier curves. Discussions of analysis results and conclusions are provided in Discussion and conclusions.
Getting started from collecting data, this section demonstrates the methods and criteria we applied for achieving the results. First of all, we downloaded the ovarian cancer data from the TCGA repository website (http://cancergenome.nih.gov/). Level 3 gene expression data derived from Aymetrix U133A platform, level 2 copy number data derived from Agilent CGH-1x1M platform, and level 3 methylation data derived from Illumina HumanMethylation27 platform were chosen and downloaded. The gene expression data are constituted of 12,042 normalized log2 values, and each value represents an expression level of a gene. Copy number data contain 962,434 normalized log2 ratios among which 358,119 ratios with gene annotations were utilized. To match these copy number ratios with gene expression levels, values corresponding to a same gene were averaged. Methylation data are 27,578 beta values, and each value refers to the percentage of methylation for a specific CpG site. Among the 27,578 beta values, 19,448 linked to 10,068 unique genes included in the gene expression data were considered in this study. In total, a cohort of 514 tumor samples comprehensively containing gene expression, copy number, and methylation profiles was obtained.
where and refer to the medians of b i and b ij , respectively.
Semi-supervised clustering has been proposed and applied for identifying molecular subtypes in a variety of studies . It outperforms unsupervised learning and naive splitting with an arbitrary survival threshold in detecting molecular subtypes since both the clinical outcome and the distribution overlapping between good prognosis and poor prognosis are considered . Herein, a carefully designed semi-supervised clustering method is proposed. By using a log-rank survival test and a series of refinement processes, in which the gene expression data were embedded, significant features related to patients' survival were selected. With an additional scoring function for data discretization, classifications with similar patterns but distinct chemotherapy responses can be identified using classical hierarchical clustering. In this study, we focused on obtaining genetically altered genes, rather than transcriptionally dysregulated genes. However, we do require genes with copy number changes with concordant gene expression change, potentially driving the phenotypic change. Following this objective, the semi-supervised clustering was applied to both the copy number data and the methylation data, and the methods from feature selection to hierarchical clustering for both data sets were basically the same. We only detail the clustering steps for copy number data. The modifications for methylation data are demonstrated later.
To reduce suppression effects due to the genomic (or epigenomic) differences and maximize the ability of discriminating poor prognosis from good prognosis, a rigorous feature selection process was applied, in which, factors such as patient ages and tumor stages were carefully considered to minimize confounding effects.
A feature in copy number data refers to a gene in human genome, and the feature selection was done basically by a univariate log-rank test, which is a non-parametric hypothesis test comparing the survival functions of two groups of samples, and a series of feature refinements. First of all, we define PFS as the interval between the date of surgical resection and the date when a failure event occurred (i.e., disease progression). For samples without progression, recurrence, or death, the dates of last follow-up were considered as PFS with censoring. Second, based on the copy number ratio c ik for gene i from sample k, copy number statuses were categorized into gain (c ik > 0.4), normal (-0.5 ≤ c ik ≤ 0.4), and loss (c ik < -0.5). Like a scanning process, the survival functions of the samples with normal copy number were tested against those with copy number gain (normal vs gain) as well as those with copy number loss (normal vs loss), feature by feature. To ensure the confidence in hypothesis testing, features leading to a small sample size (i.e., less than 15%, or less than 13) in either group were not considered, and features resulting in log-rank p-values less than 0.05 were then selected as candidates. After this, confounding factors such as ages and tumor stages were considered for feature refinement. Specifically, if the median of ages in both groups of samples were significantly different (p-values less than 0.05) in a Wilcoxon rank-sum detection test, or, if these two groups of samples' tumor stage assignment are distribute significantly differently (p-values less than 0.05 by Fisher's Exact test), then the selected features were removed from the candidates to eliminate possible detrimental effects from confounding factors. Moreover, the candidates were further refined using gene expression fold-changes. Similar to the scanning process used for survival comparison, the averaged gene expression levels derived from both groups of samples were compared, and the candidates causing expression fold-changes larger than 1.3 (or smaller than 1/1.3) were retained while those without significant expression fold-changes were disqualified for further consideration. We applied a relatively loose cutoff here only to preserve the concordance of gene expression alteration to the copy number change.
The scores were assigned on a feature-by-feature basis. If a feature was selected twice in both cases (i.e., the gains versus the normal and the losses versus the normal), the F-scores were assigned according to the selection with smaller log-rank p-value.
where F k indicates the vector of F-scores in sample k, F k' refers to the vector of F-scores in sample k', and the length of these vectors are equal to the number of selected features. With all pairs of sample-sample distance being evaluated using Eq. (7), complete linkage was applied, and subsequent clusters derived. We used Kaplan-Meier analysis to check if these clusters exhibited significantly different survival distributions, and proposed molecular classifications if the log-rank p-value was less than 0.05.
where F f indicates the array of F-scores in sample f , and F h refers to the array of F-scores in sample h. The maximum function used in Eq. (9) is to ensure no zero in denominator in Eq. (8). The sample f in the testing dataset was classified as a good profile if P f > 0 or as a poor profile otherwise. In this study, K was chosen to be 3. Samples classified as good profiles as well as those samples classified as poor profiles were compared using the Kaplan-Meier survival analysis.
where and refer to the samples with longer survival and the samples with shorter survival, respectively, determined by the log-rank test while evaluating the relation between PFS and hypermethylation or hypomethylation in feature i in the feature selection for methylation data. Also, refers to missing beta values in feature i of sample k. After data discretization, hierarchical clustering was applied to identify molecular classifications associated with chemotherapy response. Moreover, the testing datasets were also classified using the weighted KNN algorithm for classification justification.
Once the molecular classifications associated with chemotherapy response were identified, differentially expressed genes in comparing the poor copy number profiles and the good profiles, or in comparing the poor methylation profiles and the good profiles, were detected using classical statistical approaches and methods for gene expression analysis. The good profiles mentioned herein study refer to the samples neither classified as poor copy number profiles nor classified as poor methylation profiles. Since confounding and suppression effects were reduced by the proposed procedure, more strict thresholds were simultaneously applied: fold change larger than 1.5 and t-test p-values less than 0.01. After deriving the significant gene set, enriched gene ontology terms (biological processes) were identified using the Gene Ontology Enrichment Analysis Software Toolkit (GOEAST)  available online at: http://omicslab.genetics.ac.cn/GOEAST/.
From 514 tumor samples with gene expression, copy number, and methylation data, a subset of 85 samples were selected for training, and a subset of 83 samples were selected for testing according to the procedure outlined in Figure 2. As shown in the Additional file 1: Figure S2, the frequency of copy number alterations in the subset of 85 training samples is nearly the same as that in all 514 tumor samples, indicating no significant sampling bias in this regard due to the non-random selection.
The characteristics of samples
Number of samples
Number of samples in stage III
Number of samples in stage IV
Ages (years old)
Number of samples with censored observation
paclitaxel carboplatin other drugs
Significant genes and enriched ontologies were identified using classical statistical approaches and an enrichment analysis. Based on the training results, as shown in the Venn Diagram and the Kaplan-Meier curves in Figure 7, two molecular classifications associated with poor chemotherapy response (18 PPTs from copy number data and 30 PPTs from methylation data) and one group of samples with good chemotherapy response were identified. We begin with a comparison of the 18 PPTs derived from copy number profiles to the 45 GPTs. Using t-test and expression fold-change, 107 differentially expressed genes between these two groups of samples were detected, and the top 15 differentially expressed genes ranked by t-test p-values are shown in Additional file 1: Table S3. Moreover, significant biological processes were subsequently identified using GOEAST . The top 15 biological processes ranked by p-values are listed in Additional file 1: Table S4, where all of the listed biological processes resulted in a p-value less than 1.8 × 10-5. The other 30 PPTs derived from methylation profiles were also compared to the 45 GPTs, and 34 differentially expressed genes were detected using t-test and expression fold-change. Similarly, the top 15 differentially expressed genes ranked by t-test p-values are shown in Additional file 1: Table S5, and the top 15 significant altered biological processes are listed in Additional file 1: Table S6.
As high-throughput technologies such as microarray and short-read sequencing become more and more popular in biological studies, the complexity and dimensions these datasets possess make the statistical analysis more difficult. Additionally, extraneous variables resulting from sample variability may hinder analysts from avoiding false discovery and subsequently achieving reliable results. Dealing with these extraneous variables, whether confounding or suppression, is challenging. To address these issues, we proposed a procedure for the reduction of confounding and suppression effects, in which a batch effect correction, a sample selection process, and a semi-supervised clustering method were considered. The batch effect correction used an L/S adjustment to eliminate confounding effects due to experimental differences, and the sample selection was designed to focus on single chemotherapy, reducing variability due to different treatments. After batch effect correction and sample selection, a novel semi-supervised clustering method that further addressed the confounding from ages and tumor stages was applied, and gene expression values, DNA copy number ratios, methylation data, and clinical information were analyzed for unraveling molecular classifications associated with chemotherapeutic response. As a result, significant genes and enriched ontologies were identified.
The batch effect due to experimental processing is a known issue in microarray studies, and there has been several attempts to correct the batch effect for gene expression profiling experiments. For example, the software "COMBAT"  utilized empirical Bayes frameworks for adjusting the batch effect in gene expression particularly when the sample size is small. Despite the existence of sophisticated algorithms for batch adjustment in gene expression, we applied the simple L/S method for batch effect correction, since the TCGA dataset has large sample size (larger than 25) in most batches. Moreover, the simple L/S method can be easily extended and applied to the methylation beta values without causing much difficulty and computational burdens. To further improve the L/S method, one could consider to apply the median of absolute deviation (MAD) instead of Eq. (2) for estimating the standard deviation in a more robust way .
The molecular classifications were identified from copy number and methylation profiles. CNAs are common genomic aberrations shown to associate with mRNA expression level changes, gene function modifications, and significant differences in prognosis of a variety of tumors . Also, methylation status has been implicated in affecting the drug response . Since these genomic or epigenomic mutations do not necessarily lead to linear changes in gene expression levels [22, 23], expression fold-change rather than Pearson correlation was applied to evaluate their effects on transcription. For the semi-supervised clustering, the log-rank test was initially applied to select significant features associated with PFS. After removing features not affecting transcription or potentially correlated with ages and stages, a scoring function was applied to discretize and denoise data. Then, the classical hierarchical clustering method was applied, and two molecular classifications associated with poor chemotherapy response were identified. Interestingly, in the classification detected from copy number profiles, unfavorable scores were enriched in the region of 1p34.3 - 1p34.1, indicating a dominant CNA segment that might affect chemotherapy response. Also, in the classifi-cation detected from methylation profiles, samples were generally hypomethylated in the selected CpG sites. It is a good opportunity to further examine these regions for a better understanding of chemoresistance.
Differentially expressed genes detected by comparing the tumors with poor prognosis to the samples with good prognosis provide us candidate genes for studying the functions or mechanisms differentiating chemotherapy response. Comparing with the 14 genes selected by Hartmann et al.  for predicting the platinum-paclitaxel chemotherapy response, we found SF3A3 was also detected in this study. Moreover, many candidate genes such as GNL2, RRAGC, RFC3, and ENC1 listed in Additional file 1: Table S3 and Table S5 have also been reported as being differentially expressed in a related in vitro study . Obviously, these genes require further validation in order to confirm their chemotherapy response. In addition, genes involved in associated biological processes, such as microtubule polymerization or depolymerization, urogenital system development, were detected using GOEAST. These results could be re-assessed with network modeling, for example, the Boolean Network (BN) and the Probabilistic Boolean Networks (PBN). Further mathematical modeling of this type is an essential step to uncover the underlying mechanism of chemoresistance in these tumors, to provide better prognosis, and ultimately to improve the care of ovarian cancer patients.
The proposed procedure can help reduce confounding for mining useful information; it is powerful but not omnipotent. In fact, the best way to achieve a confident conclusion may rely on rigorous experimental designs, not simply relying on state-of-the-art statistical analysis techniques. In other words, the less confounding factors an experiment has, the more reliable results an experimentalist can generally achieve through a case-control study.
Based on “Identifying genes associated with chemotherapy response in ovarian carcinomas based on DNA copy number and expression profiles”, by Fang-Han Hsu, Erchin Serpedin, Tzu-Hung Hsiao, Alexander JR Bishop, Edward R Dougherty and Yidong Chen which appeared in Genomic Signal Processing and Statistics (GENSIPS), 2011 IEEE International Workshop on. © 2011 IEEE .
The authors would like to thank the members in the Genomic Signal Processing Laboratory, Texas A&M University, College Station, and the members in the Bioinformatics and Biostatistics Core Laboratory, National Taiwan University, Taipei, Taiwan, for helpful discussions. This work was supported by the National Science Foundation under Grant 0915444. AJR Bishop is supported by the NIEHS (K22-ES12264) and a Voelcker Fund Young Investigator Award from the Max and Minnie Tomerlin Voelcker Fund; Y. Chen is supported by NIH/NCI cancer center grant (P30 CA054174-17), NIH/NCRR CTSA grant (1UL1RR025767), and partially supported by a Voelcker Fund Young Investigator Award from the Max and Minnie Tomerlin Voelcker Fund; F-H Hsu was partially supported by the Greehey Children Cancer Research Institute (GC-CRI) Summer Internship Program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
This article has been published as part of BMC Genomics Volume 13 Supplement 6, 2012: Selected articles from the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS) 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S6.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.