Getting started from collecting data, this section demonstrates the methods and criteria we applied for achieving the results. First of all, we downloaded the ovarian cancer data from the TCGA repository website (http://cancergenome.nih.gov/). Level 3 gene expression data derived from Aymetrix U133A platform, level 2 copy number data derived from Agilent CGH-1x1M platform, and level 3 methylation data derived from Illumina HumanMethylation27 platform were chosen and downloaded. The gene expression data are constituted of 12,042 normalized log2 values, and each value represents an expression level of a gene. Copy number data contain 962,434 normalized log2 ratios among which 358,119 ratios with gene annotations were utilized. To match these copy number ratios with gene expression levels, values corresponding to a same gene were averaged. Methylation data are 27,578 beta values, and each value refers to the percentage of methylation for a specific CpG site. Among the 27,578 beta values, 19,448 linked to 10,068 unique genes included in the gene expression data were considered in this study. In total, a cohort of 514 tumor samples comprehensively containing gene expression, copy number, and methylation profiles was obtained.
The procedure for the reduction of confounding and suppression effects is described in the following subsections. As shown in Figure 1, the first step for using the TCGA data was batch effect correction. Then, a rigorous process for sample selection was proposed for finding two independent patient cohorts who went through the combination of paclitaxel and carboplatin treatment: one for training and one for testing. A semi-supervised clustering approach was further applied for identifying molecular classifications. Based on the identified classifications, further validated through testing patient cohort, differentially expressed genes were detected by comparing samples with good response to those with poor response, and significant ontologies were found using an enrichment analysis.
Batch effect correction
Batch effect correction was applied to all 514 samples. Due to the large sample size, TCGA samples were derived from different institutions, and experiments were performed on different dates; these factors may cause quantitative differences in measurements and might lead to false discovery specifically if they are not properly dealt with [13]. In this study, location and scale (L/S) adjustments [14] were applied to eliminate batch effects in gene expression and methylation data. We did not perform batch correction to aCGH arrays due to the fact that it assumes a comparative hybridization to normal DNA on the same array; hence, it is less prone to batch effects. For gene expression, we assumed that the medians and variances of measures from different batches should be ideally the same. Let g
i
represents the vector of gene expression values for gene i from all 514 samples, g
ij
denotes the vector of gene expression values for gene i from samples corresponding to batch j, and g
ijk
represents the gene expression value for gene i from batch j and sample k. The adjusted gene expression value for gene i from batch j and sample k is given by
(1)
where M
i
refers to the median of g
i
, and M
ij
denotes the median of g
ij
. Also, and are the estimated standard deviations of g
i
and g
ij
, respectively, and whose expressions are given by
(2)
(3)
where N refers to the total number of samples, N
j
denotes the number of samples in batch j, and K
j
stands for the samples in batch j. The adjustment of methylation data is somewhat different since beta values have a bounded range (from 0 to 1) and exhibit a slight bimodal distribution [15]. Again, let b
i
and b
ij
represent two vectors of beta values for gene i across all 514 samples and batch j, respectively. To enforce beta values from different batches to have the same median, the methylation beta value b
ijk
for gene i from batch j and sample k is rescaled, and the adjusted value is given by
(4)
where and refer to the medians of b
i
and b
ij
, respectively.
Sample selection
After correcting the batch effect, a rigorous sample selection process was applied for finding the eligible samples for a targeted chemotherapy response. The ultimate goal of this step is to derive as many eligible samples as possible with the least possible confounding effects for both training and testing. To meet this requirement, clinical data were carefully examined. Since the criterion used for distinguishing a good response from a poor response was the progression free survival (PFS) time, which is defined in the next subsection, factors related to prognosis, such as tumor stages, drugs, and treatment time, were considered. As shown in Figure 2, we started with 514 samples profiled with all three techniques: gene expression, copy number, and methylation. Then, we considered samples only from advanced tumor stages (i.e., stage III and stage IV) and restricted samples to be quintessentially treated with paclitaxel and carboplatin: the treatment had to be started within 30 days after surgical resection and to last for at least 4 cycles. As a result, there were 168 samples which met this treatment requirement. A subset of 85 samples never treated with any other drugs before a failure event (tumor progression, tumor recurrence, death, or censored observation) were selected for training. Another subset of 83 samples treated with paclitaxel and carboplatin but involved in other treatment were selected for testing. The training/testing datasets were not chosen randomly since the 83 samples treated with other drugs may mislead the discovery of unique molecular patterns confounded by treatment effect. Nevertheless, these 83 samples are valuable for providing supporting evidence and justification if patterns or results found in training can also be observed in testing.
Semi-supervised clustering
Semi-supervised clustering has been proposed and applied for identifying molecular subtypes in a variety of studies [16]. It outperforms unsupervised learning and naive splitting with an arbitrary survival threshold in detecting molecular subtypes since both the clinical outcome and the distribution overlapping between good prognosis and poor prognosis are considered [17]. Herein, a carefully designed semi-supervised clustering method is proposed. By using a log-rank survival test and a series of refinement processes, in which the gene expression data were embedded, significant features related to patients' survival were selected. With an additional scoring function for data discretization, classifications with similar patterns but distinct chemotherapy responses can be identified using classical hierarchical clustering. In this study, we focused on obtaining genetically altered genes, rather than transcriptionally dysregulated genes. However, we do require genes with copy number changes with concordant gene expression change, potentially driving the phenotypic change. Following this objective, the semi-supervised clustering was applied to both the copy number data and the methylation data, and the methods from feature selection to hierarchical clustering for both data sets were basically the same. We only detail the clustering steps for copy number data. The modifications for methylation data are demonstrated later.
Feature selection
To reduce suppression effects due to the genomic (or epigenomic) differences and maximize the ability of discriminating poor prognosis from good prognosis, a rigorous feature selection process was applied, in which, factors such as patient ages and tumor stages were carefully considered to minimize confounding effects.
A feature in copy number data refers to a gene in human genome, and the feature selection was done basically by a univariate log-rank test, which is a non-parametric hypothesis test comparing the survival functions of two groups of samples, and a series of feature refinements. First of all, we define PFS as the interval between the date of surgical resection and the date when a failure event occurred (i.e., disease progression). For samples without progression, recurrence, or death, the dates of last follow-up were considered as PFS with censoring. Second, based on the copy number ratio c
ik
for gene i from sample k, copy number statuses were categorized into gain (c
ik
> 0.4), normal (-0.5 ≤ c
ik
≤ 0.4), and loss (c
ik
< -0.5). Like a scanning process, the survival functions of the samples with normal copy number were tested against those with copy number gain (normal vs gain) as well as those with copy number loss (normal vs loss), feature by feature. To ensure the confidence in hypothesis testing, features leading to a small sample size (i.e., less than 15%, or less than 13) in either group were not considered, and features resulting in log-rank p-values less than 0.05 were then selected as candidates. After this, confounding factors such as ages and tumor stages were considered for feature refinement. Specifically, if the median of ages in both groups of samples were significantly different (p-values less than 0.05) in a Wilcoxon rank-sum detection test, or, if these two groups of samples' tumor stage assignment are distribute significantly differently (p-values less than 0.05 by Fisher's Exact test), then the selected features were removed from the candidates to eliminate possible detrimental effects from confounding factors. Moreover, the candidates were further refined using gene expression fold-changes. Similar to the scanning process used for survival comparison, the averaged gene expression levels derived from both groups of samples were compared, and the candidates causing expression fold-changes larger than 1.3 (or smaller than 1/1.3) were retained while those without significant expression fold-changes were disqualified for further consideration. We applied a relatively loose cutoff here only to preserve the concordance of gene expression alteration to the copy number change.
Data discretization
After feature selection, a scoring function was applied to the copy number data in the selected genes. Instead of using the original copy number ratios, which may be noisy and contain unrelated information, the hierarchical clustering scheme was applied to the data, discretized using a scoring function, called F-score, which transfers the grouping information and the survival conditions into discrete numbers, 1 (favorable), 0 (unknown), and -1 (unfavorable). Specifically, for a feature i selected from testing the gains against the normal, the F-score F
ik
for feature i and sample k on copy number data is given by
(5)
where and refer to the samples with longer PFS survival and the samples with shorter PFS survival, respectively, determined by the log-rank test while evaluating the relation between PFS and CNAs for feature i (refer to Subsection Feature selection). Additionally, c
ik
= NaN refers to missing copy number ratios in feature i of sample k. Similarly, for a feature i selected from testing copy number losses against the normal, the F-score F
ik
for feature i and sample k on copy number data is given by
(6)
The scores were assigned on a feature-by-feature basis. If a feature was selected twice in both cases (i.e., the gains versus the normal and the losses versus the normal), the F-scores were assigned according to the selection with smaller log-rank p-value.
Identification of molecular classifications
Based on the selected features and discretized copy number data, hierarchical clustering was applied for identifying the molecular classifications associated with chemotherapy response. The distances d
kk'
between samples k and k' were evaluated using the Jaccard coffiecient, which is given by
(7)
where F
k
indicates the vector of F-scores in sample k, F
k'
refers to the vector of F-scores in sample k', and the length of these vectors are equal to the number of selected features. With all pairs of sample-sample distance being evaluated using Eq. (7), complete linkage was applied, and subsequent clusters derived. We used Kaplan-Meier analysis to check if these clusters exhibited significantly different survival distributions, and proposed molecular classifications if the log-rank p-value was less than 0.05.
Justification of selected features and identified classifications
The weighted K-Nearest Neighbor algorithm (weighted K NN) was applied to the testing datasets to justify the proposed classifications with the selected features. First of all, the testing samples were discretized in the F-scores using the same features and criterion mentioned in Subsection Data discretization. Then, the cluster of samples in the training datasets enriched with unfavorable scores were labeled as poor profiles (l = -1) while the other cluster of samples were labeled as good profiles (l = 1). Using the Jaccard coffiecients as sample-sample distances, the discriminant function P
f
for an independent sample f in the testing dataset was considered
(8)
where l
h
stands for the corresponding label of the sample h, and d
fh
denotes the distance between the sample f in the testing dataset and its h-th nearest sample in the training dataset, which is given by
(9)
where F
f
indicates the array of F-scores in sample f , and F
h
refers to the array of F-scores in sample h. The maximum function used in Eq. (9) is to ensure no zero in denominator in Eq. (8). The sample f in the testing dataset was classified as a good profile if P
f
> 0 or as a poor profile otherwise. In this study, K was chosen to be 3. Samples classified as good profiles as well as those samples classified as poor profiles were compared using the Kaplan-Meier survival analysis.
Modifications for methylation data
The proposed semi-supervised clustering was also applied to the methylation data. In order to fit the method, the batch-adjusted beta values for gene i from sample k were categorized into three statuses: hypermethylated , normal methylated , and hypomethylated . Using the same criteria for log-rank testing, features (i.e., CpG sites) resulting in p-values less than 0.05 upon hypermethylated versus non-hypermethylated, or hypomethylated versus non-hypomethylated, were selected as candidates. Because of the continuous nature of beta values, the candidate refinement was moderately adjusted: if the beta values for a candidate feature exhibit a large correlation with the sample ages (Spearman correlation > 0.5), or if the beta values for the candidate feature derived from the stage III samples and the stage IV samples were significantly different in median by Wilcoxon rank-sum test, this candidate was removed from further consideration. Except for the aforementioned differences, other steps for identifying molecular classifications such as the candidate refinement using expression fold-change (larger than 1.3 or smaller than 1/1.3), F-scoring, and the hierarchical clustering (with the Jaccard coffiecient and complete linkage) remain the same. To illustrate the F-score for feature i and sample k on methylation data more clearly, the following relation is provided. For a feature i, found PFS-related after feature refinement, the F-score for sample k is given by
(10)
where and refer to the samples with longer survival and the samples with shorter survival, respectively, determined by the log-rank test while evaluating the relation between PFS and hypermethylation or hypomethylation in feature i in the feature selection for methylation data. Also, refers to missing beta values in feature i of sample k. After data discretization, hierarchical clustering was applied to identify molecular classifications associated with chemotherapy response. Moreover, the testing datasets were also classified using the weighted K NN algorithm for classification justification.
Identification of significant genes and ontologies
Once the molecular classifications associated with chemotherapy response were identified, differentially expressed genes in comparing the poor copy number profiles and the good profiles, or in comparing the poor methylation profiles and the good profiles, were detected using classical statistical approaches and methods for gene expression analysis. The good profiles mentioned herein study refer to the samples neither classified as poor copy number profiles nor classified as poor methylation profiles. Since confounding and suppression effects were reduced by the proposed procedure, more strict thresholds were simultaneously applied: fold change larger than 1.5 and t-test p-values less than 0.01. After deriving the significant gene set, enriched gene ontology terms (biological processes) were identified using the Gene Ontology Enrichment Analysis Software Toolkit (GOEAST) [18] available online at: http://omicslab.genetics.ac.cn/GOEAST/.