Integrated analysis of independent gene expression microarray datasets improves the predictability of breast cancer outcome
© Zhang et al; licensee BioMed Central Ltd. 2007
Received: 08 December 2006
Accepted: 20 September 2007
Published: 20 September 2007
Gene expression profiles based on microarray data have been suggested by many studies as potential molecular prognostic indexes of breast cancer. However, due to the confounding effect of clinical background, independent studies often obtained inconsistent results. The current study investigated whether the quality and generality of expression profiles could be improved by integrated analysis of multiple datasets. Profiles of recurrence outcome were derived from two independent datasets and validated with a third dataset.
The clinical background of patients significantly influenced the content and performance of expression profiles when the training samples were unbalanced. The integrated profiling of two independent datasets led to higher classification accuracy (71.11% vs. 70.59%) and larger ROC curve area (0.789 vs. 0.767) on the testing samples. Cell cycle, especially M phase mitosis, was significantly overrepresented in the 60-gene profile obtained from integrated analysis (p < 0.0001). This profile significantly differentiated poor and good prognosis in a third patient cohort (p = 0.003). Simulation procedures demonstrated that changes in profile specificity had a more immediate influence on the performance of expression profiles than changes in profile sensitivity.
The current study confirmed that the gene expression profile generated by integrated analysis of multiple datasets achieved better prediction of breast cancer recurrence. However, the content and performance of profiles were confounded by the clinical background of training patients. In future studies, prognostic profiles applicable to the general population should be derived from more diversified and balanced patient cohorts on a larger scale.
Breast cancer involves a series of genomic disorders, making it a suitable subject of microarray experiments. Mapping microarray-based gene expression profiles to clinical phenotypes has been proposed as a way to improve cancer diagnosis and prognosis. A number of such profiles, which are able to distinguish cell lines, normal and tumor tissues, adjacent tumors, and tumor subtypes [6, 7], have been presented. Expression profiles of cancer endpoints are more valuable in clinical practice. From a microarray dataset of 78 breast cancer patients, van 't Veer et al identified a 70-gene profile that correctly classified the 5-year recurrence of 65 (83%) patients. This profile was further shown to be superior to currently used indexes [8, 9]. Similar profiles were identified by other studies [10–12]. However, these profiles shared little overlap with each other. It was further noticed that highly distinct profiles had similar performance and significant agreement on recurrence prediction [13, 14]. These observations indicate that the expression profiling of cancer prognosis is more complicated than simply identifying a list of differentially expressed genes from a single dataset.
Despite the prospective benefits, key issues related to expression profiling of cancer prognosis remain in question. First, it has to be presumed that the classification of patients into prognostic groups properly reflects the inherent difference between their gene expression patterns. Studies usually dichotomize breast cancer patients according to the clinically used 5-year prognosis [8, 10]. However, this convention is established by usage rather than based upon intrinsic biological difference between tumor cells, and may reduce the statistical power of subsequent analyses. Retsky et al. discovered that the recurrence of breast cancer has a two-peak distribution independent of tumor size, number of positive nodes, and menopause status. Computer simulation of tumor progression suggested that two different models of secondary tumor growth were responsible for this distribution [15, 16]. The 18-month peak was the consequence of accelerated secondary growth stimulated by mastectomy, while patients in the 60-month peak went through steady stochastic transitions of tumor progression phases.
Another issue is the influence of clinical confounders, such as ER and lymph node status. Gruvberger et al noticed that 165 of the 231 genes top-ranked in the van 't Veer paper were also significantly correlated to the ER status of patients [17, 18]. It was then suggested that expression profiling should be carried out for ER-positive and ER-negative patients separately. An expression profile derived from one patient cohort might not be applicable to other cohorts having a dissimilar clinical background. Removing or reducing the confounding effect will improve the robustness of expression profiles. Nevertheless, the suggestion of Gruvberger et al may not be a practical solution because there are many known and unknown confounders intervening in the correlation between gene expression level and breast cancer recurrence.
Furthermore, compared with the large number of genes (variables) measured by microarray experiments, sample sizes are usually too small to give enough statistical power. Consequently, gene expression profiles unavoidably include false positives due to 'multiple hypothesis testing', while many differentially expressed genes will not be identified due to lack of statistical power. A question worthy of more discussion is how sensitivity and specificity should be optimally balanced in expression profiles.
Integrated analysis of multiple independent microarray datasets has drawn considerable interest recently [20–22]. Not only will this strategy increase the overall statistical power of expression profiling, but it can also reduce the influence of confounders by including diversified samples. Genes directly and consistently, but not obviously, correlated to disease outcome will be preferred by integrated analysis. A basic assumption of integrated analysis is that independently generated datasets may share common information despite systematic variations between experiments. Ghosh et al investigated the consistency of four independent microarray datasets from prostate cancer. Meta-analysis of those datasets concluded that their gene expression profiles are significantly similar to each other. Rhodes et al compared the expression profiles of normal and tumor cells on a larger scale using 21 datasets from 12 tissue types. The 67 genes consistently correlated to the normal-tumor phenotypes across datasets were proposed as a generic expression profile of neoplastic transformation.
The aim of this study is to improve the expression profiling of breast cancer recurrence by integrating independent datasets. Breast cancer patients were classified according to the Retsky recurrence model. Expression profiles derived from two individual datasets and from their integration were objectively compared by random re-sampling and cross-validation. It was demonstrated that the expression profiles had higher specificity after the datasets were integrated. Furthermore, the resultant expression profiles were validated with a third dataset.
SEP (Score for Expression Profile) as a prognosis index of breast cancer
Fisher's exact test was used to evaluate the dependence of SEP on major clinical indexes after patients were equally separated into high score and low score groups with the threshold equal to median score. The results showed that the values of SEP were significantly dependent on ER status (positive vs. negative), PR status (positive vs. negative), tumor size (T1 vs. T2) and histological grade (1 vs. 2 vs. 3) with p < 0.001, but not on angioinvasion (positive vs. negative, p = 0.21) or age of patients (<= 40 vs. >40, p = 0.61).
Partial correlation analysis was then applied to remove the confounding effect of ER status. The correlation between recurrence outcome and the residuals obtained from Formula (2) was calculated, and the 127 genes having |r'| > 0.3 were used to recalculate SEP scores. The score distributions of the two prognosis groups are plotted separately in Fig. 1C. Fisher's exact test showed that the modified SEP was not dependent on ER status (p = 0.21), but was still significantly dependent on histological grade (p < 0.001) and tumor size (p = 0.006), and marginally on PR status (p = 0.04).
Analysis of two independent datasets
The current study incorporated permuted re-sampling, training/testing validation, and a stepwise procedure to objectively compare the performance of prognostic expression profiles. The workflow was first applied to the Rosetta and Stanford datasets separately. Patients in each prognosis group of each dataset were randomly re-sampled into training (about two-thirds of all patients) and testing (the remaining patients) subgroups. The expression profiles were generated from the training patients and tested on the testing patients. To avoid sampling bias, patients were repeatedly re-sampled to obtain different combinations of training/testing subgroups, upon which the following analytical steps were repeated. After each re-sampling, the differential expression of each gene between the two prognosis groups was tested by the non-parametric Wilcoxon Rank Sum Test (RST) using the training data, and the resultant Z statistic was used to rank all 5,569 genes. The gene whose Z value had the largest magnitude was ranked the highest. The top-ranked N genes constituted an expression profile. Increasing the value of N would supposedly improve profile sensitivity but reduce specificity at the same time. A stepwise procedure was carried out to find the optimal balance between the specificity and sensitivity of profiles, during which top-ranked genes were added one by one until N = 100. Testing patients were re-scored at each step using Formula (1), with the weight of each gene equal to its Z statistic and the expected value equal to the average expression of that gene in training patients. Testing patients were classified into two groups using the resultant SEP scores (positive vs. negative). The SEP-based classification was matched to actual recurrence outcome to obtain its accuracy. To take advantage of SEP as a continuous variable, scores were also used to build a ROC curve, and the area under the curve (AUC) indicated how well the prognosis groups were differentiated by SEP.
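The gene-ranking and AUC steps of this workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study (which was written in R): the function names are hypothetical, the Z statistic uses the plain normal approximation without a tie correction, and a positive Z is taken to mean higher expression in the first group.

```python
import math

def midranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_z(group_a, group_b):
    """Normal-approximation Z of the rank sum test (assumption: no tie correction)."""
    n1, n2 = len(group_a), len(group_b)
    ranks = midranks(list(group_a) + list(group_b))
    w = sum(ranks[:n1])                       # rank sum of group_a
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (w - mean) / sd

def top_genes(expr_a, expr_b, n):
    """expr_a, expr_b: dict gene -> training values per group. Rank by |Z|, keep top n."""
    z = {g: rank_sum_z(expr_a[g], expr_b[g]) for g in expr_a}
    ranked = sorted(z, key=lambda g: abs(z[g]), reverse=True)
    return [(g, z[g]) for g in ranked[:n]]

def auc(scores_a, scores_b):
    """Probability that a group_a score exceeds a group_b score (ties count 0.5)."""
    wins = sum((a > b) + 0.5 * (a == b) for a in scores_a for b in scores_b)
    return wins / (len(scores_a) * len(scores_b))

# Toy data: two genes, three training patients per prognosis group.
group_a = {"gene1": [1, 2, 3], "gene2": [1, 5, 3]}
group_b = {"gene1": [4, 5, 6], "gene2": [2, 4, 6]}
profile = top_genes(group_a, group_b, 1)
```

In the stepwise procedure described above, `top_genes` would be called with n running from 1 to 100 on each re-sampled training set, and the resulting profile scored on the testing patients.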
Table: Predictability of expression profiles (surviving labels: Tested on Rosetta; Tested on Stanford)
Table: Cross-validation of two independent datasets
Table: Comparison of prognostic indexes with logistic regression models (surviving labels: Lymph node status; Age at diagnosis; Multivariate model (include SEP); Multivariate model (not include SEP))
Table: Focus genes in expression profiles of breast cancer outcome (surviving gene descriptions: BUB1 budding uninhibited by benzimidazoles 1 homolog; low density lipoprotein receptor-related protein 8; PCTAIRE protein kinase 1; signal peptide, CUB domain, EGF-like 2; glutathione S-transferase pi; estrogen receptor 1; GATA binding protein 3; myeloproliferative leukaemia virus oncogene; vav 3 oncogene; beclin 1 (coiled-coil, myosin like BCL2 interacting protein); cell division cycle 25B; B-cell CLL/lymphoma 2; complement factor B; hepsin (transmembrane protease, serine 1))
Among the 60 genes of each profile, the least significant p value of RST was 0.0002 (Rosetta), 0.0014 (Stanford), or 0.00002 (Combined), respectively corresponding to false discovery rates 0.017, 0.097, or 0.006. The improvement achieved by the combined dataset indicated that more statistical power was gained by data integration.
Table: Gene sets enriched in the combined dataset profile (surviving labels: Cell cycle/M phase of mitotic cell cycle; Myosin, heavy polypeptide 10, non-muscle; E2F transcription factor 5, P130 binding; Negative regulation of apoptosis; Enzyme regulator activity)
Validation of profiles with a third dataset
Table: Cox proportional hazards analysis of SEP-based classification (surviving labels: Tested on all 286 patients; Tested on 209 ER+ patients; Hazard ratio (95% CI) for each)
The correlation of genes in both profiles to 3-year prognosis was also validated. In the Veridex dataset, 69 patients developed recurrence within three years and 180 patients remained recurrence-free during a follow-up of three years or longer. The remaining 37 patients were excluded from the following analyses. Differential gene expression between the two prognosis groups was tested by Wilcoxon RST [see Additional file 4]. Of the 51 genes in the combined dataset profile, 17 and 37 had one-sided p values less than 0.01 and 0.1, respectively. All genes except PCTK1 had the expected direction of group difference. In the Rosetta profile, the corresponding numbers were 9 and 28, and 7 genes had a direction of group difference opposite to that expected.
With the SEP threshold equal to the median of all 286 scores, the accuracy, specificity, and sensitivity of SEP-based classification were calculated (Fig. 5). While the overall results were poorer than those in Table 2, the combined profile always outperformed the Rosetta profile. Fisher's exact test rejected the independence of 3-year prognosis from both the combined dataset profile (p = 0.0002, odds ratio = 3.08) and the Rosetta profile (p = 0.0006, odds ratio = 2.82). Notably, based on both profiles, the 50 patients having the highest scores included only four poor prognosis cases (92% specificity), while the expected number was 13.9.
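The expected number quoted for the 50 highest-scoring patients can be reproduced with simple arithmetic. The sketch below assumes, since the text does not state it, that the expectation is the share of poor prognosis cases among the 249 patients with a defined 3-year prognosis (69 poor, 180 good) applied to a group of 50; the function name is hypothetical.

```python
def expected_poor_cases(n_selected, n_poor, n_good):
    """Expected poor prognosis count if selection were unrelated to outcome."""
    return n_selected * n_poor / (n_poor + n_good)

# Veridex counts given in the text: 69 poor and 180 good prognosis patients.
expected = expected_poor_cases(50, 69, 180)   # 50 * 69 / 249
observed_good_fraction = (50 - 4) / 50        # 46 of the top 50 had good prognosis
```

Under this assumption the expectation rounds to 13.9, matching the reported value, and 46 of 50 corresponds to the 92% quoted in the text.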
Sensitivity and specificity of expression profiles
The balance between specificity and sensitivity is a major concern of gene expression profiling. Two simulation procedures were carried out to evaluate how the change of sensitivity or specificity will affect the predictability of profiles.
A clinically valuable expression profile of the general breast cancer population, if it does exist, should meet at least two requirements: it should add prognostic value beyond currently used indexes, and it should be independent of those indexes. This study gave promising, but inconclusive, results on the first requirement. According to Table 3 and the likelihood ratio test, the difference between multivariate models with and without SEP was marginally significant. However, information on important prognostic indexes, especially molecular markers such as HER2/neu and Bcl-2, was unavailable and not included in the models. Larger samples and more complete patient information are needed to draw more decisive conclusions.
This study observed the dependence of expression profiles on clinical indexes, especially ER status (Fig. 1). Such dependence was caused by the confounding effect of those indexes and their unbalanced distribution between patient groups. For instance, among the 78 patients used in Fig. 1, 80% (35/44) of good prognosis patients were ER-positive while the percentage was 62% (21/34) in the poor prognosis group. A partial correlation analysis was performed and successfully removed the effect of ER status from the expression profile, but the confounding effect of other indexes remained. Although the analysis can be applied recursively to remove other indexes, the calculation of residuals by Formula (2) introduces extra variance into the data, and the expression profile obtained from partial correlation analysis failed to achieve better performance on testing patients (data not shown). As a result, this strategy is not recommended by this study unless data from a much larger patient cohort are available. It was also noticed that the 60-gene profiles performed better on ER-positive patients in the Veridex dataset, most likely because the majority (68%) of training patients were ER-positive. Hence, to obtain generally applicable profiles, confounders need to be balanced not only between prognostic subgroups but also within the complete patient cohorts.
In reality, it is difficult for a single study to assemble a large and fully balanced sample because of limited resources, the large number of known and unknown confounders, and their complex interactions. A more practical alternative is to diversify the clinical background and increase the overall sample size by combining multiple patient cohorts from different studies. A potential pitfall of this strategy, however, is whether independently generated datasets can be combined, since the systematic bias between microarray experiments is commonly considered substantial. The current study tested the feasibility of integrated analysis by simply combining two datasets after normalizing the expression measurements within each dataset. Profiles were objectively compared, and the profiles of the combined dataset outperformed those of the individual datasets in most statistical analyses (Fig. 3, 5 and Table 1, 6). Furthermore, subsets of the combined dataset had better agreement on differentially expressed genes (Fig. 4), indicating that higher specificity of profiles was achieved.
Results of this study indicated that high sensitivity of an expression profile may not be necessary: median AUC reached a plateau when N was about 60 (Fig. 2); two mostly different expression profiles performed similarly in cross-validation; and, more convincingly, the artificial reduction of profile sensitivity could be tolerated to a large extent (Fig. 6). These results are consistent with the studies of Fan et al and Ein-Dor et al, which noticed that very different profiles could significantly agree with each other and achieve equally good predictability. This observation can be explained by gene co-expression and the large number of genes correlated, directly or indirectly, to prognosis. Nevertheless, higher sensitivity may improve the robustness of profiles, which needs further investigation in future studies. On the other hand, profile specificity seems to be more critical. According to Fig. 7, the performance of profiles dropped quickly when the ratio of false positives was increased. When the combined dataset profile was validated, about one-third of the genes did not have significant differential expression in Veridex patients, suggesting that the specificity of this profile could be further improved. Furthermore, decreased specificity made the performance of profiles more variable (Fig. 7). For instance, while Ein-Dor et al noticed that there were always low-ranked genes showing quality similar to top-ranked genes, consecutive gene sets often performed differently although they should have very close sensitivity and specificity.
It should be noted that genes indirectly correlated to prognosis are not suited to a profile intended for the general population, because the observed correlation may be very strong in some disease subtypes but weak or even absent in others. The number of such 'false positives' in a profile cannot be simply estimated from a p value or false discovery rate. Instead, the ranking of genes should be derived from diversified patient cohorts, so that genes directly and consistently correlated to disease outcome will have an advantage. One may question the existence of such genes and, as suggested by many researchers, attempt to identify a profile for each disease subtype. However, the conclusion cannot be drawn before large-scale, cross-study screening is performed.
This study applied an atypical classification of breast cancer patients according to their 3-year prognosis. The 5-year classification, however, is commonly applied mainly for convenience and is not based on an intrinsic difference in gene expression patterns between patient groups. Besides being supported by the Retsky model, the 3-year classification may increase the statistical power of differential expression analysis by amplifying the group difference. For instance, in the original Rosetta dataset, 1,418 of 24,481 genes were differentially expressed between the 5-year prognosis groups according to RST. When the 3-year classification was applied, the number of differentially expressed genes increased to 1,759 even though the overall sample size was smaller (82 vs. 97). It was also shown that the expression profiles of 3-year prognosis were robust and successfully distinguished good and poor prognosis patients in a third dataset (Fig. 5).
SEP was demonstrated to be a valuable MPI (molecular prognostic index) despite its simple form. The parameters (gene weights) in Formula (1) are estimated independently of each other, making SEP more robust than many other classifiers such as Linear Discriminant Analysis, and robustness is essential for analysis performed on independent datasets. Unlike the suggestion of Teschendorff et al, the distribution of SEP did tend to be bi-modal, or tri-modal when a confounding effect was present (Fig. 1). Although most analyses of this study dichotomized SEP scores as a conservative strategy, it is possible to apply more quantitative analysis in the future to take advantage of SEP as a continuous variable. For instance, it was demonstrated with Veridex patients that most of the highly scored patients (>90%) had good prognosis. Such high specificity, as suggested by van 't Veer et al, will help good prognosis patients avoid unnecessary radical treatments. However, we noticed that the SEP scores of independent patient cohorts usually have different locations and scales. Consequently, we could only classify Veridex patients according to relative SEP values. Such a limitation of SEP or similar classifiers is presumably caused by technical variations between microarray datasets, especially different array platforms. Without a common reference, the current method will not be able to classify a single testing patient until the platform and protocol of microarray experiments are standardized. To allow direct comparison of SEP between different patient cohorts, we suggest that all data-generating studies on the same topic should include one or more pairs of common reference samples.
The current study strongly advocates the clinical value of microarray data on breast cancer prognosis and the advantage of performing expression profiling across multiple datasets. However, the generality of profiles was diminished by the confounding effect of currently used clinical indexes. A larger number of training patients with more diversified and balanced clinical background should be used by future studies to further pursue this topic.
Two published microarray datasets, Rosetta and Stanford, were used to generate gene expression profiles of breast cancer prognosis. Both datasets provided information about disease outcome and clinical indexes in addition to gene expression measurements. Breast cancer patients were classified into two prognosis groups. Patients who recurred within three years after mastectomy were classified into the poor prognosis group, while those who were followed up for at least three years and remained recurrence-free at the end of follow-up were put into the good prognosis group. Patients who fit neither group were excluded from this study. Consequently, the Rosetta dataset included 51 good prognosis and 31 poor prognosis patients, and the Stanford dataset included 25 and 37 patients respectively. Microarray sequence features were mapped to NCBI Unigene clusters, and redundant clusters were condensed by averaging expression measurements. In total, 5,569 clusters were present in both datasets. Only these genes were examined in this study. Sample/reference ratios were log10-transformed, followed by normalizing each patient to a median of zero and a standard deviation of one. To make the same genes comparable between different datasets, the expression measurements of each gene were also normalized to zero median and unit standard deviation separately in each dataset [see Additional file 1].
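The two-stage normalization described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the sample standard deviation (n - 1 denominator) is an assumption because the text does not say which estimator was used, and the toy ratios are invented for illustration.

```python
import math
import statistics

def standardize(values):
    """Shift to zero median and scale to unit standard deviation."""
    med = statistics.median(values)
    centered = [v - med for v in values]
    sd = statistics.stdev(centered)   # assumption: sample standard deviation
    return [c / sd for c in centered]

def normalize_dataset(ratios):
    """ratios: one list per patient of sample/reference ratios, same gene order."""
    logged = [[math.log10(r) for r in patient] for patient in ratios]
    per_patient = [standardize(p) for p in logged]             # step 1: each patient
    per_gene = [standardize(list(col)) for col in zip(*per_patient)]  # step 2: each gene
    return [list(row) for row in zip(*per_gene)]               # back to patient rows

# Toy dataset: three patients, three genes.
toy = [[1, 10, 100], [10, 1, 1000], [100, 1000, 10]]
normalized = normalize_dataset(toy)
```

Each dataset would be passed through `normalize_dataset` separately before the datasets are combined. Note that the per-gene step slightly disturbs the per-patient normalization applied before it.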
SEP: Score for Expression Profile
A designed variable, Score for Expression Profile (SEP), was defined as a weighted linear sum over the genes of an expression profile. Given an expression profile including N genes, the SEP score of each patient was calculated as:
SEP = \sum_{i=1}^{N} w_i (X_i - E_i)    (1)
In Formula (1), w_i was the weight of the i-th gene, a parameter empirically estimated from training data. When a statistical test was used to evaluate differential gene expression, the resultant test statistic or its transformation, such as a correlation coefficient, Z score, or log10-transformed p value, could be used as the value of w_i. For each gene in the profile, the magnitude of w_i should reflect its relative weight and the sign of w_i should correspond to its direction of differential expression between sample groups. X_i was the expression level of the i-th gene in the patient and E_i was its expected expression level estimated from training data. When patient outcome was dichotomous, E_i was the expression level that had equal probability of being found in either sample group and could be denoted as E(X_i | p+ = p- = 0.5).
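As a concrete reading of Formula (1), the following Python sketch computes an SEP score for one patient. The function names, gene names, and numbers are hypothetical, and the expected level of each gene is taken here as its mean across training patients, which is one of the choices the text mentions.

```python
def expected_levels(training):
    """E_i: mean expression of each gene across training patients (an assumption)."""
    genes = training[0].keys()
    return {g: sum(p[g] for p in training) / len(training) for g in genes}

def sep_score(patient, weights, expected):
    """Formula (1): sum over profile genes of w_i * (X_i - E_i)."""
    return sum(w * (patient[g] - expected[g]) for g, w in weights.items())

# Toy example with two profile genes.
training = [{"geneA": 1.0, "geneB": -1.0}, {"geneA": 0.0, "geneB": 1.0}]
weights = {"geneA": 2.0, "geneB": -1.0}        # e.g. Z statistics from training data
expected = expected_levels(training)           # geneA: 0.5, geneB: 0.0
score = sep_score({"geneA": 1.5, "geneB": 1.0}, weights, expected)
```

Here the score is 2.0 * (1.5 - 0.5) + (-1.0) * (1.0 - 0.0) = 1.0, so the patient would fall into the positive-score group.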
Partial correlation analysis
A partial correlation analysis was used in this study to control for the confounding effect of clinical indexes on gene-outcome correlation. This analysis first removed a confounder from expression measurements by calculating residuals as:
X_{residual} = X - E(X | controlled variable)    (2)
In Formula (2), X was an observed expression measurement and E was the expected value of X given a specific value of the variable to be controlled, such as positive or negative ER status. Patients were classified according to the controlled variable and the E values of each gene were estimated as group means. Subsequently, the partial correlation coefficient (r') of each gene to disease outcome was calculated using the residuals.
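A minimal Python sketch of Formula (2) follows. It is an illustration under assumptions rather than the study's implementation: the function names are hypothetical, r' is computed as the Pearson correlation between the residuals and a 0/1 outcome coding, and only the expression values are residualized because that is all the text describes.

```python
import math

def residualize(values, groups):
    """Formula (2): subtract from each value the mean of its confounder group."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy gene whose level is dominated by ER status (hypothetical values).
expression = [1.0, 3.0, 10.0, 12.0]
er_status = ["neg", "neg", "pos", "pos"]
outcome = [0, 1, 0, 1]                      # assumed coding of the two prognosis groups
residuals = residualize(expression, er_status)
r_raw = pearson(expression, outcome)
r_partial = pearson(residuals, outcome)
```

In this toy case the raw correlation is weak because the ER effect dominates the variance, while the residuals track the outcome exactly.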
Statistical analyses of this study were carried out in the R 2.4.1 computing language and environment. The functions used for analyses were: area of ROC curve: colAUC (package: caTools); logistic regression model: lrm (package: Design); likelihood ratio test: lrtest (package: lrtest); false discovery rate: qvalue (package: qvalue); survival analysis: survfit (package: survival); and Cox proportional hazards analysis: cph (package: Design).
- AUC: Area Under (ROC) Curve
- ROC curve: Receiver Operating Characteristic curve
- RST: Rank Sum Test
- SEP: Score for Expression Profile
We thank Xiaowu Gai and Juan Perin at Children's Hospital of Philadelphia for helpful discussions and technical support. This work is partially funded by the Biomedical Informatics Program from the H. Lee Moffitt Cancer Center & Research Institute.
1. Merikangas KR, Risch N: Genomic priorities and public health. Science. 2003, 302: 599-601. 10.1126/science.1091468.
2. Russo G, Zegar C, Giordano A: Advantages and limitations of microarray technology in human cancer. Oncogene. 2003, 22: 6497-6507. 10.1038/sj.onc.1206865.
3. Kudoh K, Ramanna M, Ravatn R, Elkahloun AG, Bittner ML, Meltzer PS, Trent JM, Dalton WS, Chin KV: Monitoring the expression profiles of doxorubicin-induced and doxorubicin-resistant cancer cells by cDNA microarray. Cancer Res. 2000, 60: 4161-4166.
4. Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, Thorsen T, Quist H, Matese JC, Brown PO, Botstein D, Lønning PE, Børresen-Dale AL: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci USA. 2001, 98: 10869-10874. 10.1073/pnas.191367098.
5. Unger MA, Rishi M, Clemmer VB, Hartman JL, Keiper EA, Greshock JD, Chodosh LA, Liebman MN, Weber BL: Characterization of adjacent breast tumors using oligonucleotide microarrays. Breast Cancer Res. 2001, 3: 336-341. 10.1186/bcr317.
6. Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng A, Johnsen H, Pesich R, Geisler S, Demeter J, Perou CM, Lønning PE, Brown PO, Børresen-Dale AL, Botstein D: Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA. 2003, 100: 8418-8423. 10.1073/pnas.0932692100.
7. Pusztai L, Sotiriou C, Buchholz TA, Meric F, Symmans WF, Esteva FJ, Sahin A, Liu ET, Hortobagi GN: Molecular profiles of invasive mucinous and ductal carcinomas of the breast: a molecular case study. Cancer Genet Cytogenet. 2003, 141: 148-153. 10.1016/S0165-4608(02)00737-9.
8. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415: 530-536. 10.1038/415530a.
9. van de Vijver MJ, He YD, van't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GL, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med. 2002, 347: 1999-2009. 10.1056/NEJMoa021967.
10. Bertucci F, Nasser V, Granjeaud S, Eisinger F, Adelaide J, Tagett R, Loriod B, Giaconia A, Benziane A, Devilard E, Jacquemier J, Viens P, Nguyen C, Birnbaum D, Houlgatte R: Gene expression profiles of poor-prognosis primary breast cancer correlate with survival. Hum Mol Genet. 2002, 11: 863-872. 10.1093/hmg/11.8.863.
11. Ahr A, Karn T, Solbach C, Seiter T, Strebhardt K, Holtrich U, Kaufmann M: Identification of high risk breast-cancer patients by gene expression profiling. Lancet. 2002, 359: 131-132. 10.1016/S0140-6736(02)07337-3.
12. Jenssen TK, Kuo WP, Stokke T, Hovig E: Associations between gene expressions in breast cancer and patient survival. Hum Genet. 2002, 111: 411-420. 10.1007/s00439-002-0804-5.
13. Fan C, Oh DS, Wessels L, Weigelt B, Nuyten DS, Nobel AB, van't Veer LJ, Perou CM: Concordance among gene-expression-based predictors for breast cancer. N Engl J Med. 2006, 355: 560-569. 10.1056/NEJMoa052933.
14. Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set?. Bioinformatics. 2005, 21: 171-178. 10.1093/bioinformatics/bth469.
15. Retsky MW, Demicheli R, Swartzendruber DE, Bame PD, Wardwell RH, Bonadonna G, Speer JF, Valagussa P: Computer simulation of a breast cancer metastasis model. Breast Cancer Res Treat. 1997, 45: 193-202. 10.1023/A:1005849301420.
16. Demicheli R, Retsky MW, Swartzendruber DE, Bonadonna G: Proposal for a new model of breast cancer metastatic development. Ann Oncol. 1997, 8: 1075-1080. 10.1023/A:1008263116022.
17. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Bernards R, Friend SH: Expression profiling predicts outcome in breast cancer. Breast Cancer Res. 2003, 5: 57-58. 10.1186/bcr562.
18. Gruvberger SK, Ringner M, Eden P, Borg A, Ferno M, Peterson C, Meltzer PS: Expression profiling to predict outcome in breast cancer: the influence of sample selection. Breast Cancer Res. 2003, 5: 23-26. 10.1186/bcr548.
19. Sokal RR, Rohlf FJ: Biometry: the principles and practice of statistics in biological research. 1981, San Francisco: W. H. Freeman, 2nd edition.
20. Moreau Y, Aerts S, De Moor B, De Strooper B, Dabrowski M: Comparison and meta-analysis of microarray data: from the bench to the computer desk. Trends Genet. 2003, 19: 570-577. 10.1016/j.tig.2003.08.006.
21. Ghosh D, Barette TR, Rhodes D, Chinnaiyan AM: Statistical issues and methods for meta-analysis of microarray data: a case study in prostate cancer. Funct Integr Genomics. 2003, 3: 180-188. 10.1007/s10142-003-0087-5.
22. Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci USA. 2004, 101: 9309-9314. 10.1073/pnas.0401994101.
23. Troyanskaya OG, Garber ME, Brown PO, Botstein D, Altman RB: Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics. 2002, 18: 1454-1461. 10.1093/bioinformatics/18.11.1454.
24. Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003, 4: P3. 10.1186/gb-2003-4-5-p3.
25. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005, 365: 671-679.
26. Teschendorff AE, Naderi A, Barbosa-Morais NL, Pinder SE, Ellis IO, Aparicio S, Brenton JD, Caldas C: A consensus prognostic gene expression classifier for ER positive breast cancer. Genome Biol. 2006, 7: R101. 10.1186/gb-2006-7-10-r101.
27. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge Ø, Pergamenschikov A, Williams C, Zhu SX, Lønning PE, Børresen-Dale AL, Brown PO, Botstein D: Molecular portraits of human breast tumours. Nature. 2000, 406: 747-752. 10.1038/35021093.
28. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM, DiCuccio M, Edgar R, Federhen S, Helmberg W, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pontius JU, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K, Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2005, 33: D39-45. 10.1093/nar/gki062.
29. R Development Core Team: R: A Language and Environment for Statistical Computing. Vienna. 2007.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.