A simple and reproducible breast cancer prognostic test
© Marchionni et al.; licensee BioMed Central Ltd. 2013
Received: 18 December 2012
Accepted: 4 May 2013
Published: 17 May 2013
A small number of prognostic and predictive tests based on gene expression are currently offered as reference laboratory tests. In contrast to such success stories, a number of flaws and errors have recently been identified in other genomic-based predictors, and the success rate for developing clinically useful genomic signatures is low. These errors have led to widespread concerns about the protocols for conducting and reporting computational research. As a result, a need has emerged for a template for reproducible development of genomic signatures that incorporates full transparency, data sharing and statistical robustness.
Here we present the first fully reproducible analysis of the data used to train and test MammaPrint, an FDA-cleared prognostic test for breast cancer based on a 70-gene expression signature. We provide all the software and documentation necessary for researchers to build and evaluate genomic classifiers based on these data. As an example of the utility of this reproducible research resource, we develop a simple prognostic classifier that uses only 16 genes from the MammaPrint signature and is equally accurate in predicting 5-year disease-free survival.
Our study provides a prototypic example for reproducible development of computational algorithms for learning prognostic biomarkers in the era of personalized medicine.
Keywords: Reproducible research, Gene expression analysis, Biomarkers, Top scoring pair, Prediction, Genomics, Personalized medicine, Breast cancer, MammaPrint
Currently, a number of molecular-based prognostic and predictive tests for breast cancer are offered as laboratory services for clinical use [1, 2]. Such assays, which include MammaPrint, OncotypeDX, PAM50 Breast Cancer Intrinsic Subtype Classifier, MapQuant Dx and Theros Breast Cancer Index, are implemented by providing multiple gene expression measurements obtained from tissue samples to multivariate classification algorithms. Published evidence on clinical validity and utility for such assays as they are offered to patients is available only for MammaPrint and OncotypeDX; for the remainder of these tests the evidence derives from analyses performed in academic settings.
According to a recent report from the Institute of Medicine (IOM), OncotypeDX was the most widely used among these breast cancer assays, with more than 175,000 patients tested as of mid-2011, followed by MammaPrint, used for 14,000 patients. OncotypeDX combines the expression levels of 21 genes and was developed to predict the risk of distant recurrence at 10 years for women with lymph node negative, estrogen receptor (ER) positive breast cancer. MammaPrint uses 70 genes to report a good or bad prognosis for each patient, and was developed from microarray experiments to predict 5-year metastatic recurrence of breast cancer as a first event among ER positive and negative patients [9, 10]. The MammaPrint algorithm is based on correlating the 70-gene expression profile of a patient with a stored cancer profile in order to determine a risk score for the patient.
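The correlation step described above can be illustrated with a minimal sketch. It is not the MammaPrint implementation: the stored profile, the cutoff value, and the assumption that a higher correlation indicates a good prognosis are all hypothetical choices made only for illustration, and the function names are ours.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_call(profile, stored_profile, cutoff):
    """Return (risk score, label) for one patient profile.

    Assumption: the score is the correlation with the stored profile and
    values above `cutoff` are labelled "good"; the direction and the cutoff
    are hypothetical and are not taken from the original publications.
    """
    r = pearson(profile, stored_profile)
    return r, ("good" if r > cutoff else "poor")


# Toy 4-gene example with made-up values.
stored = [1.0, 2.0, 3.0, 4.0]
score_a, label_a = correlation_call([2.0, 4.0, 6.0, 8.0], stored, cutoff=0.5)
score_b, label_b = correlation_call([4.0, 3.0, 2.0, 1.0], stored, cutoff=0.5)
```

In this toy example the first profile is perfectly correlated with the stored profile and the second is perfectly anti-correlated, so they receive opposite labels.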
A relatively small fraction of published cancer prognostic markers have subsequently been introduced into clinical practice, despite the large number of available studies focusing on biomarker development. A major hurdle hindering the translation of this research into clinically useful assays is the lack of rigorous criteria for reporting and publishing tumor prognostic marker studies. This issue has been addressed by introducing the REMARK guidelines, a set of recommendations for tumor marker prognostic studies that provides the necessary framework for reporting all relevant information about prognostic marker development (i.e. study design, specimen and patient characteristics, analytical and statistical methods). Another key issue in the development of cancer biomarkers is the need for detailed and complete disclosure of all data and software [8, 12, 13]. This need is not specific to the development of predictive signatures from high-throughput molecular data but extends to many other branches of computational medicine and biology [14, 15]. Whereas the guidelines for transparency in genomic data sharing date back a decade to the adoption of the Minimum Information About a Microarray Experiment (MIAME) standards, the recent scandal leading to the decision to cancel three clinical trials based on microarray-based gene expression screening tests has dramatically underscored the need for revised genomics research criteria that extend and/or integrate the REMARK and MIAME guidelines.
Maximizing the level of evidence on the spectrum of reproducibility requires complete, independent replication. As measured by this criterion, neither of the two successful breast cancer assays, MammaPrint and OncotypeDX, provides a paradigmatic example of the way genomic predictors should be developed. In the case of OncotypeDX, the prediction algorithm is described in detail and can be reprogrammed, but the original datasets used for the implementation and validation of the assay were never placed in the public domain. Conversely, in the case of MammaPrint, although the original discovery and validation datasets [3, 19] are available, the pre-processing protocol and prediction algorithm are only partially described.
Thus the entire development, including data and code, is not available for either MammaPrint or OncotypeDX. However, in the case of MammaPrint it is possible to undertake a transparent re-analysis of the data using an alternative approach, since the raw microarray data are available. We therefore focus our efforts here on reproducing the results of MammaPrint. We collect and organize the original MammaPrint discovery and validation data. We also coordinate the associated metadata for these experiments and develop reproducible documents for their analysis. We reproduce and implement the preprocessing described in the original manuscripts. These data represent a resource that can be used by other investigators both to verify the original claims about the MammaPrint signature and to build alternative predictors. As an example of the utility of these data, we use the MammaPrint discovery and validation data to develop an alternative signature and prognostic test for breast cancer, which is based on several two-gene comparisons [20, 21]. This provides a detailed, transparent and fully reproducible example of constructing a multi-gene classifier.
Data assembly and code
We collected the data from the original experiments used to identify and develop the MammaPrint 70-gene prognostic signature, which were provided as additional files with the original manuscripts. We also collected from ArrayExpress the dataset used to retrain this signature on the custom array currently used in the MammaPrint assay, as well as the independent validation cohort using the same array. All of these datasets have been organized in an open resource that can be used to develop and compare prognostic signatures for breast cancer, available at http://luigimarchionni.org/breastTSP.html and through Bioconductor. This resource also encompasses the R code and libraries used to retrieve, pre-process, manipulate, annotate, and analyze these data. The code, fully annotated and executable, is provided in Additional files 1 and 2. All the analyses performed in our study were based on de-identified, publicly available data, and they were performed in compliance with the Helsinki declaration. The research did not involve any experiments on human subjects or animals, and for this reason no ethical approval was necessary.
An example of reproducible signature development
Building the K-TSP classifier
We recorded the relative ordering of each pair of genes in the 70-gene MammaPrint signature in each of the 78 training samples. In other words, for each pair of genes g and g’, and for each sample j, we recorded whether the expression of g in sample j is larger than the expression of g’ in sample j or vice versa. The “signature” for the TSP classifier is the pair of genes that most consistently changes its relative expression ordering between the two groups of patients, and the corresponding decision rule for a new profile is determined entirely by the ordering between these two genes: choose group one if the observed ordering was most often seen in group one, and group two otherwise. Here, the two groups of patients are those who recurred within 5 years (poor prognosis) and those who did not recur (good prognosis). The K-TSP algorithm uses K pairs of genes. It proceeds by first identifying the TSP, removing these two genes from the 70-gene signature, then searching for the pair of genes among the 68 remaining that most often switch their ordering between groups, removing these from the list, and so forth. Individually, each pair of genes “votes” for one of the two groups based on the observed ordering. For a fixed number K of pairs, the final prognostic score is the sum of the votes for the poor prognosis group among all K pairs. The higher the score, the more evidence there is for poor prognosis.
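The greedy selection of disjoint pairs described above can be sketched as follows. This is a minimal illustration in Python rather than the R code distributed with the study; the function names, the toy data, and the tie-breaking rule (the first pair encountered wins) are assumptions introduced here, and published TSP implementations may break ties differently.

```python
from itertools import combinations


def ordering_freq(samples, i, j):
    """Fraction of samples in which gene i is expressed below gene j.

    Assumes `samples` is non-empty.
    """
    return sum(s[i] < s[j] for s in samples) / len(samples)


def train_ktsp(samples, labels, k):
    """Greedily pick up to k disjoint gene pairs whose ordering differs most
    between the two groups.

    samples: list of expression profiles (one list of values per sample)
    labels:  1 for poor prognosis, 0 for good prognosis
    Returns pairs (i, j) oriented so that "gene i < gene j" votes poor.
    """
    poor = [s for s, y in zip(samples, labels) if y == 1]
    good = [s for s, y in zip(samples, labels) if y == 0]
    available = set(range(len(samples[0])))
    pairs = []
    for _ in range(k):
        best = None
        for i, j in combinations(sorted(available), 2):
            delta = ordering_freq(poor, i, j) - ordering_freq(good, i, j)
            # Ties keep the first pair found (a simplification made here).
            if best is None or abs(delta) > abs(best[0]):
                best = (delta, i, j)
        if best is None:  # fewer than two genes left
            break
        delta, i, j = best
        pairs.append((i, j) if delta >= 0 else (j, i))
        available -= {i, j}  # each gene is used in at most one pair
    return pairs


def poor_votes(pairs, sample):
    """Prognostic score: number of pairs voting for poor prognosis."""
    return sum(sample[i] < sample[j] for i, j in pairs)


# Toy data: 4 genes, 3 poor-prognosis and 3 good-prognosis samples.
samples = [[1, 5, 3, 2], [2, 6, 4, 1], [1, 7, 2, 3],
           [5, 1, 2, 3], [6, 2, 1, 4], [7, 1, 3, 2]]
labels = [1, 1, 1, 0, 0, 0]
pairs = train_ktsp(samples, labels, k=2)
```

On the toy data, the first pair selected is genes 0 and 1, whose ordering separates the two groups perfectly; the second pair is then chosen among the two remaining genes only.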
Selecting the number of pairs
Validation of the 8-TSP signature in an independent patient cohort
To evaluate the classifier on a new sample, the relative ordering of each of the K = 8 pairs of genes is determined and the sample is assigned to the poor prognosis group if there are two or more votes for poor prognosis (Figure 3), using the same procedures previously defined in the training set of patients. The 8-TSP signature and the MammaPrint test were then compared in terms of classification performance, using standard measures such as accuracy, sensitivity, specificity, and AUC, and in terms of survival, by Kaplan-Meier and Cox regression analyses.
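The decision rule and the basic performance measures can be sketched as follows. The gene pairs, expression values, and outcomes below are made up for illustration (three pairs instead of eight), the function names are ours, and AUC and the survival analyses are not covered by this sketch.

```python
def classify(pairs, sample, threshold=2):
    """Return 1 (poor prognosis) if at least `threshold` pairs vote poor.

    Each pair (i, j) votes poor when gene i is expressed below gene j.
    """
    votes = sum(sample[i] < sample[j] for i, j in pairs)
    return 1 if votes >= threshold else 0


def performance(y_true, y_pred):
    """Accuracy, sensitivity and specificity, with poor prognosis (1) as the
    positive class. Assumes both classes occur in y_true."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical signature of three pairs over six genes, and four test samples.
toy_pairs = [(0, 1), (2, 3), (4, 5)]
test_samples = [[1, 2, 1, 2, 2, 1],
                [2, 1, 1, 2, 2, 1],
                [1, 2, 2, 1, 1, 2],
                [2, 1, 2, 1, 2, 1]]
observed = [1, 0, 0, 0]  # 1 = recurred within 5 years (made-up outcomes)
predicted = [classify(toy_pairs, s) for s in test_samples]
perf = performance(observed, predicted)
```

Because the rule depends only on vote counts, applying it to a new cohort requires no parameter other than the locked list of pairs and the threshold.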
Results and discussion
We have therefore built a prognostic classifier based on the genes from the MammaPrint signature that is as accurate in predicting 5-year disease-free survival as the MammaPrint prognostic test. Our classifier only requires the measurement of expression for 16 of the 70 genes used in MammaPrint. Moreover, the new test is easy to interpret and is robust with respect to any preprocessing of the expression data that maintains the ordering among expression levels within sample profiles.
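The robustness property can be checked directly: any strictly increasing transformation applied to a whole sample profile leaves every within-sample comparison, and hence every vote, unchanged. The following minimal sketch uses a made-up four-gene profile and two hypothetical pairs; the transformations are examples only.

```python
import math


def votes(pairs, sample):
    """Number of pairs (i, j) for which gene i is expressed below gene j."""
    return sum(sample[i] < sample[j] for i, j in pairs)


profile = [8.0, 2.0, 5.0, 1.0]       # made-up expression values
example_pairs = [(1, 0), (2, 3)]     # hypothetical gene pairs

raw_score = votes(example_pairs, profile)
# Strictly increasing transformations applied within the sample:
log_score = votes(example_pairs, [math.log2(x) for x in profile])
rescaled_score = votes(example_pairs, [3.0 * x + 10.0 for x in profile])
```

All three scores coincide, whereas a classifier based on absolute expression levels or correlations would generally be sensitive to such changes in scale.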
Finally, all design decisions and choices of parameters were based entirely on the training set. There was no “data leakage”: no test data was examined until all aspects of classifier development were “locked up.” These are considered critical steps in developing reproducible and accurate genomic signatures as defined by the IOM report. The two key parameters are K, the number of pairs of genes in the signature, and the score threshold. We only considered values of K between 6 and 10 since these values maximized overall performance, and we only considered thresholds that obtained 100% sensitivity. Under these design constraints, we selected K = 8 since this value maximized specificity at 100% sensitivity (Figure 2). Our final classifier labels a sample as poor prognosis if two or more among the 8 pairs vote for the poor prognosis group (Figure 3).
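The selection of K and of the score threshold under the 100% sensitivity constraint can be sketched as follows. The sketch assumes that training-set vote scores have already been computed for each candidate K; how those scores are obtained is not specified here, the numbers are made up, and the function names are ours. Since a sample is called poor prognosis when its score is at least the threshold, the largest threshold that keeps sensitivity at 100% is the minimum score among poor-prognosis training samples, and specificity cannot be improved by any lower threshold.

```python
def best_threshold_at_full_sensitivity(scores, labels):
    """Largest vote threshold that still flags every poor-prognosis sample,
    together with the specificity it achieves on the same samples."""
    threshold = min(s for s, y in zip(scores, labels) if y == 1)
    good_scores = [s for s, y in zip(scores, labels) if y == 0]
    specificity = sum(s < threshold for s in good_scores) / len(good_scores)
    return threshold, specificity


def select_k(scores_by_k, labels):
    """Pick the K with the highest specificity at 100% sensitivity.

    scores_by_k maps each candidate K to the list of training-sample scores.
    Ties are resolved in favour of the smallest K (an assumption made here).
    """
    best = None
    for k in sorted(scores_by_k):
        threshold, spec = best_threshold_at_full_sensitivity(scores_by_k[k], labels)
        if best is None or spec > best[2]:
            best = (k, threshold, spec)
    return best


# Made-up training scores for five samples (first two are poor prognosis).
train_labels = [1, 1, 0, 0, 0]
train_scores = {6: [1, 3, 0, 1, 2],
                8: [2, 5, 0, 1, 3],
                10: [2, 4, 2, 3, 1]}
chosen_k, chosen_threshold, chosen_spec = select_k(train_scores, train_labels)
```

Only training labels and training scores enter this procedure, which is what keeps the test cohort untouched until the classifier is locked.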
Our 8-TSP signature can be viewed as the combination of multiple coordinated biological processes. Of the 70 genes originally identified in the study by van’t Veer and colleagues, 18 genes had expression values positively associated with good prognosis, while 52 were associated with metastatic recurrence. Four of the K = 8 pairs combine genes positively correlated with good prognosis (RTN4RL1, LGP2, MS4A7, and GSTM3) with genes associated with bad prognosis (OXCT1, HRASLS, Contig40831_RC, and MELK). These pairs represent a coordinated change from good prognosis expression patterns to poor prognosis patterns across multiple gene pairs. The remaining pairs comprise only genes originally associated with a poor prognosis (GPR180, DTL, IGFBP5, SERF1A, GNAZ, RFC4, CDCA7, and UCHL5), suggesting that it is the quantitative level of expression of these genes that is important for predicting prognosis.
It is of note that each individual TSP involved in the final classification scheme can be viewed as a separate molecular switch between the two prognostic groups, possibly also entailing a mechanistic underpinning. Indeed, some of the pairs we have identified appear to have an additional underlying mechanistic biological relationship. For instance, one of the gene pairs, DTL-RFC4, appears to be tightly associated with the regulation of the replication fork and the DNA damage response. DTL and RFC4 physically interact and modulate the activity of the proliferating cell nuclear antigen (PCNA) [32–34], which plays a central role in the coordination of these processes. Similarly, another pair, GPR180-GNAZ, codes for proteins involved in G protein mediated cellular signaling.
Our goal was to provide a transparent example of the manner in which a genomics-based cancer predictor might be developed from training data and evaluated on independent test data with sufficient detail and documentation to allow the full process to be replicated by other researchers. Due to the unavailability of the original data, it was not possible to carry out this process for OncotypeDX, which is presently the most used and validated predictor of this kind. Consequently, we performed a re-analysis of MammaPrint data. To this end, we selected the same samples and end-point originally used for the implementation of this assay, although we are aware that a stratified analysis across ER positive and negative patients would be much more appropriate. In order to illustrate the development process from end to end, including a transparent decision rule, we have introduced a more parsimonious classifier with sensitivity, specificity, and overall accuracy very similar to the 70-gene MammaPrint signature.
Our analysis was performed in complete adherence to the principles of transparent and reproducible research [13, 18], providing all data sources used, and the complete code and software necessary for data preprocessing, analysis and validation. To our knowledge, this is one of the few developments of a genomic signature adhering to these standards, if not the first.
TSP: Top scoring pair
ROC: Receiver operating characteristic curve
AUC: Area under the curve
FDA: Food and Drug Administration
IOM: Institute of Medicine
REMARK: Reporting recommendations for tumour marker prognostic studies
MIAME: Minimum information about a microarray experiment
RTN4RL1: Reticulon 4 receptor-like 1
LGP2: DHX58, DEXH (Asp-Glu-X-His) box polypeptide 58
MS4A7: Membrane-spanning 4-domains, subfamily A, member 7
GSTM3: Glutathione S-transferase mu 3 (brain)
OXCT1: 3-oxoacid CoA transferase 1
MELK: Maternal embryonic leucine zipper kinase
GPR180: G protein-coupled receptor 180
DTL: Denticleless E3 ubiquitin protein ligase homolog (Drosophila)
IGFBP5: Insulin-like growth factor binding protein 5
SERF1A: Small EDRK-rich factor 1A (telomeric)
GNAZ: Guanine nucleotide binding protein (G protein), alpha z polypeptide
RFC4: Replication factor C (activator 1) 4, 37 kDa
CDCA7: Cell division cycle associated 7
UCHL5: Ubiquitin carboxyl-terminal hydrolase L5
PCNA: Proliferating cell nuclear antigen
The authors express gratitude to Antonio C. Wolff for his invaluable comments, and to Annuska M. Glas for the information on the datasets.
This work was supported by the Johns Hopkins Breast Cancer Program through funding from the Safeway Research Foundation, and by the National Institutes of Health (P30 CA006973 to LM, and R01 GM08308 to JTL).
- Marchionni L, Wilson RF, Wolff AC, Marinopoulos S, Parmigiani G, Bass EB, Goodman SN: Systematic review: gene expression profiling assays in early-stage breast cancer. Ann Intern Med. 2008, 148 (5): 358-369. 10.7326/0003-4819-148-5-200803040-00208.
- Paik S: Is gene array testing to be considered routine now? Breast. 2011, 20 (Suppl 3): S87-S91.
- Glas AM, Floore A, Delahaye LJ, Witteveen AT, Pover RC, Bakx N, Lahti-Domenici JS, Bruinsma TJ, Warmoes MO, Bernards R: Converting a breast cancer microarray signature into a high-throughput diagnostic test. BMC Genomics. 2006, 7: 278. 10.1186/1471-2164-7-278.
- Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T: A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med. 2004, 351 (27): 2817-2826. 10.1056/NEJMoa041588.
- Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, Davies S, Fauron C, He X, Hu Z: Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009, 27 (8): 1160-1167. 10.1200/JCO.2008.18.1370.
- Loi S, Haibe-Kains B, Desmedt C, Lallemand F, Tutt AM, Gillet C, Ellis P, Harris A, Bergh J, Foekens JA: Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade. J Clin Oncol. 2007, 25 (10): 1239-1246. 10.1200/JCO.2006.07.1522.
- Ma XJ, Salunga R, Dahiya S, Wang W, Carney E, Durbecq V, Harris A, Goss P, Sotiriou C, Erlander M: A five-gene molecular grade index and HOXB13:IL17BR are complementary prognostic factors in early stage breast cancer. Clin Cancer Res. 2008, 14 (9): 2601-2608. 10.1158/1078-0432.CCR-07-5026.
- IOM (Institute of Medicine): Evolution of translational Omics: lessons learned and the path forward. 2012, Washington, D.C: The National Academy Press.
- van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415 (6871): 530-536. 10.1038/415530a.
- van de Vijver MJ, He YD, van’t Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med. 2002, 347 (25): 1999-2009. 10.1056/NEJMoa021967.
- McShane LM, Altman DG, Sauerbrei W, Taube SE, Gion M, Clark GM: Reporting recommendations for tumor marker prognostic studies (REMARK). J Natl Cancer Inst. 2005, 97 (16): 1180-1184. 10.1093/jnci/dji237.
- Leek JT, Peng RD, Anderson RR: Personalized medicine: keep a way open for tailored treatments. Nature. 2012, 484 (7394): 318.
- Baggerly K: Disclose all data in publications. Nature. 2010, 467 (7314): 401.
- Peng RD: Reproducible research and biostatistics. Biostatistics. 2009, 10 (3): 405-408. 10.1093/biostatistics/kxp014.
- Peng RD, Dominici F, Zeger SL: Reproducible epidemiologic research. Am J Epidemiol. 2006, 163 (9): 783-789. 10.1093/aje/kwj093.
- Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC: Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet. 2001, 29 (4): 365-371. 10.1038/ng1201-365.
- Goozner M: Duke scandal highlights need for genomics research criteria. J Natl Cancer Inst. 2011, 103 (12): 916-917. 10.1093/jnci/djr231.
- Peng RD: Reproducible research in computational science. Science. 2012, 334 (6060): 1226-1227.
- Buyse M, Loi S, van’t Veer L, Viale G, Delorenzi M, Glas AM, d'Assignies MS, Bergh J, Lidereau R, Ellis P: Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J Natl Cancer Inst. 2006, 98 (17): 1183-1192. 10.1093/jnci/djj329.
- Geman D, d'Avignon C, Naiman DQ, Winslow RL: Classifying gene expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol. 2004, 3: Article 19.
- Leek JT: The tspair package for finding top scoring pair classifiers in R. Bioinformatics. 2009, 25 (9): 1203-1204. 10.1093/bioinformatics/btp126.
- Brazma A, Kapushesky M, Parkinson H, Sarkans U, Shojatalab M: Data storage and analysis in ArrayExpress. Methods Enzymol. 2006, 411: 370-386.
- A simple and reproducible breast cancer prognostic test. http://luigimarchionni.org/breastTSP.html.
- Ihaka R, Gentleman R: R: A language for data analysis and graphics. J Comput Graph Stat. 1996, 5: 299-314.
- Tan AC, Naiman DQ, Xu L, Winslow RL, Geman D: Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics. 2005, 21 (20): 3896-3904. 10.1093/bioinformatics/bti631.
- Price ND, Trent J, El-Naggar AK, Cogdell D, Taylor E, Hunt KK, Pollock RE, Hood L, Shmulevich I, Zhang W: Highly accurate two-gene classifier for differentiating gastrointestinal stromal tumors and leiomyosarcomas. Proc Natl Acad Sci U S A. 2007, 104 (9): 3414-3419. 10.1073/pnas.0611373104.
- Weichselbaum RR, Ishwaran H, Yoon T, Nuyten DS, Baker SW, Khodarev N, Su AW, Shaikh AY, Roach P, Kreike B: An interferon-related gene signature for DNA damage resistance is a predictive marker for chemotherapy and radiation for breast cancer. Proc Natl Acad Sci U S A. 2008, 105 (47): 18490-18495. 10.1073/pnas.0809242105.
- Raponi M, Lancet JE, Fan H, Dossey L, Lee G, Gojo I, Feldman EJ, Gotlib J, Morris LE, Greenberg PL: A 2-gene classifier for predicting response to the farnesyltransferase inhibitor tipifarnib in acute myeloid leukemia. Blood. 2008, 111 (5): 2589-2596. 10.1182/blood-2007-09-112730.
- Carro MS, Lim WK, Alvarez MJ, Bollo RJ, Zhao X, Snyder EY, Sulman EP, Anne SL, Doetsch F, Colman H: The transcriptional network for mesenchymal transformation of brain tumours. Nature. 2010, 463 (7279): 318-325. 10.1038/nature08712.
- van Belle G, Fisher LD, Heagerty PJ, Lumley T: Biostatistics: A methodology for the health sciences. 2004, Hoboken, New Jersey: John Wiley and Sons, 2.
- Tian S, Roepman P, Van't Veer LJ, Bernards R, de Snoo F, Glas AM: Biological functions of the genes in the mammaprint breast cancer profile reflect the hallmarks of cancer. Biomark Insights. 2010, 5: 129-138.
- Zhang G, Gibbs E, Kelman Z, O'Donnell M, Hurwitz J: Studies on the interactions between human replication factor C and human proliferating cell nuclear antigen. Proc Natl Acad Sci U S A. 1999, 96 (5): 1869-1874. 10.1073/pnas.96.5.1869.
- Ohta S, Shiomi Y, Sugimoto K, Obuse C, Tsurimoto T: A proteomics approach to identify proliferating cell nuclear antigen (PCNA)-binding proteins in human cell lysates. Identification of the human CHL12/RFCs2-5 complex as a novel PCNA-binding protein. J Biol Chem. 2002, 277 (43): 40362-40367. 10.1074/jbc.M206194200.
- Jascur T, Fotedar R, Greene S, Hotchkiss E, Boland CR: N-methyl-N'-nitro-N-nitrosoguanidine (MNNG) triggers MSH2 and Cdt2 protein-dependent degradation of the cell cycle and mismatch repair (MMR) inhibitor protein p21Waf1/Cip1. J Biol Chem. 2011, 286 (34): 29531-29539. 10.1074/jbc.M111.221341.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.