Volume 13 Supplement 6
A Steiner tree-based method for biomarker discovery and classification in breast cancer metastasis
© Jahid and Ruan; licensee BioMed Central Ltd. 2012
Published: 26 October 2012
Metastatic breast cancer is a leading cause of cancer-related deaths in women worldwide. DNA microarrays have become an important tool for identifying biomarker genes that improve the prognosis of breast cancer. Recently, it was shown that pathway-level relationships between genes can be incorporated to build more robust classification models and to obtain more useful biological insight from such models. Because complete pathways are unavailable, protein-protein interaction (PPI) networks are becoming more popular among researchers and open a new way to investigate the developmental process of breast cancer.
In this study, a network-based method is proposed that combines microarray gene expression profiles and a PPI network for biomarker discovery in breast cancer metastasis. The key idea of our approach is to identify a small number of genes that connect differentially expressed genes into a single component in a PPI network; these intermediate genes contain important information about the pathways involved in metastasis and have a high probability of being biomarkers.
We applied this approach to two breast cancer microarray datasets, and in both cases we identified significant numbers of well-known biomarker genes for breast cancer metastasis. The selected genes are significantly enriched in biological processes and pathways related to carcinogenesis and, importantly, have much higher stability across different datasets than in previous studies. Furthermore, our selected genes significantly increased the cross-dataset classification accuracy of breast cancer metastasis.
The randomized Steiner tree-based approach described in this study is a new way to discover biomarker genes for breast cancer, and it improves the prediction accuracy of metastasis. Although the analysis here is limited to breast cancer, it can be easily applied to other diseases.
The identification of marker genes involved in cancer is a central problem in systems biology. Many studies have used gene expression data for marker identification in breast cancer and other diseases [1, 2]. However, noisy data, small sample sizes, and heterogeneous experimental platforms make the marker selection procedure difficult and dataset-specific. As a result, different studies on the same disease often have very few gene markers in common. For example, two studies [3, 4] identified 70 and 76 gene markers for breast cancer, which were later validated by two other studies [5, 6], but the two marker sets have only three genes in common.
To improve the stability of marker selection, other complementary genomic information such as pathways has been used [7–9]. The problem with pathway-based approaches, however, is that the majority of human genes are not assigned to a specific pathway; therefore there is a strong possibility that a true marker is left out of consideration because it is not assigned to any pathway. To circumvent this problem, Chuang et al. [10] proposed to incorporate protein-protein interaction (PPI) networks for discovering small subnetworks, which may represent novel pathways, as potential markers. They found that such subnetwork-based markers can both improve classification accuracy and increase cross-dataset stability. Other studies have attempted to use gene co-expression networks or hybrid networks developed from various sources instead of PPI networks [11, 12]. Recently, several studies have also paid close attention to the association between PPI network topology and disease. For example, Taylor et al. [13] found that inter-modular hubs are more strongly associated with breast cancer than intra-modular hubs; Dezso et al. [14] used pair-wise shortest paths between differentially expressed genes to identify candidate markers; and Su et al. [15] used a probabilistic activity inference method to identify diagnostic subnetworks.
In this study we propose a network topology-based approach to identify candidate biomarker genes, motivated by the key observation that disease genes play a role in connecting differentially expressed (DE) genes in PPI networks. For example, the breast cancer biomarkers P53 and KRAS are not differentially expressed in metastatic breast cancer, but they connect many DE genes in the human PPI network and play a central role in the carcinogenic process. The main idea of our approach is to find a small number of genes that can connect the DE genes into a single connected component in a PPI network. This maps to the well-known Steiner tree problem in graph theory, which we solve using a heuristic algorithm. In addition, we combine multiple suboptimal Steiner trees to increase the chance of finding the optimal solution and to capture alternative pathways. Applying our approach to two breast cancer datasets, we found that the candidate markers selected by our method are highly enriched in pathways that are well known to be dysregulated in breast cancer metastasis, and cover a significant number of known breast cancer susceptibility genes. Remarkably, the markers identified from multiple datasets have much higher reproducibility than in previous studies, and significantly increase the cross-dataset classification accuracy.
Datasets and PPI networks
In this study we used two microarray datasets, hereafter referred to as the van de Vijver and Wang datasets [4, 5]. The two datasets contain 295 and 286 breast cancer patients, of whom 78 and 106, respectively, developed distant metastasis within five years of follow-up. The microarray platform was Agilent Hu25K for the van de Vijver dataset and Affymetrix HG-U133a for the Wang dataset. The first dataset was downloaded from the Netherlands Cancer Institute website (http://bioinformatics.nki.nl/index.php), while the other was obtained from GEO under accession number GSE2034. SAM (Significance Analysis of Microarrays) was used to select genes that are significantly differentially expressed between metastatic and non-metastatic tumors (DE genes). We adjusted the delta parameter in SAM to select a similar number of DE genes from each dataset. As a result, a total of 333 and 319 DE genes were selected for the van de Vijver and Wang datasets, corresponding to FDRs of 0.7% and 8.2%, respectively. Varying the number of DE genes between 200 and 1000 only slightly changed the percentage of overlap, while the significance of the overlap was essentially unaffected (data not shown).
Two human protein-protein interaction networks were used in this study. The first network was obtained from Protein Interaction Network Analysis (PINA) and contains 10,920 genes and 61,746 binary connections. The second network was compiled by Chuang et al. [10] from six different sources, and contains 57,235 interactions among 11,203 genes. In this study we considered only the largest connected component of each PPI network, which contains 10,794 genes and 56,864 connections for the Chuang PPI network and 10,770 genes and 61,658 connections for the PINA PPI network.
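Restricting a network to its largest connected component can be sketched with a breadth-first search. This is a minimal illustration, not the authors' code: the function name `largest_component`, the dict-of-sets graph representation, and the toy gene names are all assumptions made for the example.

```python
from collections import deque


def largest_component(graph):
    """Return the subgraph induced by the largest connected component.

    `graph` is assumed to be an undirected network stored as
    {vertex: set(neighbours)}.
    """
    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        # Breadth-first search collects every vertex reachable from `start`.
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    # Neighbours of a vertex in a component always lie in the same component.
    return {u: set(graph[u]) for u in best}


# Hypothetical toy network: one component of three genes, one of two.
toy_ppi = {
    "g1": {"g2"}, "g2": {"g1", "g3"}, "g3": {"g2"},
    "g4": {"g5"}, "g5": {"g4"},
}
core = largest_component(toy_ppi)
print(sorted(core))
```

Working on a single connected component matters for the next step, because a tree spanning all DE genes exists only if they can all reach one another in the network.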
Randomized Steiner tree approach
Steiner tree algorithm
Input: Weighted PPI network, G = (V,E,w); DE genes, R
Start with a forest comprising the DE genes, R, but no edges
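While the forest is not a tree do: connect the two disconnected vertices u, v in the forest that are at the shortest distance from each other, and add the vertices on the path between them to the forest
Build a minimum spanning tree (T) from the subgraph of G induced by the vertices in the forest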
Delete any leaf node in T that is not in R
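The steps above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the dict-of-dicts graph format, the function names (`build_graph`, `dijkstra`, `steiner_tree`, `intermediate_vertices`) and the toy network are hypothetical, and the closest pair of components is found by brute force, which would be slow on a full PPI network.

```python
import heapq


def build_graph(edges):
    """Undirected weighted graph as a dict of dicts: graph[u][v] = weight."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def dijkstra(graph, sources):
    """Multi-source Dijkstra: distance and predecessor of each reachable vertex."""
    dist = {s: 0.0 for s in sources}
    parent = {s: None for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, parent


def steiner_tree(graph, terminals):
    """Heuristic Steiner tree, returned as {vertex: set(tree neighbours)}."""
    # Step 1: a forest of isolated DE genes; comp maps vertex -> component id.
    comp = {t: i for i, t in enumerate(terminals)}
    # Step 2: join the two closest components by a shortest path until one is left.
    while len(set(comp.values())) > 1:
        best = None  # (distance, path from the other component back to this one)
        for cid in set(comp.values()):
            members = [v for v, c in comp.items() if c == cid]
            dist, parent = dijkstra(graph, members)
            for v, c in comp.items():
                if c != cid and v in dist and (best is None or dist[v] < best[0]):
                    path = [v]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    best = (dist[v], path)
        if best is None:
            raise ValueError("terminals do not lie in one connected component")
        path = best[1]
        old, new = comp[path[0]], comp[path[-1]]
        for v in list(comp):
            if comp[v] == old:
                comp[v] = new
        for v in path:
            comp[v] = new
    # Step 3: minimum spanning tree (Prim) of the subgraph induced by the forest.
    nodes, start = set(comp), terminals[0]
    tree, in_tree = {v: set() for v in nodes}, {start}
    heap = [(w, start, v) for v, w in graph[start].items() if v in nodes]
    heapq.heapify(heap)
    while heap:
        w, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue
        in_tree.add(v)
        tree[u].add(v)
        tree[v].add(u)
        for x, wx in graph[v].items():
            if x in nodes and x not in in_tree:
                heapq.heappush(heap, (wx, v, x))
    # Step 4: repeatedly delete leaves that are not DE genes.
    keep = set(terminals)
    leaves = [v for v in tree if len(tree[v]) == 1 and v not in keep]
    while leaves:
        v = leaves.pop()
        (u,) = tree.pop(v)
        tree[u].discard(v)
        if len(tree[u]) == 1 and u not in keep:
            leaves.append(u)
    return tree


def intermediate_vertices(tree):
    """Non-leaf vertices of the tree: Steiner vertices and internal DE genes."""
    return {v for v, nbrs in tree.items() if len(nbrs) > 1}


# Hypothetical toy network: DE genes A-D, hub H, and a longer detour via X, Y.
ppi = build_graph([("A", "H", 1.0), ("B", "H", 1.0), ("C", "H", 1.0),
                   ("C", "D", 1.0), ("A", "X", 1.0), ("X", "Y", 1.0),
                   ("Y", "B", 1.0)])
tree = steiner_tree(ppi, ["A", "B", "C", "D"])
print(sorted(intermediate_vertices(tree)))
```

In this toy case the tree uses the hub H as its only Steiner vertex, and the DE gene C is internal because it links D to the rest of the tree, so the intermediate vertices are C and H.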
In a Steiner tree, the intermediate vertices (both Steiner vertices and non-leaf DE genes) play important roles in connecting DE genes together. We consider all the intermediate vertices as potential biomarkers (Steiner Tree-based Markers, or STMs) for breast cancer metastasis, as it is known that disease-related genes play an important role in connecting DE genes in PPI networks. For example, for the van de Vijver dataset with the Chuang PPI network, a single Steiner tree uses 136 Steiner vertices to connect the 333 DE genes (Figure 1B-C). Among the DE genes, 264 are leaf nodes (degree = 1 in the tree) and the other 69 are internal nodes (degree > 1 in the tree). As these internal DE genes are important in connecting the remaining DE genes, we combine the Steiner vertices with the internal DE genes as potential biomarkers for breast cancer metastasis. Thus, for this single tree we consider these 136 + 69 = 205 internal genes as potential biomarkers (Figure 1B-C).
Next, we propose a simple strategy to obtain multiple Steiner trees. The motivation is two-fold. First, as the heuristic algorithm does not guarantee optimality, obtaining multiple solutions increases the chance of finding the optimal Steiner vertices. Second, multiple solutions of similar quality may represent alternative or redundant pathways that cannot be covered by a single Steiner tree. To obtain alternative Steiner trees without any modification to our Steiner tree algorithm, we assign each edge in the PPI network a random weight between 0.99 and 1 and run the standard Steiner tree algorithm. These random weights effectively break ties, so that if there are two paths with the same weight in the original network, one of them is chosen at random. This procedure was repeated multiple times with different random weights between 0.99 and 1, until the total number of unique STMs approximately converged. Depending on the PPI network and microarray data, the rate at which new STMs appeared dropped significantly after 200-300 iterations (for example, see Figure 1D). After that, we take the union of all internal nodes (genes) of those trees and consider them as potential biomarkers. As mentioned previously, we call these genes Steiner Tree-based Markers (STMs). For the van de Vijver and Wang datasets, respectively, we obtained 1047 and 1100 STMs with the Chuang PPI network, and 932 and 1135 STMs with the PINA PPI network (see Additional file 1 for the complete gene list).
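The randomization can be illustrated with a short, self-contained sketch. To keep it compact, it swaps in a simpler shortest-path heuristic (growing one tree by repeatedly attaching the nearest remaining DE gene) in place of the forest-merging procedure above; the function names, the fixed number of runs `n_runs`, the seed, and the toy network are all assumptions made for illustration.

```python
import heapq
import random


def dijkstra(graph, sources):
    """Multi-source Dijkstra: distance and predecessor of each reachable vertex."""
    dist = {s: 0.0 for s in sources}
    parent = {s: None for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, parent


def shortest_path_tree(graph, terminals):
    """Simplified heuristic: attach the nearest remaining terminal each round."""
    in_tree, edges = {terminals[0]}, set()
    remaining = set(terminals[1:])
    while remaining:
        dist, parent = dijkstra(graph, in_tree)
        t = min(remaining, key=lambda x: dist[x])
        # Walk back from the terminal to the tree, adding the path edges.
        while parent[t] is not None:
            edges.add(frozenset((t, parent[t])))
            in_tree.add(t)
            t = parent[t]
        remaining -= in_tree
    return edges


def internal_nodes(edges):
    """Vertices with degree > 1 in the tree given as a set of edges."""
    degree = {}
    for edge in edges:
        for v in edge:
            degree[v] = degree.get(v, 0) + 1
    return {v for v, d in degree.items() if d > 1}


def randomized_stms(edge_list, terminals, n_runs=50, seed=0):
    """Union of internal nodes over trees built with random edge weights."""
    rng = random.Random(seed)
    stms, history = set(), []
    for _ in range(n_runs):
        graph = {}
        for u, v in edge_list:
            w = rng.uniform(0.99, 1.0)  # random weight that breaks ties
            graph.setdefault(u, {})[v] = w
            graph.setdefault(v, {})[u] = w
        stms |= internal_nodes(shortest_path_tree(graph, terminals))
        history.append(len(stms))  # running count of unique STMs
    return stms, history


# Hypothetical toy network: two equally short routes (via H1 or H2) and a detour.
edges = [("A", "H1"), ("H1", "B"), ("A", "H2"), ("H2", "B"),
         ("A", "X"), ("X", "Y"), ("Y", "B")]
stms, history = randomized_stms(edges, ["A", "B"])
print(sorted(stms), history[-1])
```

Because every weight lies between 0.99 and 1, a path with fewer edges still always beats a longer one as long as paths stay below 99 edges, so the perturbation only chooses among routes that tie in the unweighted network. In the toy example the detour through X and Y is never selected, while both H1 and H2 end up in the union. The `history` list plays the role of the convergence curve: one would stop once it flattens.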
Statistical test of overlap significance
where C(n, k) is the binomial coefficient.
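A common way to assess the significance of an overlap between two gene lists is the hypergeometric tail probability, P(X ≥ k) = Σ_{i=k}^{min(m,n)} C(m, i) C(N − m, n − i) / C(N, n). The sketch below assumes that form; the symbols N (size of the gene universe), m and n (sizes of the two lists), k (observed overlap), the function name `overlap_pvalue`, and the example numbers are hypothetical and are not taken from the article.

```python
from math import comb


def overlap_pvalue(N, m, n, k):
    """P(overlap >= k) when lists of sizes m and n are drawn at random from N genes."""
    upper = min(m, n)
    # comb() returns 0 when the lower index exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    tail = sum(comb(m, i) * comb(N - m, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Hypothetical example: two lists of 1000 genes from 10000, sharing 300 genes.
p = overlap_pvalue(10000, 1000, 1000, 300)
print(p)
```

Exact integer arithmetic with `math.comb` avoids the rounding problems that factorial-based floating-point formulas run into for large N.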
To evaluate the predictive ability of different features (STMs and DEs), we built logistic regression and support vector machine (SVM) classifiers to distinguish breast cancer patients who developed metastasis within five years of the initial diagnosis from those who did not. We used the implementations in WEKA (version 3.6.3) with default parameter settings. To avoid overfitting and provide a realistic evaluation, we concentrated on cross-dataset classification, where features obtained from one dataset were used to construct classifiers for the other dataset, because DEs were selected using the complete dataset and are very specific to that particular dataset. Classification performance was estimated 100 times using 10-fold cross-validation, in which iteratively one-tenth of the data was used for testing and nine-tenths for training. Performance was measured by AUC (area under the ROC curve).
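For reference, AUC can be computed directly from classifier scores as the fraction of (metastatic, non-metastatic) patient pairs that the classifier ranks correctly, with ties counted as half. The following is a minimal sketch of that rank-based definition, not the WEKA routine used in the study; the function name `auc` and the example scores are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four patients; label 1 = metastasis within five years.
example_auc = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(example_auc)
```

Here three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.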
Stability of STMs
[Table: Stability of STMs]
[Table: Cross-PPI stability of STMs. Recoverable headers: van de Vijver dataset; number of common genes (% of overlap); Chuang PPI vs PINA PPI; PINA PPI vs Chuang PPI]
Functional enrichment and pathway analysis of STMs
STMs correspond to novel biomarkers of cancer
[Table: Breast cancer genes in STMs. Recoverable headers: known breast cancer genes in STMs (van de Vijver dataset); known breast cancer genes in STMs (Wang dataset)]
We also collected 288 breast cancer susceptibility genes from the Genetic Association Database of Disease in DAVID. STMs covered 26.39% and 26.04% of those known breast cancer susceptibility genes for the Wang and van de Vijver datasets, respectively. For both datasets, the genes selected by our method clearly outperformed DE genes (5.55% and 5.28% for the Wang and van de Vijver datasets, respectively). Evaluation with the PINA PPI network gave similar results.
Steiner tree-based markers improve the cross-dataset classification accuracy
[Table: Classification accuracy of STMs. Recoverable headers: Chuang et al. method; Steiner tree-based method; single gene marker; % increase of AUC; van de Vijver dataset]
In this article we proposed a randomized Steiner tree-based approach that integrates a PPI network and gene expression microarray data for biomarker discovery in breast cancer metastasis. The genes selected by our method are significantly enriched in functional categories and pathways that are known to be involved in cancer development. Furthermore, a significant portion of the genes selected by our method are already known breast cancer susceptibility genes. We applied the method to two different breast cancer microarray datasets and two different PPI networks. For all combinations of microarray and PPI datasets, our approach produced similarly significant results. The reproducibility across different datasets also increases significantly at both the gene and pathway levels compared to previous studies. Finally, Steiner tree-based markers significantly increase cross-dataset classification accuracy. Thus the results in this article support the hypothesis that disease-causing genes play a role in connecting differentially expressed genes, and the method opens a new possibility to identify the inner dynamics and biomarkers of breast cancer progression.
PPI: protein-protein interaction
STM: Steiner tree-based marker
AUC: area under ROC curve.
Based on “Identification of biomarkers in breast cancer metastasis by integrating protein-protein interaction network and gene expression data”, by Md Jamiul Jahid and Jianhua Ruan, which appeared in Genomic Signal Processing and Statistics (GENSIPS), 2011 IEEE International Workshop on. © 2011 IEEE.
This research was supported in part by NIH grants SC3GM086305, U54CA113001, P30CA054174 and R01CA152063.
This article has been published as part of BMC Genomics Volume 13 Supplement 6, 2012: Selected articles from the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS) 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S6.
1. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286 (5439): 531-537. 10.1126/science.286.5439.531.
2. Sotiriou C, Pusztai L: Gene-expression signatures in breast cancer. The New England journal of medicine. 2009, 360 (8): 790-800. 10.1056/NEJMra0801289.
3. van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415 (6871): 530-536. 10.1038/415530a.
4. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005, 365 (9460): 671-679. 10.1016/S0140-6736(05)70933-8.
5. van de Vijver MJ, He YD, van't Veer LJ, Dai H, Hart AAM, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van Der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A gene-expression signature as a predictor of survival in breast cancer. The New England journal of medicine. 2002, 347 (25): 1999-2009. 10.1056/NEJMoa021967.
6. Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, d'Assignies MSS, Bergh J, Lidereau R, Ellis P, Harris AL, Klijn JG, Foekens JA, Cardoso F, Piccart MJ, Buyse M, Sotiriou C, TRANSBIG Consortium: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clinical cancer research: an official journal of the American Association for Cancer Research. 2007, 13 (11): 3207-3214. 10.1158/1078-0432.CCR-06-2765.
7. Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E: Using the Gene Ontology for Microarray Data Mining: A Comparison of Methods and Application to Age Effects in Human Prefrontal Cortex. Neurochemical Research. 2004, 29 (6): 1213-1222.
8. Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies. Proceedings of the National Academy of Sciences of the United States of America. 2005, 102 (38): 13544-13549. 10.1073/pnas.0506577102.
9. Wei Z, Li H: A Markov Random Field Model for Network-based Analysis of Genomic Data. Bioinformatics. 2007, 23 (12): 1537-1544. 10.1093/bioinformatics/btm129.
10. Chuang H, Lee E, Liu Y, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007, 3: 140.
11. Ma S, Shi M, Li Y, Yi D, Shia BC: Incorporating gene co-expression network in identification of cancer prognosis markers. BMC Bioinformatics. 2010, 11: 271. 10.1186/1471-2105-11-271.
12. Wu G, Feng X, Stein L: A human functional protein interaction network and its application to cancer data analysis. Genome biology. 2010, 11: R53. 10.1186/gb-2010-11-5-r53.
13. Taylor IW, Linding R, Warde-Farley D, Liu Y, Pesquita C, Faria D, Bull S, Pawson T, Morris Q, Wrana JL: Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nature Biotechnology. 2009, 27 (2): 199-204. 10.1038/nbt.1522.
14. Dezso Z, Nikolsky Y, Nikolskaya T, Miller J, Cherba D, Webb C, Bugrim A: Identifying disease-specific genes based on their topological significance in protein networks. BMC systems biology. 2009, 3: 36. 10.1186/1752-0509-3-36.
15. Su J, Yoon BJ, Dougherty E: Identification of diagnostic subnetwork markers for cancer in human protein-protein interaction network. BMC Bioinformatics. 2010, 11 (Suppl 6): S8. 10.1186/1471-2105-11-S6-S8.
16. Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research. 2002, 30: 207-210. 10.1093/nar/30.1.207.
17. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98 (9): 5116-21. 10.1073/pnas.091062498.
18. Voß S: Steiner's problem in graphs: heuristic methods. Discrete Applied Mathematics. 1992, 40: 45-72. 10.1016/0166-218X(92)90021-2.
19. Rayward-Smith VJ: The computation of nearly minimal Steiner trees in graphs. Internat J Math Ed Sci Tech. 1983, 14: 15-23. 10.1080/0020739830140103.
20. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. SIGKDD Explor Newsl. 2009, 11: 10-18. 10.1145/1656274.1656278.
21. Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protocols. 2008, 4 (1): 44-57. 10.1038/nprot.2008.211.
22. Hwang T, Tian Z, Kuang R, Kocher JP: Learning on Weighted Hypergraphs to Integrate Protein Interactions and Gene Expressions for Cancer Outcome Prediction. ICDM '08 Eighth IEEE International Conference on Data Mining. 2008, 293-302.
23. Yao C, Li H, Zhou C, Zhang L, Zou J, Guo Z: Multi-level reproducibility of signature hubs in human interactome for breast cancer metastasis. BMC systems biology. 2010, 4: 151. 10.1186/1752-0509-4-151.
24. Genkin A, Lewis DD, Madigan D: Large-Scale Bayesian Logistic Regression for Text Categorization. Technometrics. 2007, 49 (3): 291-304. 10.1198/004017007000000245.
25. Jahid MJ, Ruan J: Identification of biomarkers in breast cancer metastasis by integrating protein-protein interaction network and gene expression data. Genomic Signal Processing and Statistics (GENSIPS), 2011 IEEE International Workshop on: 4-6 December 2011. 2011, 60-63. 10.1109/GENSiPS.2011.6169443.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.