A Steiner tree-based method for biomarker discovery and classification in breast cancer metastasis
© Jahid and Ruan; licensee BioMed Central Ltd. 2012
Published: 26 October 2012
Metastatic breast cancer is a leading cause of cancer-related deaths in women worldwide. DNA microarrays have become an important tool for identifying biomarker genes to improve breast cancer prognosis. Recently, it was shown that pathway-level relationships between genes can be incorporated to build more robust classification models and to obtain more useful biological insight from such models. Because complete pathways are unavailable, protein-protein interaction (PPI) networks are becoming more popular among researchers and open a new way to investigate the developmental process of breast cancer.
In this study, a network-based method is proposed to combine microarray gene expression profiles and a PPI network for biomarker discovery in breast cancer metastasis. The key idea in our approach is to identify a small number of genes that connect differentially expressed genes into a single component in a PPI network; these intermediate genes contain important information about the pathways involved in metastasis and have a high probability of being biomarkers.
We applied this approach to two breast cancer microarray datasets, and in both cases we identified significant numbers of well-known biomarker genes for breast cancer metastasis. The selected genes are significantly enriched in biological processes and pathways related to carcinogenesis and, importantly, are much more stable across different datasets than those in previous studies. Furthermore, our selected genes significantly increased the cross-data classification accuracy of breast cancer metastasis.
The randomized Steiner tree-based approach described in this study is a new way to discover biomarker genes for breast cancer, and it improves the prediction accuracy of metastasis. Although the analysis here is limited to breast cancer, the approach can easily be applied to other diseases.
The identification of marker genes involved in cancer is a central problem in systems biology. Many studies have used gene expression data for marker identification in breast cancer and other diseases [1, 2]. However, noisy data, small sample sizes, and heterogeneous experimental platforms make the marker selection procedure difficult and dataset-specific. As a result, different studies on the same disease often have very few gene markers in common. For example, two studies [3, 4] identified 70 and 76 gene markers for breast cancer, which were later validated by two other studies [5, 6], but the two marker sets have only three genes in common.
To improve the stability of marker selection, other complementary genomic information such as pathways has been used [7–9]. The problem with pathway-based approaches, however, is that the majority of human genes are not assigned to a specific pathway; therefore there is a strong possibility that a true marker is left out of consideration because it has not been assigned to a pathway. To circumvent this problem, an earlier study proposed incorporating protein-protein interaction (PPI) networks to discover small sub-networks, which may represent novel pathways, as potential markers. That study found that such subnetwork-based markers can both improve classification accuracy and increase cross-dataset stability. Other studies have attempted to use gene co-expression networks or hybrid networks developed from various sources instead of PPI networks [11, 12]. Recently, several studies have also paid much attention to the association between PPI network topology and disease. For example, one study found that inter-modular hubs are more associated with breast cancer than intra-modular hubs; another used pair-wise shortest paths between differentially expressed genes to identify candidate markers; and a third used a probabilistic activity inference method to identify diagnostic subnetworks.
In this study we propose a network topology-based approach to identify candidate biomarker genes, motivated by the key observation that disease genes play a role in connecting differentially expressed (DE) genes in PPI networks. For example, the breast cancer biomarkers P53 and KRAS are not differentially expressed in metastatic breast cancer, but they connect many DE genes in the human PPI network and play a central role in the carcinogenic process. The main idea of our approach is to find a small number of genes that can connect the DE genes into a single connected component in a PPI network. This maps to the well-known Steiner tree problem in graph theory, which we solve using a heuristic algorithm. In addition, we combine multiple suboptimal Steiner trees to increase the chance of finding the optimal solution and to capture alternative pathways. Applying our approach on three breast cancer datasets, we found that the candidate markers selected by our method are highly enriched in pathways that are well known to be dysregulated in breast cancer metastasis, and cover a significant number of known breast cancer susceptibility genes. Remarkably, the markers identified from multiple datasets have much higher reproducibility than in previous studies, and significantly increase the cross-dataset classification accuracy.
In this study we used two microarray datasets, referred to herein as the van de Vijver and Wang datasets [4, 5], respectively. The two datasets contain 295 and 286 breast cancer patients, of whom 78 and 106, respectively, developed distant metastasis within five years of follow-up. The microarray platform was Agilent Hu25K for the van de Vijver dataset and Affymetrix HG-U133a for the Wang dataset. The first dataset was downloaded from the Netherlands Cancer Institute website (http://bioinformatics.nki.nl/index.php), while the other was obtained from GEO under accession number GSE2034. SAM (Significance Analysis of Microarrays) was used to select genes that are significantly differentially expressed between metastatic and non-metastatic tumors (DE genes). We tuned the delta parameter in SAM to select a similar number of DE genes from each dataset. As a result, 333 and 319 DE genes were selected for the van de Vijver and Wang datasets, corresponding to FDRs of 0.7% and 8.2%, respectively. Varying the number of DE genes between 200 and 1000 changed the percentage of overlap only slightly, while the significance of the overlap was essentially unaffected (data not shown).
Two human protein-protein interaction networks were used in this study. The first network was obtained from Protein Interaction Network Analysis (PINA) and contains 10,920 genes and 61,746 binary connections. The second network was compiled by Chuang et al. from six different sources, and contains 57,235 interactions among 11,203 genes. In this study we only considered the largest connected component in each PPI network, which contains 10,794 genes and 56,864 connections for the Chuang PPI network and 10,770 genes and 61,658 connections for the PINA PPI network.
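Restricting a network to its largest connected component can be sketched as follows. This is a minimal illustration using breadth-first search, not the code used in the study; the function name `largest_component` and the toy gene names are assumptions introduced here.

```python
from collections import deque

def largest_component(edges):
    """Return the vertex set of the largest connected component."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:                      # breadth-first search from start
            u = queue.popleft()
            for v in adj[u] - comp:
                comp.add(v)
                queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best

# Toy interactions with hypothetical gene names:
# one 4-gene component and one 2-gene component.
toy_edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g5", "g6")]
keep = largest_component(toy_edges)
kept_edges = [(u, v) for u, v in toy_edges if u in keep and v in keep]
```

Only interactions with both endpoints in the retained component are kept, so the filtered network is connected by construction.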
Input: Weighted PPI network, G = (V, E, w); DE genes, R
Output: Tree, T, that spans R
1. Start with a forest comprising the DE genes, R, but no edges
2. While the forest is not a tree do: connect the two disconnected vertices u, v in the forest that have the shortest distance between them, and add the vertices on the path to the forest
3. Build a minimum spanning tree (T) from the subgraph of G induced by the vertices in the forest
4. Delete any leaf node in T that is not in R
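The four steps above can be sketched in plain Python as follows. This is an illustrative reading of the heuristic, not the authors' implementation: the function names `nearest_from` and `steiner_tree`, the dictionary-based graph representation, and the toy network are assumptions, and the sketch favors clarity over speed (it reruns Dijkstra for every component in every round). Edge weights are assumed to be positive and vertex labels to be mutually comparable (for example, strings).

```python
import heapq

def nearest_from(adj, sources):
    """Multi-source Dijkstra: distance and predecessor from the nearest source."""
    dist = {s: 0.0 for s in sources}
    prev = {s: None for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, prev

def steiner_tree(adj, terminals):
    """Heuristic Steiner tree; returns {vertex: set of tree neighbours}."""
    terminals = set(terminals)
    comp = {t: t for t in sorted(terminals)}   # step 1: forest of isolated DE genes
    while len(set(comp.values())) > 1:         # step 2: join the two closest components
        best = None
        for label in sorted(set(comp.values())):
            dist, prev = nearest_from(adj, [v for v in comp if comp[v] == label])
            for v in comp:
                if comp[v] != label and v in dist and (best is None or dist[v] < best[0]):
                    best = (dist[v], v, prev, label)
        if best is None:
            raise ValueError("terminals are not all connected in the network")
        _, node, prev, label = best
        old = comp[node]
        for v in comp:
            if comp[v] == old:
                comp[v] = label
        while node is not None:                # add the vertices on the connecting path
            comp[node] = label
            node = prev[node]
    nodes = set(comp)
    start = min(nodes)                         # step 3: Prim's MST on the induced subgraph
    tree = {v: set() for v in nodes}
    seen = {start}
    heap = [(w, start, v) for v, w in adj[start].items() if v in nodes]
    heapq.heapify(heap)
    while heap:
        w, u, v = heapq.heappop(heap)
        if v in seen:
            continue
        seen.add(v)
        tree[u].add(v)
        tree[v].add(u)
        for x, wx in adj[v].items():
            if x in nodes and x not in seen:
                heapq.heappush(heap, (wx, v, x))
    leaves = [v for v in tree if len(tree[v]) == 1 and v not in terminals]
    while leaves:                              # step 4: prune non-terminal leaves
        v = leaves.pop()
        (u,) = tree.pop(v)
        tree[u].discard(v)
        if len(tree[u]) == 1 and u not in terminals:
            leaves.append(u)
    return tree

# Toy network (hypothetical names): hub H links DE genes A, B and C;
# A-X-Y-B is a longer detour and Z is a dangling neighbour of C.
edges = [("A", "H"), ("B", "H"), ("C", "H"), ("A", "X"), ("X", "Y"), ("Y", "B"), ("C", "Z")]
adj = {}
for u, v in edges:
    adj.setdefault(u, {})[v] = 1.0
    adj.setdefault(v, {})[u] = 1.0
tree = steiner_tree(adj, {"A", "B", "C"})
```

In this toy case the tree is the star around H, so H is the only vertex added to connect the three DE genes.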
In a Steiner tree, the intermediate vertices (both Steiner vertices and non-leaf DE genes) play important roles in connecting DE genes together. We consider all the intermediate vertices as potential biomarkers (Steiner Tree-based Markers, or STMs) for breast cancer metastasis, as it is known that disease-related genes play an important role in connecting DE genes in PPI networks. For example, for the van de Vijver dataset with the Chuang PPI network, a single Steiner tree uses 136 Steiner vertices to connect 333 DE genes (Figure 1B-C). Among the DE genes, 264 are leaf nodes (degree = 1 in the tree) and the other 69 are internal nodes (degree > 1 in the tree). As these internal DE genes are important in connecting the remaining DE genes, we combine the Steiner vertices with these internal DE genes as potential biomarkers for breast cancer metastasis. Thus, for this single tree, we consider those 205 internal genes (136 Steiner vertices plus 69 internal DE genes) as potential biomarkers (Figure 1B-C).
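The split into Steiner vertices, internal DE genes and leaf DE genes can be illustrated with a short sketch. The tree representation (a dictionary mapping each vertex to its set of tree neighbours), the function name `classify_tree_nodes` and the toy labels are assumptions made for this example.

```python
def classify_tree_nodes(tree, de_genes):
    """Split tree vertices into Steiner vertices, internal DE genes and leaf DE genes."""
    de_genes = set(de_genes)
    steiner = {v for v in tree if v not in de_genes}
    internal_de = {v for v in tree if v in de_genes and len(tree[v]) > 1}
    leaf_de = {v for v in tree if v in de_genes and len(tree[v]) <= 1}
    return steiner, internal_de, leaf_de

# Hypothetical toy tree d1 - s1 - d2 - d3:
# d1, d2, d3 are DE genes and s1 is a Steiner vertex.
toy_tree = {"d1": {"s1"}, "s1": {"d1", "d2"}, "d2": {"s1", "d3"}, "d3": {"d2"}}
steiner, internal_de, leaf_de = classify_tree_nodes(toy_tree, {"d1", "d2", "d3"})

# Candidate markers are the Steiner vertices plus the internal DE genes.
stms = steiner | internal_de
```

Applied to the counts reported in the text, the same union gives 136 + 69 = 205 candidate markers for a single tree.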
Next, we propose a simple strategy to obtain multiple Steiner trees. The motivation is two-fold. First, as the heuristic algorithm does not guarantee optimality, obtaining multiple solutions increases the chance of finding the optimal Steiner vertices. Second, multiple solutions of similar quality may represent alternative or redundant pathways that cannot be covered by a single Steiner tree. To obtain alternative Steiner trees without any modification to our Steiner tree algorithm, we assign each edge in the PPI network a random weight between 0.99 and 1 and run the standard Steiner tree algorithm. These random weights effectively break ties, so that if there are two paths with the same weight in the original network, one of them is chosen at random. This procedure was repeated multiple times with different random weights between 0.99 and 1, until the total number of unique STMs approximately converged. Depending on the PPI network and microarray data, the rate at which new STMs appeared dropped significantly after 200-300 iterations (for example, see Figure 1D). After that, we take the union of all internal nodes (genes) of those trees and consider them as potential biomarkers. As mentioned previously, we call these genes Steiner Tree-based Markers (STMs). For the van de Vijver and Wang datasets, respectively, we obtained 1047 and 1100 STMs with the Chuang PPI network, and 932 and 1135 STMs with the PINA PPI network (see Additional file 1 for the complete gene list).
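The randomization step can be sketched as below. To keep the example short, a shortest path between two terminals stands in for the full Steiner tree routine (for two terminals, the minimum Steiner tree is a shortest path); `path_interior`, `union_of_markers`, the `find_internal` callback and the toy network are all assumptions introduced for illustration.

```python
import heapq
import random

def path_interior(adj, src, dst):
    """Interior vertices of a shortest src-dst path (assumes dst is reachable)."""
    dist, prev = {src: 0.0}, {src: None}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return set(path[1:-1])

def union_of_markers(edges, find_internal, n_iter=200, seed=0):
    """Re-weight edges at random, rerun the tree routine, and pool internal vertices."""
    rng = random.Random(seed)
    markers, growth = set(), []
    for _ in range(n_iter):
        adj = {}
        for u, v in edges:
            w = rng.uniform(0.99, 1.0)    # near-unit weight, used only to break ties
            adj.setdefault(u, {})[v] = w
            adj.setdefault(v, {})[u] = w
        markers |= find_internal(adj)
        growth.append(len(markers))       # track convergence of the marker count
    return markers, growth

# Toy network with two equally short routes between A and B (via P or via Q).
toy_edges = [("A", "P"), ("P", "B"), ("A", "Q"), ("Q", "B")]
markers, growth = union_of_markers(
    toy_edges, lambda adj: path_interior(adj, "A", "B"), n_iter=50
)
```

A single run returns either P or Q, while the pooled set contains both alternative routes. With weights in [0.99, 1], a path of k edges weighs at most k and a path of k + 1 edges weighs at least 0.99(k + 1), so for paths shorter than 99 edges the perturbation never makes a longer path look shorter; it only chooses among paths of equal length.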
where C(n, k) is the binomial coefficient.
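As an illustration of how binomial coefficients enter an overlap significance calculation, the sketch below computes a hypergeometric tail probability. This assumes the overlap is tested with a standard hypergeometric test; the symbols N (background genes), n1 and n2 (sizes of the two gene sets) and m (observed overlap) are names chosen here for the example.

```python
from math import comb

def overlap_pvalue(N, n1, n2, m):
    """P(overlap >= m) when sets of n1 and n2 genes are drawn at random from N genes."""
    upper = min(n1, n2)
    tail = sum(comb(n1, i) * comb(N - n1, n2 - i) for i in range(m, upper + 1))
    return tail / comb(N, n2)

# Toy numbers: two 5-gene sets drawn from 10 genes and overlapping completely.
p_full = overlap_pvalue(10, 5, 5, 5)   # equals 1 / C(10, 5)
```

Each term counts the ways to pick i shared genes from the first set and the remaining n2 - i genes from outside it, divided by the number of ways to pick n2 genes overall.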
To evaluate the predictive ability of different features (STMs and DEs), we built logistic regression and support vector machine (SVM) classifiers to distinguish breast cancer patients who developed metastasis within five years of the initial diagnosis from those who did not. We used the implementations in WEKA (version 3.6.3) with default parameter settings. To avoid overfitting and provide a realistic evaluation, we concentrated on cross-data classification, in which features obtained from one dataset are used to construct classifiers for the other dataset; this is necessary because DEs were selected using the complete dataset and are very specific to that particular dataset. Classification performance was estimated 100 times using 10-fold cross validation, where in each round one-tenth of the data was used for testing and nine-tenths for training. Performance was measured by AUC (area under the ROC curve).
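For reference, AUC can be computed directly from its pairwise definition: the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score. The sketch below is a generic illustration rather than the WEKA implementation; the function name `auc` and the toy scores are assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: labels 1 = metastasis, 0 = no metastasis; scores are classifier outputs.
example = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the toy example three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75; a value of 0.5 corresponds to random ranking.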
Stability of STMs
Table: Cross PPI stability of STMs (labels: van de Vijver dataset; number of common genes (% of overlap); Chuang PPI vs PINA PPI; PINA PPI vs Chuang PPI)
Breast cancer genes in STMs
Breast cancer known genes in STMs (van de Vijver dataset)
Breast cancer known genes in STMs (Wang dataset)
We also collected 288 breast cancer susceptibility genes from the Genetic Association Database of Disease from DAVID. STMs covered 26.39% and 26.04% of those known breast cancer susceptibility genes for the Wang and van de Vijver datasets, respectively. For both datasets, the genes selected by our method clearly outperformed DE genes, which covered 5.55% and 5.28% for the Wang and van de Vijver datasets, respectively. Evaluation using the PINA PPI network gave similar results.
Classification accuracy of STMs
Table: Classification accuracy of STMs (labels: Chuang et al. method; Steiner tree-based method; single gene marker; % increase of AUC; van de Vijver dataset)
In this article we proposed a randomized Steiner tree-based approach that integrates a PPI network and gene expression microarray data for biomarker discovery in breast cancer metastasis. The genes selected by our method are significantly enriched in functional categories and pathways that are known to be involved in cancer development. Furthermore, a significant portion of the genes selected by our method are already known breast cancer susceptibility genes. We applied the method to three different breast cancer microarray datasets and two different PPI networks. For all combinations of microarray and PPI datasets, our approach produced similarly significant results. The reproducibility across different datasets also increases significantly at both the genomic and pathway levels compared to previous studies. Finally, Steiner tree-based markers significantly increase cross-dataset classification accuracy. Thus the method proposed in this article supports the hypothesis that disease-causing genes play a role in connecting differentially expressed genes, and opens a new possibility for identifying the inner dynamics and biomarkers of breast cancer progression.
PPI: protein-protein interaction
STM: Steiner tree-based marker
AUC: area under ROC curve.
Based on “Identification of biomarkers in breast cancer metastasis by integrating protein-protein interaction network and gene expression data”, by Md Jamiul Jahid and Jianhua Ruan, which appeared in Genomic Signal Processing and Statistics (GENSIPS), 2011 IEEE International Workshop on. © 2011 IEEE.
This research was supported in part by NIH grants SC3GM086305, U54CA113001, P30CA054174 and R01CA152063.
This article has been published as part of BMC Genomics Volume 13 Supplement 6, 2012: Selected articles from the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS) 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S6.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.