Molecular pathway identification using biological network-regularized logistic models
- Wen Zhang†1, 2,
- Ying-wooi Wan†1, 2, 3,
- Genevera I Allen2, 4,
- Kaifang Pang1, 2,
- Matthew L Anderson3, 6 and
- Zhandong Liu1, 2, 5, 6Email author
© Zhang et al.; licensee BioMed Central Ltd. 2013
Published: 9 December 2013
Selecting genes and pathways indicative of disease is a central problem in computational biology. This problem is especially challenging when parsing multi-dimensional genomic data. A number of tools, such as L1-norm based regularization and its extensions elastic net and fused lasso, have been introduced to deal with this challenge. However, these approaches tend to ignore the vast amount of a priori biological network information curated in the literature.
We propose the use of graph Laplacian regularized logistic regression to integrate biological networks into disease classification and pathway association problems. Simulation studies demonstrate that the performance of the proposed algorithm is superior to elastic net and lasso analyses. Utility of this algorithm is also validated by its ability to reliably differentiate breast cancer subtypes using a large breast cancer dataset recently generated by the Cancer Genome Atlas (TCGA) consortium. Many of the protein-protein interaction modules identified by our approach are further supported by evidence published in the literature. Source code of the proposed algorithm is freely available at http://www.github.com/zhandong/Logit-Lapnet.
Logistic regression with graph Laplacian regularization is an effective algorithm for identifying key pathways and modules associated with disease subtypes. With the rapid expansion of our knowledge of biological regulatory networks, this approach will become more accurate and increasingly useful for mining transcriptomic, epi-genomic, and other types of genome wide association studies.
Technologies for high throughput genetic profiling have revolutionized the study of human development and disease. Expression profiles spanning the entire human genome not only allow investigators to better understand disease subtypes , but also define new categories associated with sensitivity to pharmacologic treatment [2, 3] and other clinical outcomes [4, 5]. A central problem in these genomic studies is to construct an accurate predictive model and delineate specific genes or pathways driving a phenotype. Logistic regression is widely used for classification [6–9]. The number of genes, however, in high-throughput studies are often much larger than the number of specimens in a given study. This limitation causes instability in the algorithms used to select driver genes and poor performance of predictive models. The lasso algorithm for logistic regression was introduced to address these problems and perform feature selection through L1-norm regularization . However, the number of variables selected by the lasso is bounded by the number of observations in an experiment. Furthermore, correlated variables are rarely selected as part of the predictive model at the same time. Various extensions of the Lasso algorithm, such as elastic net, pelora, grouped lasso and fused lasso introduce grouping or smoothness regularization terms to address these limitations [11–14]. Of these, the fused lasso and elastic net have been successfully applied to a largest number of gene expression and genome wide association studies [15–18]. Both fused Lasso and elastic net allow correlated genes or neighboring genes on a chromosome to be selected into a predictive model together. However, these algorithms tend to ignore functional interactions between individual gene products documented in the scientific literature. Integration of network information in the gene marker identification has shown to outperform methods without network information by Chuang et al . Specifically, their networked-based scoring and greedy search algorithm identified more robust gene markers with better prediction accuracy on metastasis status of breast cancer patients in two cohorts. In recent years, vast amounts of data detailing biologic networks has been organized into searchable databases. For example, BioGRID documents protein and genetic interactions from more than 39,991 publications . The KEGG pathway database similarly curates molecular interactions and relational networks representing systemic functions at the level of both the cell and organism . To incorporate biological network information into regression models, a network-constrained regularization algorithm has been previously proposed for use with linear regression . Use of this network-constrained algorithm has been shown to out-perform both lasso and elastic net analyses executed independently of biologic input.
Classification algorithms integrating network structure information have been proposed in other settings. For example, a network constrained support vector machine has been proposed to analyze functional magnetic resonance imaging data  and cancer microarray data [24–26]. Network based prior has also been demonstrated to improve variable selection accuracy under the Bayesian inference framework [27, 28].
Here, we propose a graphical Laplacian network regularized logistic regression method following the framework established by Li et al. . We hypothesize that the integration of biological networks, such as Protein-Protein interactions, will improve prediction accuracy and variable selection in logistic models. To validate use of the proposed algorithm, we studied its theoretical properties and compared its performance to L1-norm regularized logistic regression and elastic net logistic regression on both simulated and real biologic data. Lastly, we demonstrate the utility of the proposed algorithm by using it to differentiate breast cancer subtypes and delineate biologic network modules associated with the triple negative breast cancer (TNBC) subtypes.
Materials and methods
Graph laplacian regularized logistic regression model
The parameter β can be estimated by maximizing the log likelihood function of the logistic model. However, it is well known that this estimation procedure performs poorly for both prediction purposes and variable selection when p ≫ n. Although various sparse parameter estimation procedures have been introduced to address these problems, these approaches tend to generate disconnected biomarkers that are rarely interpretable.
d u , d v are the degrees of nodes u and v respectively. In (3), X i is the ith row of the matrix X. The first term in equation (3) is the negative log likelihood function of the logistic model. The second term is an L1-norm penalization on β, which encourages sparsity on the coefficients. The last term is a generalized L2-norm penalty using the graph Laplacian matrix, which encourages smoothness on coefficients of genes that are connected in the biological network. If we set α = 1, the Logit-Lapnet criteria is equal to the simple lasso logistic regression. When L = I which is identity matrix, the Logit-Lapnet criteria becomes the elastic net logistic regression.
To solve problem (4) we used CVX, a package for specifying and solving convex programs [29, 30]. The convexity property of Logit-Lapnet guarantees an optimal solution using any convex optimization solver.
Theoretical properties of logit-lapnet
To study the behavior of the proposed method, we analyzed the theoretical properties of Logit-Lapnet.
Lemma 1 If , then for any α < 1 and λ > 0.
The proof of this lemma is given in the additional file 1. This lemma states that if two predictors are identical, the coefficients of these two predictors will be the same. In the L1-norm penalized logistic regression, only one predictor is selected.
where is the sample correlation of x i and x j .
The upper bound in (5) provides a quantitative description for the grouping effect of Logit-Lapnet on the network structure. For two highly correlated genes that are connected in a biological network, the difference on the estimated coefficient is almost zero. As α goes to 1, the Lapnet becomes lasso logistic regression and the difference becomes unbounded.
The maximum value of λ is reciprocal to α. It is clear that when α = 1, the Logit-Lapnet criteria becomes Lasso and a small penalization can produce an empty model. When α = 0, the penalization becomes much larger to generate an empty model. In practice, Lemma 2 provides a search guidance on the regularization path. The proof of Lemma 2 is also given in the additional file 1.
Gene expression profiles and differential gene analysis
Gene expression data from breast cancer specimens profiled by TCGA consortium was used to test the proposed Logit-Lapnet method. Level III RNA-Seq data (Illumina HigSeq RNASeqV2 from UNC) from patients with invasive breast carcinoma was obtained from the TCGA data portal https://tcga-data.nci.nih.gov/tcga/ in September, 2012. These data profiled 20501 genes in 806 distinct breast cancer specimens. Normalized read counts were used for all analyses and were log-transformed prior to their use. Genes with normalized read counts less than 10 in more than 90% of patients were excluded from further analyses. The normalized read counts were log-transformed and standardized prior to applying the method.
Of the 806 breast cancers characterized by the TCGA, 261 patients had incomplete information on the three markers used to define TNBC and were excluded. Of the remaining cancers, 85 TNBC and 460 non-TNBC were further separated into training (n = 327; 51 TNBC, 276 non-TNBC) and test (n = 218; 34 TNBC, 184 non-TNBC) sets. A total of 4871 differentially expressed gene products (P ≤ 0.05, t-test; ≥ 1.5 fold-change between TNBC vs. non-TNBC on training) were merged with protein-protein interaction (PPI) networks. This provided us with 797 genes and PPI networks of 937 interactions available for analysis using our proposed algorithm.
Protein-protein interaction (PPI) network
To construct the graph Laplacian matrix for testing the Logit-Lapnet method on breast cancer data, networks of protein-protein interactions (PPI) identified by two-hybrid screening in Homo sapiens were obtained from The Biological General Repository for Interaction Datasets (BioGRID, version 3.2.98) http://www.thebiogrid.org. At the time it was accessed, BioGRID documented 21483 interactions between 7700 genes.
Results and discussion
The clinical variable was defined as y = [Prob (y = 1|X; β) > ϱ] where ϱ ~ U(0,1). Expression levels for the 200 TFs were then simulated using a standard normal distribution. Each TF and its target genes were jointly distributed as a bivariate normal with correlation of 0.7.
where the |·|0 counts the number of non-zero elements in a vector, is the logical not operator on a vector and.* is the element-wise product.
Results of simulations.
Bernoulli Error Loss
Biomarker identification using logit-lapnet in human breast cancers
Breast cancers are clinically categorized according to the expression of several gene products, including the estrogen receptor (ESR1), progesterone receptor (PGR) and human epidermal growth factor receptor 2 (ERBB2). These biomarkers are routinely used not only to define prognosis but also determine treatment. [31, 32]. Of the breast cancer subtypes defined by these biomarkers, triple negative breast cancers (TNBC) lacking expression of ESR1, PAR and ERBB2(her2) are the most clinically aggressive. TNBC demonstrate high rates of disease progression and recurrence . Outcomes for this breast cancer subtype are generally poor, largely due to the fact that treatment options for women with TNBC are limited. However, a subgroup of TNBC are highly sensitive to conventional chemotherapy .
We chose to next validate use of our method by testing its ability to identify the key biomarkers used to define breast cancer subtypes. To accomplish this goal, we used patterns of gene expression in a large set of breast cancer specimens recently profiled by the Cancer Genome Atlas consortium. The tuning parameter for each method was selected through 10-fold cross-validation using the training data. We observed the same classification accuracy in predicting TNBC tumors (95%) when lasso, elastic net and Logit-Lapnet were applied to the testing data. The similar efficacy of these 3 algorithms is not surprising since TNBC tumors are dramatically different from the non-TNBCs in terms of their gene expression profiles.
Use of logit-lapnet for inferring novel biologic relationships
In addition to identifying each of the 3 key biomarkers used to categorize breast cancer subtypes, we found that our method identified multiple subnetworks of gene expression in the breast cancer specimens profiled by TCGA. These subnetworks potentially reflect relationships with clinical or biologic significance. For example, one of subnetworks we identified includes multiple genes (AR, ESR1, MED1, MED24, RARA, PRAME, and HMGA1) involved in steroid hormone signaling. As demonstrated in Figure 2, Logit-Lapnet found that this cluster is closely connected to a known breast cancer gene, GATA3. This connection suggests a functional relationship. Evidence to support this relationship could not be found by directly searching the database of protein-protein interactions used to construct our algorithm. However, at least one report published since our initial analyses has now shown that GATA3 mediates genomic ESR1-binding upstream of FOXA1 . This confirms that the integration of genomic and PPI data by our method has the capacity to identify new and otherwise unanticipated relationships with biologic significance.
Enhanced network specificity provided by logit-lapnet
Another advantageous feature of Logit-Lapnet is the potential functional specificity of the subnetworks delineated by our algorithm. Subnetworks identified by Logit-Lapnet allow investigators to more readily focus on key genes for subsequent downstream functional analyses. For example, the subnetwork connecting AR, ESR1 and RARA identified by our algorithm in TNBC represents a cluster of genes involved in hormone receptor signaling (Figure 2). The subnetwork 2 identified by our method contains the genes PLK1, BCL2, BNIPL, and tumor suppressor p53 binding protein TP53BP2. PLK1, BNIPL and BCL2 are well-known oncogenes and part of TP53 pathway . The expression of BCL2 has been proposed as a prognostic marker for breast cancer patients. This subnetwork is also interesting, as it predicts a relationships between gene products that have been previously shown to impact the G1-S (TP53BP) and G2-M cell cycle checkpoints (PLK1). Dysregulation of both G1-S and G2-M are key hallmarks of human cancer as defined by Weinberg and others. Furthermore, TP53 dysfunction has been previously shown to lead to the overexpression of PLK1 and other gene products important for driving cells with genomic instability through the cell cycle. Thus, the ability of Logit-Lapnet to detect this relationship underscores its capacity to detect key events in breast and other human cancers. These results convey a clear message that interactions between genes in clusters might be biologically relevant.
In contrast, alternative algorithms such as elastic net identify fewer subnetworks containing larger number of genes. This means that subnetworks identified by Logit-Lapnet allow investigators to more readily focus on key targets for subsequent downstream functional validation. Although network modularization methods can be used to further dissemble networks defined by elastic net into smaller modules, these smaller modules may not reflect direct interactions between their individual components. This is because elastic net defines groups of co-expressed genes rather than networks of functionally interacting gene products. Alternatively, use of Logit-Lapnet can be considered to place gene correlations in the context of biologic function. Thus, its use can be reasonably anticipated to define relationships that are more likely to be biologically and clinically relevant.
In summary, the proposed algorithm identified set of genes associated with breast tumors. In addition, integration of PPI in the algorithm enables us to recover the genes not only in association to the breast cancer subtypes but also their interacting partners which are also breast cancer related. Most importantly, from the comparison with elastic net and lasso, our method selected a reasonable size of genes and is the only algorithm capable of identified all three marker genes in defining TNBC.
We have developed a graph Laplacian regularized logistic regression model for selecting the genes or network modules associated with clinical variables from gene expression profiles. Through simulation studies, we have demonstrated that our approach is much more sensitive for identifying clinically relevant genes and network modules. We have presented a case study on network module identification for triple negative breast cancer sub-type using mRNA-seq data. Our results indicate that Logit-Lapnet is a superior algorithm compared to Lasso and elastic net in disease gene and network module selection. Further work to biologically validate our predicted modules is needed to gain a more complete picture of the regulatory process in the TNBC sub-type. Beyond genomics, there are many potential applications of Logit-Lapnet to utilize the rich information provided by network structure, such as metabolomics and proeteomics studies. In conclusion, our work developing the network structure regularized logistic regression model has many implications and has provided a new tool for research in genomic studies.
This work is supported in part through the Collaborative Advances in Biomedical Computing Seed Funding Program at the Ken Kennedy Institute for Information Technology at Rice University supported by the John and Ann Doerr Fund for Computational Biomedicine and through the Center for Computational and Integrative Biomedical Research Seed Funding Program at Baylor College of Medicine. GA is also partially supported by NSF DMS-1209017. ZD is supported by the Houston Bioinformatics Endowment and NSF DMS-1263932.
The publication fee is supported by the Houston Endowment.
This article has been published as part of BMC Genomics Volume 14 Supplement 8, 2013: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S8.
- Golub TR, Slonim DK, Tamayo P, Huard C: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. science. 1999, 286 (5439): 531-537. 10.1126/science.286.5439.531.PubMedView ArticleGoogle Scholar
- Natsoulis G, El Ghaoui L, Lanckriet GRG, Tolley AM, Leroy F, Dunlea S, Eynon BP, Pearson CI, Tugendreich S, Jarnagin K: Classification of a large microarray data set: algorithm comparison and analysis of drug signatures. Genome research. 2005, 15 (5): 724-736. 10.1101/gr.2807605.PubMedPubMed CentralView ArticleGoogle Scholar
- Holleman A, Cheok MH, den Boer ML: Gene-Expression Patterns in Drug-Resistant Acute Lymphoblastic Leukemia Cells and Response to Treatment. New England Journal of Medicine. 2004, 351 (6): 533-542. 10.1056/NEJMoa033513. [PMID: 15295046]PubMedView ArticleGoogle Scholar
- Veer Lv, Dai H, Van De Vijver MJ, He YD: Gene expression profiling predicts clinical outcome of breast cancer. nature. 2002, 415 (6871): 530-536. 10.1038/415530a.View ArticleGoogle Scholar
- Wang Y, Klijn J, Zhang Y, Sieuwerts AM, Look MP: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005, 365: 671-679.PubMedView ArticleGoogle Scholar
- Zhu J, Hastie T: Classification of gene microarrays by penalized logistic regression. Biostatistics. 2004, 5 (3): 427-443. 10.1093/biostatistics/kxg046.PubMedView ArticleGoogle Scholar
- Shen L, Tan EC: Dimension reduction-based penalized logistic regression for cancer classification using microarray data. IEEE/ACM Trans Comput Biol Bioinform. 2005, 2: 166-175. 10.1109/TCBB.2005.22.PubMedView ArticleGoogle Scholar
- Liao JG, Chin KV: Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics. 2007, 23 (15): 1945-1951. 10.1093/bioinformatics/btm287.PubMedView ArticleGoogle Scholar
- Bootkrajang J, Kabán A: Classification of mislabelled microarrays using robust sparse logistic regression. Bioinformatics. 2013, 29 (7): 870-877. 10.1093/bioinformatics/btt078.PubMedView ArticleGoogle Scholar
- Tibshirani R: Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society, Series B. 1994, 58: 267-288.Google Scholar
- Zou H, Hastie T: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005, 67 (2): 301-320. 10.1111/j.1467-9868.2005.00503.x.View ArticleGoogle Scholar
- Dettling M, Bühlmann P: Finding predictive gene groups from microarray data. Special Issue on Multivariate Methods in Genomic Data Analysis. 2004, 90: 106-131.View ArticleGoogle Scholar
- Yuan M, Lin Y: Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006, 68: 49-67. 10.1111/j.1467-9868.2005.00532.x.View ArticleGoogle Scholar
- Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K: Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005, 67: 91-108. 10.1111/j.1467-9868.2005.00490.x.View ArticleGoogle Scholar
- Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics. 2009, 25 (6): 714-721. 10.1093/bioinformatics/btp041.PubMedPubMed CentralView ArticleGoogle Scholar
- Kim S, Xing EP: Statistical estimation of correlated genome associations to a quantitative trait network. PLoS genetics. 2009, 5 (8): e1000587-10.1371/journal.pgen.1000587.PubMedPubMed CentralView ArticleGoogle Scholar
- Tian Z, Zhang H, Kuang R: Sparse Group Selection on Fused Lasso Components for Identifying Group-Specific DNA Copy Number Variations. ICDM'12. 2012, 665-674.Google Scholar
- Liu J, Huang J, Ma S, Wang K: Incorporating group correlations in genome-wide association studies using smoothed group Lasso. Biostatistics. 2013, 14 (2): 205-219.PubMedPubMed CentralGoogle Scholar
- Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007, 3: 140-PubMedPubMed CentralView ArticleGoogle Scholar
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Research. 2006, 34 (suppl 1): D535-D539.PubMedPubMed CentralView ArticleGoogle Scholar
- Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M: KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Research. 2012, 40 (D1): D109-D114. 10.1093/nar/gkr988.PubMedPubMed CentralView ArticleGoogle Scholar
- Li C, Li H: Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008, 24 (9): 1175-1182. 10.1093/bioinformatics/btn081.PubMedView ArticleGoogle Scholar
- Grosenick L, Klingenberg B, Katovich K, Knutson B, Taylor JE: Interpretable whole-brain prediction analysis with GraphNet. NeuroImage. 2013, 72: 304-321.PubMedView ArticleGoogle Scholar
- Chen L, Xuan J, Riggins R, Clarke R, Wang Y: Identifying cancer biomarkers by networkconstrained support vector machines. BMC systems biology. 2011, 5: 161-10.1186/1752-0509-5-161.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhu Y, Shen X, Pan W: Network-based support vector machine for classification of microarray samples. BMC Bioinformatics. 2009, 10 (Suppl 1): S21-10.1186/1471-2105-10-S1-S21.PubMedPubMed CentralView ArticleGoogle Scholar
- Pan W, Xie B, Shen X: Incorporating predictor network in penalized regression with application to microarray data. Biometrics. 2010, 66 (2): 474-484. 10.1111/j.1541-0420.2009.01296.x.PubMedPubMed CentralView ArticleGoogle Scholar
- Stingo FC, Chen YA, Tadesse MG, Vannucci M: Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. The Annals of Applied Statistics. 2011, 5 (3): 1978-2002. 10.1214/11-AOAS463.PubMedView ArticleGoogle Scholar
- Hill S, Neve R, Bayani N, Kuo WL, Ziyad S, Spellman P, Gray J, Mukherjee S: Integrating biological knowledge into variable selection: an empirical Bayes approach with an application in cancer biology. BMC bioinformatics. 2012, 13: 94-10.1186/1471-2105-13-94.PubMedPubMed CentralView ArticleGoogle Scholar
- Grant M, Boyd S: Graph implementations for nonsmooth convex programs. Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences. Edited by: Blondel V, Boyd S, Kimura H. 2008, Springer-Verlag Limited, 95-110. [http://www.stanford.edu/~boyd/papers/pdf/graph_dcp.pdf]Google Scholar
- CVX Research I: CVX: Matlab Software for Disciplined Convex Programming, version 2.0. 2012, [http://cvxr.com/cvx]Google Scholar
- Siegel R, Naishadham D, Jemal A: Cancer statistics, 2013. CA: A Cancer Journal for Clinicians. 2013, 63: 11-30. 10.3322/caac.21166.Google Scholar
- The TCGA: Comprehensive molecular portraits of human breast tumours. Nature. 2012, 490 (7418): 61-70. 10.1038/nature11412.View ArticleGoogle Scholar
- Stevens KN, Vachon CM, Couch FJ: Genetic susceptibility to triple-negative breast cancer. Cancer Res. 2013, 73 (7): 2025-2030. 10.1158/0008-5472.CAN-12-1699.PubMedPubMed CentralView ArticleGoogle Scholar
- Foulkes WD, Smith IE, Reis-Filho JS: Triple-negative breast cancer. N Engl J Med. 2010, 363 (20): 1938-1948. 10.1056/NEJMra1001389.PubMedView ArticleGoogle Scholar
- Theodorou V, Stark R, Menon S, Carroll JS: GATA3 acts upstream of FOXA1 in mediating ESR1 binding by shaping enhancer accessibility. Genome Research. 2013, 23: 12-22. 10.1101/gr.139469.112.PubMedPubMed CentralView ArticleGoogle Scholar
- Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer. 2004, 4 (3): 177-183. 10.1038/nrc1299.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.