Volume 10 Supplement 1
Reverse engineering module networks by PSO-RNN hybrid modeling
© Zhang et al; licensee BioMed Central Ltd. 2009
Published: 7 July 2009
Inferring a gene regulatory network (GRN) from high throughput biological data is often an under-determined problem and is a challenging task due to the following reasons: (1) thousands of genes are involved in one living cell; (2) complex dynamic and nonlinear relationships exist among genes; (3) a substantial amount of noise is involved in the data, and (4) the typical small sample size is very small compared to the number of genes. We hypothesize we can enhance our understanding of gene interactions in important biological processes (differentiation, cell cycle, and development, etc) and improve the inference accuracy of a GRN by (1) incorporating prior biological knowledge into the inference scheme, (2) integrating multiple biological data sources, and (3) decomposing the inference problem into smaller network modules.
This study presents a novel GRN inference method by integrating gene expression data and gene functional category information. The inference is based on module network model that consists of two parts: the module selection part and the network inference part. The former determines the optimal modules through fuzzy c-mean (FCM) clustering and by incorporating gene functional category information, while the latter uses a hybrid of particle swarm optimization and recurrent neural network (PSO-RNN) methods to infer the underlying network between modules. Our method is tested on real data from two studies: the development of rat central nervous system (CNS) and the yeast cell cycle process. The results are evaluated by comparing them to previously published results and gene ontology annotation information.
The reverse engineering of GRNs in time course gene expression data is a major obstacle in system biology due to the limited number of time points. Our experiments demonstrate that the proposed method can address this challenge by: (1) preprocessing gene expression data (e.g. normalization and missing value imputation) to reduce the data noise; (2) clustering genes based on gene expression data and gene functional category information to identify biologically meaningful modules, thereby reducing the dimensionality of the data; (3) modeling GRNs with the PSO-RNN method between the modules to capture their nonlinear and dynamic relationships. The method is shown to lead to biologically meaningful modules and networks among the modules.
In recent years, high throughput biotechnologies have made large-scale gene expression surveys a reality. Gene expression data provide an opportunity to directly review the activities of thousands of genes simultaneously. However, computational methods that can handle the complexity (noisy, substantial amount of variables, high dimensionality, etc.) of these biological data are often unavailable . Powerful computational methods and data mining tools are needed for biologically meaningful inferences from gene expression data.
Cluster analysis has been used to separate genes into groups based on their expression profiles , in which similar expression profiles will be more likely in the same group. Although cluster analysis gives insight into the groups of genes that may share similar functions, the inference of the relationships among these groups is beyond what cluster analysis can do.
A variety of continuous or discrete, static or dynamic, quantitative or qualitative models have been proposed for inference of biological networks. These include biochemically driven methods , linear models [4, 5], Boolean networks , fuzzy logic [7, 8], Bayesian networks , and recurrent neural networks [10–12]. Biochemically inspired models are developed on the basis of the reaction kinetics between different components of a network. However, most of the biochemically relevant reactions under participation of proteins do not follow linear reaction kinetics, and the full network of regulatory reactions is very complex and hard to unravel in a single step. Linear models attempt to solve a weight matrix that represents a series of linear combinations of the expression level of each gene as a function of other genes, which is often underdetermined since gene expression data usually have far fewer dimensions than the number of genes. In a Boolean network, the interactions between genes are modeled as Boolean function. Boolean networks assume that genes are either "on" or "off" and attempt to solve the state transitions for the system. The validity of the assumptions that genes are only in one of these two states has been questioned by a number of researchers, particularly among those in the biological community. In , an approach is proposed based on fuzzy rules of a known activator/repressor model of gene interaction. This algorithm transforms expression values into qualitative descriptors that can be evaluated by using a set of heuristic rules and searches for regulatory triplets consisting of activator, repressor, and target gene. This approach, though logical, is a brute force technique for finding gene relationships. It involves significant computation time, which restricts its practical usefulness. In , we propose the use of clustering as an interface to a fuzzy logic-based method to improve the computational efficiency. In a Bayesian network model, each gene is considered as a random variable and the edges between a pair of genes represent the conditional dependencies entailed in the network structure. Bayesian statistics are applied to find certain network structure and the corresponding model parameters that maximize the posterior probability of the structure given the data. Unfortunately, this learning task is NP-hard, and it also has the underdetermined problem. The recurrent neural network (RNN) model has received considerable attention because it can capture the nonlinear and dynamic aspects of gene regulatory interactions. Several algorithms have been applied for RNN training in network inference tasks, such as fuzzy-logic  and genetic algorithm . In [10, 13], we applied particle swarm optimization (PSO) method to train the RNN for network inference, yielding promising results.
As variant sources of biological data are becoming available now, it is very necessary and helpful to infer gene regulatory network (GRN) not only from one single data source, but from data fusion of multiple complementary data sources. A few previous studies combined time course gene expression data with other data sources, such as genomic location data  and sequence motif . Prior knowledge of GRN helps understand gene interactions in important biological processes such as differentiation, cell cycle, and development. Due to the specific properties of gene expression data, the task of inferring GRNs involves several challenges including: (1) living cells contain thousands of genes (high dimensionality); (2) each gene interacts with one or more other genes directly or indirectly with complex dynamic and nonlinear relationships, (3) current technologies generate data that involve a substantial amount of noise, and (4) due to the cost of large-scale gene expression profiling experiments, the sample size is extremely low compared with the number of genes. In this study, we address these challenges by: (1) preprocessing gene expression data (e.g. normalization and missing value imputation) to reduce the data noise; (2) clustering genes with gene expression data and gene functional category information to find the optimal modules with biological significance and reduce the problem dimensionality; (3) modeling GRNs with the particle swarm optimization – recurrent neural network (PSO-RNN) method between the modules to capture their nonlinear and dynamic relationships.
Our previous studies [10, 13] demonstrate that we can benefit by incorporating known gene functional category information in terms of improving the inferential power of our framework. Moreover, instead of using fully connected RNN model, we propose a network pruning method to select the statistically significant weights for the final GRN structure using PSO. The hybrid PSO-RNN algorithm is applied to infer networks of interactions from two real-world gene expression data. The inferred GRNs are confirmed with previous studies.
Results and discussion
In this section, we demonstrate the inference ability of our proposed method via two experimental studies: the rat central nervous system (CNS) and yeast cell cycle process. Both data were preprocessed in the original studies [16, 17]. To proceed with the module network inference process, we first imputed the missing values in the data by using the Bayesian principal component analysis (BPCA) method . Following that, we standardized the data between zero and one.
Rat CNS data
This case study is based on the data published in , consisting of gene expression levels for 112 genes during the development of the CNS of rats. Each gene was measured at nine different points in time (of which the last, measured for the adult animal, was not used here). The first measurement was made 10 days before birth, and the intervals between measurements were 2 or 3 days in the period before birth and 7 days after birth. The gene functional category information can also be found in .
Yeast cell cycle data
The yeast cell cycle data presented in  consist of six time series (cln3, clb2, alpha, cdc15, cdc28, and elu) expression measurements of the transcript (mRNA) levels of S. cerevisiae genes. 800 genes were identified as cell cycle regulated based on cluster analysis in . Here, we used the cdc15 time course data of the 800 genes since it has the largest number of time points (24).
Mapping of expression clusters to functional gene classes.
Reverse engineering of GRNs from time course gene expression data is a major obstacle in system biology due to the limited number of time points. We demonstrate that our method can address this challenge by decomposing the reverse engineering problem into modules, where two steps are involved: the gene expression data is clustered into modules with biological significances to reduce the problem dimensionality, and the network is built based on the expression profiles of modules. We evaluate the performance of the algorithm using two real data sets: rat CNS data and yeast cell cycle data. The results indicate that biologically meaningful modules are selected and biologically plausible networks between modules are estimated. For example, in CNS data, the inferred network at module level is a combination of the networks verified in the other two studies [19, 20]. Our future research will focus on network module inference with more detailed gene category/regulation information. Multiple data sources (e.g. ChIP-on-Chip data , motif information, and gene ontology annotation) can be used for this purpose. Also, the data fusion from complementary data sources will not only help solve the underdetermined problem in GRN inference, but also increase the prediction accuracy. Another direction to address the under-determined reverse engineering problem is to decompose the GRN into small subnetworks, called network motifs (NMs) .
The proposed method includes two parts: module selection and network inference. In the module selection part, we cluster the genes by FCM clustering. The optimal number of clusters is determined by the relative entropy estimate method, which incorporates the gene functional category information; each cluster is considered as a module representing certain co-regulated genes. After the modules are determined, the PSO-RNN inference algorithm is applied. In this algorithm, each module is considered as a neuron in the RNN structure, and any regulation between two modules is a weight in the RNN. To find the best fit network among the modules, a generalized PSO method, including basic PSO and neural network pruning technique, is used to determine RNN structure and its parameters.
Clustering has been a major method to partition the genes into groups of co-expressed genes . However, most of these clustering methods are purely data-driven with no prior biological knowledge. Here we present a new clustering method based on FCM clustering. Instead of using purely data-driven estimate methods, we propose a new estimate method to select the optimal number of clusters by incorporating gene functional category information.
FCM is a method of clustering which allows a data point to belong to two or more clusters. The detailed description of FCM method can be found in . Several methods have been used for estimating the optimal number of clusters, e.g. Xie-Beni statistic and gap statistic . Because all these methods are purely data-driven, it is not suitable to estimate the clustering of gene expression data.
Estimating the number of modules
where Λ is the sample space of x. The goal is to identify the clusters with significant relative entropy.
In building an RNN to infer a network of interactions, the identification of the correct structure and determination of the free parameters (weights and biases) to mimic measured data is a challenging task given the limited available quantity of data and complex search space. In this paper, we apply PSO and neural network pruning methods to select the optimal architecture of an RNN and update its free parameters.
where x i is the gene expression level of the i th gene (1 ≤ i ≤ N), N is the number of genes in the model), φ(·) is a activation function, w ij represents the effect of j th gene on the i th gene (1 ≤ i, j ≤ N), b i denotes the bias for the i th gene, and τ is the decay rate parameter. The function φ (·) introduces nonlinearity to the model.
where x i (t) and are the true and predicted values (expression levels) for the i th neuron (entity) at time t. The goal is to determine the structure and weights that minimize this cost function.
There exist many algorithms for RNN training in the literature, e.g., back-propagation through time (BPTT)  and genetic algorithm (GA) . BPTT is an extension of the standard back-propagation algorithm, using gradient descent method to find the best solution. However, the use of the gradient descent requires the error function to be differentiable, and also makes the procedure easy to get stuck in local minima. GA, inspired by the natural evolution process, has been applied to optimize the GRN in some applications [12, 28].
Here, we use PSO  for RNN structure training. It has been shown that PSO requires less computational cost and can achieve faster convergence than conventional back-propagation in training neural networks for approximating a nonlinear function . Compared with GA, PSO is easy to implement and there are few parameters to adjust. Particularly, PSO has memory for the previous best solutions to avoid the possible loss of learned knowledge. All these features make PSO suitable for GRN inference.
In PSO, each particle is represented as a vector and instantaneous trajectory vector , describing its direction of motion in the search space at iteration k. The index i refers to the i th particle. The core of the PSO algorithm is the position update rule (7) which governs the movement of each of the n particles through the search space.
The key strength of the PSO algorithm is the interaction among particles. The second term in (7), , is considered to be a "social influence" term. While this term tends to pull the particle towards the globally best solution, the first term, , allows each particle to think for itself. The net combination is an algorithm with excellent trade-off between total swarm convergence, and each particle's capability for global exploration. Moreover, the relative contribution of the two terms is weighted stochastically.
The algorithm consists of repeated application of the velocity and position update rules presented above. Termination can occur by specification of a minimum error criterion, maximum number of iterations, or alternately when the position change of each particle is sufficiently small as to assume that each particle has converged.
PSO Parameter setting
Maximum search space range, |W max |
Acceleration constants, c1 &c2
Size of swarm
PSO-RNN hybrid algorithm
In this section, we illustrate how PSO optimizes the parameters of RNN and how the structure of RNN is pruned to mimic the response of an unknown network of interactions. Since PSO is a stochastic algorithm, a single solution may not reflect the underlying network. We therefore collect a number of solutions from the PSO-RNN algorithm and use them to determine a single output network that receives the majority vote. Specifically, we applied 100 runs for each network inference. If the absolute value of the average of one parameter in hundred runs is larger than its standard deviation, it is said significant and will be selected for the final network, otherwise it will be set to zero. The following reverse engineering procedure is utilized:
1. Run the reverse engineering algorithm without introducing any particular constraints (except the maximum-allowed values) in the network parameters. Perform hundred runs, and select the networks with mean squared error (MSE) less than certain threshold for further network parameter evaluation.
2. Determine the average and standard deviations of the network parameters using the results from Step 1.
3. Set non-significant parameters (if any) to zero. If there is no non-significant parameter, the procedure is stopped.
4. Return to the reverse engineering algorithm, with non-significant weights set to zero. If the results (measured by the fitness) are as good, or almost good, as for the previous sets of runs, form the network averages, and return to Step 3. If instead the results are worse than in the previous run, discontinue the procedure.
List of abbreviations used
Central nervous system
Gene regulatory network
mean square error
particle swarm optimization
recurrent neural network.
This article has been published as part of BMC Genomics Volume 10 Supplement 1, 2009: The 2008 International Conference on Bioinformatics & Computational Biology (BIOCOMP'08). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/10?issue=S1.
- Nasmyth K, Dirick L: The role of SW14 and SW16 in the activity of Gl cyclins in yeast. Cell. 1991, 66: 995-1013.View ArticlePubMed
- D'Haeseleer P, Liang S, Somogyi R: Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics. 2000, 16 (8): 707-726.View ArticlePubMed
- Naraghi M, Neher E: Linearized buffered Ca2+ diffusion in microdomains and its implications for calculation of [Ca2+] at the mouth of a calcium channel. J Neurosci. 1997, 17 (18): 6961-6973.PubMed
- Chen T, He HL, Church GM: Modeling gene expression with differential equations. Pac Symp Biocomput. 1999, 29-40.
- D'Haeseleer P, Wen X, Fuhrman S, Somogyi R: Linear modeling of mRNA expression levels during CNS development and injury. Pac Symp Biocomput. 1999, 41-52.
- Shmulevich I, Dougherty ER, Kim S, Zhang W: Probabilistic Boolean Networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics. 2002, 18 (2): 261-274.View ArticlePubMed
- Woolf PJ, Wang Y: A fuzzy logic approach to analyzing gene expression data. Physiol Genomics. 2000, 3 (1): 9-15.PubMed
- Ressom H, Reynolds R, Varghese RS: Increasing the efficiency of fuzzy logic-based gene expression data analysis. Physiol Genomics. 2003, 13 (2): 107-117.View ArticlePubMed
- Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. J Comput Biol. 2000, 7: 601-620.View ArticlePubMed
- Ressom HW, Zhang Y, Xuan J, Wang J, Clarke R: Inferring network interactions using recurrent neural networks and particle swarm optimization. Proceedings of the First International Conference on Computational Systems Biology. 2006
- Maraziotis I, Dragomir A, Bezerianos A: Gene networks inference from expression data using a recurrent neuro-fuzzy approach. Conf Proc IEEE Eng Med Biol Soc: 2005. 2005, 4834-4837.
- Chiang JH, Chao SY: Modeling human cancer-related regulatory modules by GA-RNN hybrid algorithms. BMC Bioinformatics. 2007, 8: 91-PubMed CentralView ArticlePubMed
- Zhang Y, Xuan J, de los Reyes BG, Clarke R, Ressom HW: Network motif-based identification of transcription factor-target gene relationships by integrating multi-source biological data. BMC Bioinformatics. 2008, 9: 203-PubMed CentralView ArticlePubMed
- Yeang C, Jaakkola T: Time Series Analysis of Gene Expression and Location Data. Proc Of the 3-rd IEEE Symposium on Bioinformatics and BioEngineering (BIBE'03): 2003. 2003
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet. 1999, 22 (3): 281-285.View ArticlePubMed
- Wen X, Fuhrman S, Michaels GS, Carr DB, Smith S, Barker JL, Somogyi R: Large-scale temporal gene expression mapping of central nervous system development. Proc Natl Acad Sci USA. 1998, 95 (1): 334-339.PubMed CentralView ArticlePubMed
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9 (12): 3273-3297.PubMed CentralView ArticlePubMed
- Oba S, Sato MA, Takemasa I, Monden M, Matsubara K, Ishii S: A Bayesian missing value estimation method for gene expression profile data. Bioinformatics. 2003, 19 (16): 2088-2096.View ArticlePubMed
- Tang Y, Kitisin K, Jogunoori W, Li C, Deng CX, Mueller SC, Ressom HW, Rashid A, He AR, Mendelson JS, et al: Progenitor/stem cells give rise to liver cancer due to aberrant TGF-beta and IL-6 signaling. Proc Natl Acad Sci USA. 2008, 105 (7): 2445-2450.PubMed CentralView ArticlePubMed
- Wahde M, Hertz J: Modeling genetic regulatory dynamics in neural development. J Comput Biol. 2001, 8 (4): 429-442.View ArticlePubMed
- Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002, 298 (5594): 799-804.View ArticlePubMed
- Alon U: Network motifs: theory and experimental approaches. Nat Rev Genet. 2007, 8 (6): 450-461.View ArticlePubMed
- Ben-Dor A, Shamir R, Yakhini Z: Clustering gene expression patterns. J Comput Biol. 1999, 6 (3–4): 281-297.View ArticlePubMed
- Xie XL, Beni G: A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1991, 13 (8): 841-847.View Article
- Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a datset via the Gap statistic. Royal Statistical Society: Series B (Statistical Methodology). 2001, 63 (2): 411-423.View Article
- Xu R, Wunsch DC: Gene regulatory networks inference with recurrent neural network models. IEEE International Joint Conference on Neural Networks: 31 July–4 Aug. 2005. 2005, 286-291.
- Werbos PJ: Backpropagation Through Time: What It Does And How to Do It. Proceedings of IEEE. 1990, 78 (10): 1550-1560.View Article
- Keedwell E, Narayanan A: Discovering gene regulatory networks with a neural-genetic hybrid. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2005, 2 (3): 231-243.View ArticlePubMed
- Kennedy J, Eberhart RC: Particle swarm optimization. Proceedings of the 1995 IEEE International Conference on Neural Networks (Perth, Australia). 1995, IV: 1942-1948.
- Gudise V, Venayagamoorthy G: Comparison of Particle Swarm Optimization and Backpropagation as Training Algorithms for Neural Networks. Proceedings of the 2003 IEEE Svarm Intelligence Symposium. 2003, 110-117.
- Birge B: PSOt – a particle swarm optimization toolbox for use with Matlab. Swarm Intelligence Symposium, 2003 SIS '03 Proceedings of the 2003 IEEE: 2003. 2003, 182-186.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.