Reverse engineering module networks by PSO-RNN hybrid modeling
© Zhang et al. 2009
Published: 7 July 2009
Skip to main content
© Zhang et al. 2009
Published: 7 July 2009
Inferring a gene regulatory network (GRN) from high throughput biological data is often an under-determined problem and is a challenging task due to the following reasons: (1) thousands of genes are involved in one living cell; (2) complex dynamic and nonlinear relationships exist among genes; (3) a substantial amount of noise is involved in the data, and (4) the typical small sample size is very small compared to the number of genes. We hypothesize we can enhance our understanding of gene interactions in important biological processes (differentiation, cell cycle, and development, etc) and improve the inference accuracy of a GRN by (1) incorporating prior biological knowledge into the inference scheme, (2) integrating multiple biological data sources, and (3) decomposing the inference problem into smaller network modules.
This study presents a novel GRN inference method by integrating gene expression data and gene functional category information. The inference is based on module network model that consists of two parts: the module selection part and the network inference part. The former determines the optimal modules through fuzzy c-mean (FCM) clustering and by incorporating gene functional category information, while the latter uses a hybrid of particle swarm optimization and recurrent neural network (PSO-RNN) methods to infer the underlying network between modules. Our method is tested on real data from two studies: the development of rat central nervous system (CNS) and the yeast cell cycle process. The results are evaluated by comparing them to previously published results and gene ontology annotation information.
The reverse engineering of GRNs in time course gene expression data is a major obstacle in system biology due to the limited number of time points. Our experiments demonstrate that the proposed method can address this challenge by: (1) preprocessing gene expression data (e.g. normalization and missing value imputation) to reduce the data noise; (2) clustering genes based on gene expression data and gene functional category information to identify biologically meaningful modules, thereby reducing the dimensionality of the data; (3) modeling GRNs with the PSO-RNN method between the modules to capture their nonlinear and dynamic relationships. The method is shown to lead to biologically meaningful modules and networks among the modules.
In recent years, high throughput biotechnologies have made large-scale gene expression surveys a reality. Gene expression data provide an opportunity to directly review the activities of thousands of genes simultaneously. However, computational methods that can handle the complexity (noisy, substantial amount of variables, high dimensionality, etc.) of these biological data are often unavailable . Powerful computational methods and data mining tools are needed for biologically meaningful inferences from gene expression data.
Cluster analysis has been used to separate genes into groups based on their expression profiles , in which similar expression profiles will be more likely in the same group. Although cluster analysis gives insight into the groups of genes that may share similar functions, the inference of the relationships among these groups is beyond what cluster analysis can do.
A variety of continuous or discrete, static or dynamic, quantitative or qualitative models have been proposed for inference of biological networks. These include biochemically driven methods , linear models [4,5], Boolean networks , fuzzy logic [7,8], Bayesian networks , and recurrent neural networks [10–12]. Biochemically inspired models are developed on the basis of the reaction kinetics between different components of a network. However, most of the biochemically relevant reactions under participation of proteins do not follow linear reaction kinetics, and the full network of regulatory reactions is very complex and hard to unravel in a single step. Linear models attempt to solve a weight matrix that represents a series of linear combinations of the expression level of each gene as a function of other genes, which is often underdetermined since gene expression data usually have far fewer dimensions than the number of genes. In a Boolean network, the interactions between genes are modeled as Boolean function. Boolean networks assume that genes are either "on" or "off" and attempt to solve the state transitions for the system. The validity of the assumptions that genes are only in one of these two states has been questioned by a number of researchers, particularly among those in the biological community. In , an approach is proposed based on fuzzy rules of a known activator/repressor model of gene interaction. This algorithm transforms expression values into qualitative descriptors that can be evaluated by using a set of heuristic rules and searches for regulatory triplets consisting of activator, repressor, and target gene. This approach, though logical, is a brute force technique for finding gene relationships. It involves significant computation time, which restricts its practical usefulness. In , we propose the use of clustering as an interface to a fuzzy logic-based method to improve the computational efficiency. In a Bayesian network model, each gene is considered as a random variable and the edges between a pair of genes represent the conditional dependencies entailed in the network structure. Bayesian statistics are applied to find certain network structure and the corresponding model parameters that maximize the posterior probability of the structure given the data. Unfortunately, this learning task is NP-hard, and it also has the underdetermined problem. The recurrent neural network (RNN) model has received considerable attention because it can capture the nonlinear and dynamic aspects of gene regulatory interactions. Several algorithms have been applied for RNN training in network inference tasks, such as fuzzy-logic  and genetic algorithm . In [10,13], we applied particle swarm optimization (PSO) method to train the RNN for network inference, yielding promising results.
As variant sources of biological data are becoming available now, it is very necessary and helpful to infer gene regulatory network (GRN) not only from one single data source, but from data fusion of multiple complementary data sources. A few previous studies combined time course gene expression data with other data sources, such as genomic location data  and sequence motif . Prior knowledge of GRN helps understand gene interactions in important biological processes such as differentiation, cell cycle, and development. Due to the specific properties of gene expression data, the task of inferring GRNs involves several challenges including: (1) living cells contain thousands of genes (high dimensionality); (2) each gene interacts with one or more other genes directly or indirectly with complex dynamic and nonlinear relationships, (3) current technologies generate data that involve a substantial amount of noise, and (4) due to the cost of large-scale gene expression profiling experiments, the sample size is extremely low compared with the number of genes. In this study, we address these challenges by: (1) preprocessing gene expression data (e.g. normalization and missing value imputation) to reduce the data noise; (2) clustering genes with gene expression data and gene functional category information to find the optimal modules with biological significance and reduce the problem dimensionality; (3) modeling GRNs with the particle swarm optimization – recurrent neural network (PSO-RNN) method between the modules to capture their nonlinear and dynamic relationships.
Our previous studies [10,13] demonstrate that we can benefit by incorporating known gene functional category information in terms of improving the inferential power of our framework. Moreover, instead of using fully connected RNN model, we propose a network pruning method to select the statistically significant weights for the final GRN structure using PSO. The hybrid PSO-RNN algorithm is applied to infer networks of interactions from two real-world gene expression data. The inferred GRNs are confirmed with previous studies.
In this section, we demonstrate the inference ability of our proposed method via two experimental studies: the rat central nervous system (CNS) and yeast cell cycle process. Both data were preprocessed in the original studies [16,17]. To proceed with the module network inference process, we first imputed the missing values in the data by using the Bayesian principal component analysis (BPCA) method . Following that, we standardized the data between zero and one.
This case study is based on the data published in , consisting of gene expression levels for 112 genes during the development of the CNS of rats. Each gene was measured at nine different points in time (of which the last, measured for the adult animal, was not used here). The first measurement was made 10 days before birth, and the intervals between measurements were 2 or 3 days in the period before birth and 7 days after birth. The gene functional category information can also be found in .
The yeast cell cycle data presented in  consist of six time series (cln3,clb2,alpha,cdc15,cdc28, andelu) expression measurements of the transcript (mRNA) levels ofS. cerevisiaegenes. 800 genes were identified as cell cycle regulated based on cluster analysis in . Here, we used thecdc15time course data of the 800 genes since it has the largest number of time points (24).
Mapping of expression clusters to functional gene classes.
Reverse engineering of GRNs from time course gene expression data is a major obstacle in system biology due to the limited number of time points. We demonstrate that our method can address this challenge by decomposing the reverse engineering problem into modules, where two steps are involved: the gene expression data is clustered into modules with biological significances to reduce the problem dimensionality, and the network is built based on the expression profiles of modules. We evaluate the performance of the algorithm using two real data sets: rat CNS data and yeast cell cycle data. The results indicate that biologically meaningful modules are selected and biologically plausible networks between modules are estimated. For example, in CNS data, the inferred network at module level is a combination of the networks verified in the other two studies [19,20]. Our future research will focus on network module inference with more detailed gene category/regulation information. Multiple data sources (e.g. ChIP-on-Chip data , motif information, and gene ontology annotation) can be used for this purpose. Also, the data fusion from complementary data sources will not only help solve the underdetermined problem in GRN inference, but also increase the prediction accuracy. Another direction to address the under-determined reverse engineering problem is to decompose the GRN into small subnetworks, called network motifs (NMs) .
The proposed method includes two parts: module selection and network inference. In the module selection part, we cluster the genes by FCM clustering. The optimal number of clusters is determined by the relative entropy estimate method, which incorporates the gene functional category information; each cluster is considered as a module representing certain co-regulated genes. After the modules are determined, the PSO-RNN inference algorithm is applied. In this algorithm, each module is considered as a neuron in the RNN structure, and any regulation between two modules is a weight in the RNN. To find the best fit network among the modules, a generalized PSO method, including basic PSO and neural network pruning technique, is used to determine RNN structure and its parameters.
Clustering has been a major method to partition the genes into groups of co-expressed genes . However, most of these clustering methods are purely data-driven with no prior biological knowledge. Here we present a new clustering method based on FCM clustering. Instead of using purely data-driven estimate methods, we propose a new estimate method to select the optimal number of clusters by incorporating gene functional category information.
FCM is a method of clustering which allows a data point to belong to two or more clusters. The detailed description of FCM method can be found in . Several methods have been used for estimating the optimal number of clusters, e.g. Xie-Beni statistic and gap statistic . Because all these methods are purely data-driven, it is not suitable to estimate the clustering of gene expression data.
where Λ is the sample space ofx. The goal is to identify the clusters with significant relative entropy.
In building an RNN to infer a network of interactions, the identification of the correct structure and determination of the free parameters (weights and biases) to mimic measured data is a challenging task given the limited available quantity of data and complex search space. In this paper, we apply PSO and neural network pruning methods to select the optimal architecture of an RNN and update its free parameters.
wherex i is the gene expression level of thei th gene (1 ≤i≤N),Nis the number of genes in the model),φ(·) is a activation function,w ij represents the effect ofj th gene on thei th gene (1 ≤i,j≤N),b i denotes the bias for thei th gene, andτis the decay rate parameter. The functionφ(·) introduces nonlinearity to the model.
wherex i (t) and are the true and predicted values (expression levels) for thei th neuron (entity) at timet. The goal is to determine the structure and weights that minimize this cost function.
There exist many algorithms for RNN training in the literature, e.g., back-propagation through time (BPTT)  and genetic algorithm (GA) . BPTT is an extension of the standard back-propagation algorithm, using gradient descent method to find the best solution. However, the use of the gradient descent requires the error function to be differentiable, and also makes the procedure easy to get stuck in local minima. GA, inspired by the natural evolution process, has been applied to optimize the GRN in some applications [12,28].
Here, we use PSO  for RNN structure training. It has been shown that PSO requires less computational cost and can achieve faster convergence than conventional back-propagation in training neural networks for approximating a nonlinear function . Compared with GA, PSO is easy to implement and there are few parameters to adjust. Particularly, PSO has memory for the previous best solutions to avoid the possible loss of learned knowledge. All these features make PSO suitable for GRN inference.
In PSO, each particle is represented as a vector and instantaneous trajectory vector , describing its direction of motion in the search space at iterationk. The indexirefers to thei th particle. The core of the PSO algorithm is the position update rule (7) which governs the movement of each of thenparticles through the search space.
The key strength of the PSO algorithm is the interaction among particles. The second term in (7), , is considered to be a "social influence" term. While this term tends to pull the particle towards the globally best solution, the first term, , allows each particle to think for itself. The net combination is an algorithm with excellent trade-off between total swarm convergence, and each particle's capability for global exploration. Moreover, the relative contribution of the two terms is weighted stochastically.
The algorithm consists of repeated application of the velocity and position update rules presented above. Termination can occur by specification of a minimum error criterion, maximum number of iterations, or alternately when the position change of each particle is sufficiently small as to assume that each particle has converged.
PSO Parameter setting
Maximum search space range, |W max |
Acceleration constants,c 1&c 2
Size of swarm
In this section, we illustrate how PSO optimizes the parameters of RNN and how the structure of RNN is pruned to mimic the response of an unknown network of interactions. Since PSO is a stochastic algorithm, a single solution may not reflect the underlying network. We therefore collect a number of solutions from the PSO-RNN algorithm and use them to determine a single output network that receives the majority vote. Specifically, we applied 100 runs for each network inference. If the absolute value of the average of one parameter in hundred runs is larger than its standard deviation, it is said significant and will be selected for the final network, otherwise it will be set to zero. The following reverse engineering procedure is utilized:
1. Run the reverse engineering algorithm without introducing any particular constraints (except the maximum-allowed values) in the network parameters. Perform hundred runs, and select the networks with mean squared error (MSE) less than certain threshold for further network parameter evaluation.
2. Determine the average and standard deviations of the network parameters using the results from Step 1.
3. Set non-significant parameters (if any) to zero. If there is no non-significant parameter, the procedure is stopped.
4. Return to the reverse engineering algorithm, with non-significant weights set to zero. If the results (measured by the fitness) are as good, or almost good, as for the previous sets of runs, form the network averages, and return to Step 3. If instead the results are worse than in the previous run, discontinue the procedure.
Central nervous system
Gene regulatory network
mean square error
particle swarm optimization
recurrent neural network
This article has been published as part ofBMC GenomicsVolume 10 Supplement 1, 2009: The 2008 International Conference on Bioinformatics & Computational Biology (BIOCOMP'08). The full contents of the supplement are available online athttp://www.biomedcentral.com/1471-2164/10?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.