 Proceedings
 Open access
 Published:
Anomaly detection in gene expression via stochastic models of gene regulatory networks
BMC Genomics volumeÂ 10, ArticleÂ number:Â S26 (2009)
Abstract
Background
The steadystate behaviour of gene regulatory networks (GRNs) can provide crucial evidence for detecting diseasecausing genes. However, monitoring the dynamics of GRNs is particularly difficult because biological data only reflects a snapshot of the dynamical behaviour of the living organism. Also most GRN data and methods are used to provide limited structural inferences.
Results
In this study, the theory of stochastic GRNs, derived from GNetworks, is applied to GRNs in order to monitor their steadystate behaviours. This approach is applied to a simulation dataset which is generated by using the stochastic gene expression model, and observe that the GNetwork properly detects the abnormally expressed genes in the simulation study. In the analysis of real data concerning the cell cycle microarray of budding yeast, our approach finds that the steadystate probability of CLB2 is lower than that of other agents, while most of the genes have similar steadystate probabilities. These results lead to the conclusion that the key regulatory genes of the cell cycle can be expressed in the absence of CLB type cyclines, which was also the conclusion of the original microarray experiment study.
Conclusion
Gnetworks provide an efficient way to monitor steadystate of GRNs. Our method produces more reliable results then the conventional ttest in detecting differentially expressed genes. Also Gnetworks are successfully applied to the yeast GRNs. This study will be the base of further GRN dynamics studies cooperated with conventional GRN inference algorithms.
Background
Identifying the key features and dynamics of gene regulatory networks (GRNs) is an important step towards understanding behaviours of biological systems. Thanks to the development of highthroughput technology, the amount of microarray gene expression data has greatly increased, and numerous mathematical models attempt to explain gene regulations using gene networks [1, 2]. Once a network structure is inferred, its dynamics needs to be considered. However, most methods focus on the inference of network structure which only provides a snapshot of a given dataset. Probabilistic Boolean Networks (PBNs) represent the dynamics of GRNs [3], but PBNs are limited by the computational complexity of the related algorithms [4].
In [5], a new approach to the steadystate analysis of GRNs based on GNetwork theory [6, 7] is proposed, while GNetworks were firstly applied to GRNs with simplifying assumptions concerning gene expression in [8]. However, the GNetwork approach also exhibits specific difficulties because of the large number of parameters that are needed to compute their steadystate solution. Thus, in this study we reduce the number of model parameters on the basis of biological assumptions and focus on estimating two parameters in particular: the total input rate and steadystate probability of a gene.
A GNetwork is a probabilistic queuing network having special customers which include positive and negative "customers", signals and triggers [6, 7]. It was originally developed also as a model of stochastic neuronal networks [9] with "negative and positive signals or spikes" which represent inhibition and excitation. In terms of GRNs, a queue is a "place" in which mRNAs are stored, and an mRNA can be considered to be a "customer" of the GNetwork. The positive and negative signals are interpreted as the protein activities such as transcription factors, inducers and repressors. Note that the customers or signals of the GNetwork can be any biological molecules. However, in our study, we focus on behaviours of mRNAs because the available GRN data are usually mRNA expressions. Each queue has an input and service rates which represent a transcription and degradation processes, respectively. Our interest is to estimate the steadystate probability that a queue is busy, which corresponds to the probability that an mRNA is present, and we are also interested in the total mRNA input rate of each queue. To evaluation the accuracy of the proposed method, we generated a simple simulation dataset by using the stochastic gene expression models processed with the widely accepted Gillespie algorithm [10, 11]. We also examine a real biological dataset obtained from the cell cycle of the budding yeast [12].
Although queueing theory is a common computational tool, GNetworks are an essential departure from queueing theory; in particular conventional queues could not be possibly applied to GRNs because the notion of inhibition does not exist in queueing theory but was introduced by GNetwork theory. There are two other essential novelties in our work. First, our approach enables us to obtain the steadystate of GRNs with only polynomial computational complexity due to the product form solution of GNetworks; the computational cost due to large memory space and nonpolynomial computational complexity are basic limitations in conventional methods such as PBN. Also our method can provide more reliable measures to detect differentially expressed genes in microarray analysis (as shown in our simulation study).
Gnetworks and gene regulatory networks
The GRN model used in this study is the probabilistic gene regulatory model introduced in [5]. In this model, let K_{ i }(t) be integervalued random variables which represent a quantity (mRNA) of the gene i at time t. If the K_{ i }(t) is zero, the gene i cannot interact with other genes. Then we have the following Probabilities,
where Î›_{ i }is the total input rate (sum of transcription rate, Î»_{ i }and increment rate of mRNAs come from outside of system, I_{ i }), Î¼_{ i }is the service rate (e.g. Degradation rate of mRNAs). o(Î”t) â†’ 0 as t â†’ 0. Let r_{ i }is representing the activity (signal process) rate of each gene i. Then 1/r_{ i }is the average time between successive interactions of gene i with other genes. If the i th gene interacts with other genes, the following events occur:

With probability P^{+} (i, j), gene i activates gene j; when this happens, K_{ i }(t) is depleted by 1 and K_{ j }(t) is increased by 1

With probability P^{} (i, j), gene i inhibits gene j; when this happens, both K_{ i }(t) and K_{ j }(t) are depleted by 1

With probability Q(i, j, l) gene i joins with gene j to act upon gene l in excitatory mode, as a result of which both K_{ i }(t) and K_{ j }(t) are reduced by 1, while K_{ l }(t) is increased by 1

With probability d_{ i }, which is defined as follow,
the signal of gene i exits the system so K_{ i }(t) is depleted by 1
Let's define a random process K(t) = [K_{1}(t), ..., K_{ n }(t)], t â‰¥ 0 and an nvector of nonnegative integers k = [k_{1}, ..., k_{ n }]. The P (k, t) is the probability that K(t) takes k at time t, P (k, t) = P (K(t) = k). Then the probability that K(t) have k at time t + Î”t is defined by
where is a vector that the value of i th element is k_{ i }+ 1 (k_{ i } 1) and I(x) is indicator function which is 1 if the condition, x, is satisfied or 0 other wise. The first and second terms describe the increment and decrement of the length of queue i, respectively. Third term is the probability that the gene i is activated but nothing is happened except queue i lose one mRNA. From fourth to sixth terms are the probabilities that gene i is activated and interacts with gene j. The rest terms of (1) represent the probabilities that the interaction of gene i and gene j affect the gene l (length of l th queue). Divide (1) by Î”t and introduce the equilibrium probability distribution of the system P(k) = lim_{t â†’ âˆž}P (k, t) then we obtain following dynamic behaviour,
Now, let's consider following equations, and
Where q_{ i }(= /(r_{ i }+ )) represents the probability that gene i is expressed in steadystate. Using (2) and (3), E. Gelenbe showed the following product form is satisfied [5, 7].
where for any subset I âŠ‚ 1, ..., n such that q_{ m }< 1 for each m âˆˆ I, and I{m_{1}, ..., m_{I}}.
Results and discussion
Simple gene regulatory networks using stochastic gene expression model
In order to assess our GNetwork model, we construct a simple GRN structure and generate the expression data using a synthetic stochastic gene expression model [13, 14]. This stochastic gene expression model has several important features such as protein dimerization [15] and time delay for protein signalling [13]. Figure 1 shows the simulated network structure which is based on the following basic principles: the number of proteins per cell chases the number of mRNAs which in turn chases the number of active genes [14]. Figure 2 depicts the assumptions of our model and (5)~(11) give the corresponding processes (RPo: RNA open complex, Pro: promoter, R: mRNA, P: protein monomer, PP: protein dimmer, 0: degradation, t: time, and Î”t: time increment):
where i, j âˆˆ {A, B, C, D} in Figure 1. In addition, we assume that proteins such as transcription factors and repressors require accumulation times for their activation [11, 13], and use the modified Gillespie algorithm to generate the expression data [10, 11]. The cell growth rate and cell volume are fixed, and we consider five cells. Detailed parameters are summarized in Table 1 with their references.
The transcription process in (5) follows an exponential distribution with transcription initiation rate Î»_{2} [16]. The translation processes are given in (6) and include direct competition between the ribosome binding site and the RNAseE binding site which degrade the mRNAs. Thus the translation process follows a geometric distribution with probability p and busting size b = p(1  p) [13, 16]. T_{ D }is the average time interval between successive competitions, and the number of surviving mRNAs n_{2} in the population after transcription is blocked with n_{2} = n_{2,0} . This is equal to T_{ half }= (log(2)/log(p))T_{ D }[13]. Thus the translation initiation rate, Î»_{3} = 1/T_{ D }, can be computed. The protein dimer association and disassociation rates are k_{a 2}and k_{d 2}, respectively, as shown in (7) and (8) [17]. We also consider the DNAprotein association and disassociation rates (k_{a 1}and k_{d 2}in (9) and (10), respectively) [18]. The degradation rate of mRNA and of proteins are obtained by using the halflife of each molecule (11) [16, 17].
We generate three sets of expression data (Dataset 1, 2, and 3); each dataset has two groups, the normal and the case group. These groups are obtained with the same parameter values except for the transcription initiation rate of G_{ A }in case group is 0.0012 sec^{1} which is half of the transcription rate in normal group, 0.0025 sec^{1}. Both groups are simulated during 3000 seconds. In order to compare these two groups, we perform not only the GNetwork analysis but also the ttest which is widely used to find differentially expressed genes in microarray analysis. Datasets 1 and 2 consist of 50 samples each which are drawn from all the data points. In Dataset 1, the expression of G_{ A }is significantly different (pvalue of ttest < 0.01 in Table 2) while the difference of the G_{ A }expression in Dataset 2 is not significant. The third dataset consists of 500 samples which are randomly chosen from the original observations.
Table 2 summarizes the results of the three datasets. In the case groups of Datasets 1 and 2, both the q_{ A }and Î›_{ A }have the lowest values among the four nodes while the ttest of the G_{ A }expression in Dataset 2 shows that it is not significant (pvalue = 0.202). In the small sample results (Datasets 1 and 2), our method provides consistent results with large sample analysis (Dataset 3). The ratios (case/normal) also show that the q_{ A }and Î›_{ A }, in the case group, are smaller than one while the other ratios stay around one. In Dataset 3, the pvalue of G_{ B }is significant along with that of G_{ A }because the expression of G_{ A }directly affects the expression of G_{ B }. However, G_{ B }is not the causal gene in this study. Our GNetwork analysis reveals that only G_{ A }has lower q and Î› values than other nodes including G_{ B }. All these results concur with the simulation data generated with one half of the normal transcription rate.
Modeling cell cycle gene regulatory networks in budding yeast
The cell cycle regulated transcription and its overall controls have been studied in detail for budding yeast [19]. Recent developments in highthroughput microarray techniques help to reveal many of yeast genes controlling the cell cycle [20] which consists of four distinct phases: Gap1 (G1), Synthesis (S), Gap2 (G2), and Mitosis (M). The cells grow during their G1 and G2 phases and their DNA is replicated during the S phase. In the M phase, cell growth stops and the cell divides into two daughter cells that include nuclear division. Many genes are involved with specific cell cycle phases, but the number of key regulators that are responsible for the control of the cell cycle process is much smaller. Thus, based on published information, we build a cell cycle GRN with the key regulators in budding yeast as shown in Figure 3, although the relationships that contribute to the true regulatory network structure of the cell cycle still remain uncertain. Therefore we simplify the cell cycle network structure by selecting thirteen key regulatory genes (the gray circles in Figure 3) and connect the genes without regard to the transcriptional and posttranscriptional processes. Figure 4 shows the reconstructed regulatory network structure.
The activity of cyclindependent kinases (CDKs) plays an important role in controlling periodic events during cell cycle. Some studies of cell cycle with highthroughput technologies have suggested alternative regulation models of periodic transcription [20]. D. Olando et., al. [12] measured the transcription levels of cell cycle related genes with the use of Yeast 2.0 oligonucleotide array and determined the manner in which transcription factor networks contribute to CDKs and to global regulation of the cellcycle transcription process. This microarray dataset is used in our study with the cell cycle network structure of Figure 4; it consists of two groups: one group is obtained from wildtype (WT) cells and the other is from cyclinmutant (CM) cells which are disrupted for all Sphase and mitotic cyclins (mutate clb1, 2, 3, 4, 5, and 6).
The microarray data consist of a total of 30 data points taken over 270 minutes. We subdivide it into five states (groups), each consisting of 6 data points. The expression levels are transformed by taking the natural logarithm. Figure 5 depicts the transformed expression profiles of the 13 genes with 5 states. The black and gray solid lines are the expression profiles from WT and CM cells, respectively, and S1, S2, ..., S5 represent the five states. It is obvious that the profiles of CLB2 are different between WT and CM cells because the CM dataset is designed to monitor the cell cycle processes without the clb cyclines.
Table 3 summarizes the steadystate probabilities of 13 genes in the cell cycle GRN. All genes have similar steadystate probabilities in the WT and CM cell groups except for CLB2 in the CM group, which has a lower steadystate probability than the elements of the WT group: as shown in Table 3, the ratio of CM/WT is smaller than one (bold letter). This smaller probability can be explained by considering the experimental design of the CM dataset which is obtained without clb cyclines. Also, the original study of this dataset suggested alternative cell cycle regulatory pathways in [12] which had revealed that over 70% of the cell cycle related genes were expressed periodically without the clb cyclines. In our results, the steadystate probabilities of the CM group are consistent with that of the WT group. These results draw the same conclusion as the original study, i.e. that the steadystate of the 12 genes does not entirely depend on the expression of CLB2. Table 4 shows the estimated total input rate of the 13 genes. These results also show that only the input rates of CLB2 decrease in the CM group.
Conclusion
This paper has used the GNetwork approach [5â€“8] to model GRNs. Two model parameters, the steadystate probability, q_{ i }, and the total input rate, Î›_{ I }, are estimated by determining the boundary of Î›_{ i }and using a grid search. We first use simulated gene expression data generated on the basis of a stochastic gene expression model. Two groups (normal and case) of expression data are examined. These two groups are exactly the same except for one parameter, the transcription initiation rate. We have observed that the GNetwork based method is able to detect the abnormally expressed genes, while the ttest produces false positives. Then, using real data, we have observed that the steadystate probability of CLB2 is lower than that of other agents and concluded that the key genes of cell cycle regulation can be expressed without the clb cyclines; this result is consistent with the original experimental study.
However, the unchanged steadystate probabilities in all the five states may need to be considered, because the cell cycle has four phases (G1, S, G2, M) and expressions of genes involved with a specific phase are expected to be different from those in other phases. Also the small decrease rate and relatively large total input rates of CLB2 may require a more careful analysis of the GNetwork approach in relation to cell cycle GRN structure. The manner in which we have used GNetwork models in this paper did not currently include simultaneous interactions with three or more nodes. However this is not really a limiting effect of the model, since it suffices to include chain representations of dependencies in the GNetwork model as has been done for neuronal networks [9] to cover excitatory and inhibitory effects that involve three or more nodes, and in fact random chains of nodes of any length. Although in this study the probabilities that gene i affect gene j, P^{+} (i, j) and P^{} (i, j) in (3), are fixed at the value one, we think that the conventional reverse engineering GRN methods using the "Ensemble" method [21] can provide these probabilities more accurately for an improved steadystate analysis of GRNs.
In conclusion, our study has illustrated the use of GNetworks as a new approach for the steadystate analysis of GRNs, and has shown their usefulness in obtaining quantities such as the effective transcription rate and the steadystate probabilities, using them to detect differentially expressed genes, thus introducing a new approach which differs from more conventional microarray analysis methods. Future research will investigate the ensemble approaches to GRNs [21] based on the GNetwork methodology in [5], which will allow to infer GRN structures, and also to monitor their steadystate behaviour.
Methods
Once a GRN structure is determined, it is necessary to estimate the total input rate (Î›_{ i }) of i th queue and its steadystate probability, (q_{ i }). For the simplicity, the probabilities, P^{+} (i, j), P^{} (i, j), and Q(i, j, l) in (3) are set to be one. Then, it can be rewritten as follows
In (12), the Î›_{ i }and R_{ i }is the total input (Î›_{ i }= Î»_{ i }+ I_{ i }) and total output rates (R_{ i }= r_{ i }+ Î¼_{ i }), respectively. is a function of activation probabilities of genes which affect to gene i positively and is a function of activation probabilities of genes which affect to gene i negatively. We fix the r_{ i }as the number of out degrees of gene i and the degradation rate of mRNA, i, as a constant (Table 1) because the total output rate, R_{ i }is not our interest. Therefore, we need to estimate two parameters, the total input rate, Î›_{ i }, and the steadystate probability, q_{ i }.
Let is the lower bound of the Î›_{ i }, which is larger than zero. The lower bound of total input is regarded as an initial transcription rate without any external input. In this study, we use = 0.0025 [16]. The upper bound of Î›_{ i } is obtained by assuming inputs from other nodes are zero and the queues fully work. That is
where the probabilities and are one.
Let is the initial value of q_{ i }. Then can be obtained as follow,
where x_{ ij }is the observed expression level (number of mRNAs) of i th gene at the j th observation and max(x_{ ij }) is the maximum value among all observed values of i th gene. Let Î›_{ iu }is a value of total input rate between the lower bound and the upper bound of Î›_{ i }(). Then the steadystate probability q_{ i }can be obtained numerically by solving (12) with the and the Î›_{ iu }. Once the steadystate probability is determined, the loglikelihood of the given model can be computed by using (4) which is the same form of the likelihood of geometric distribution. It is known that the loglikelihood of geometric distribution is convex so we choose appropriate value Î›_{ i }which maximizes the loglikelihood function.
For each value of total input, Î›_{ iu }(), we compute the steadystate probability, q_{ iu }, with initial value, , and obtain the loglikelihood score, log L_{ iu }, which is used to choose the optimal I total input rate, ,
Note that the q_{ iu }is a numerical solution of (12) with initial value, , and total input rate, Î›_{ iu }. In order to compute , the initial value, in (13) is substituted by which is a numerical solution of (12) with initial value, , and total input rate, . Then the steadystate probability of gene i, q_{ i }, can be obtained by updating its value iteratively until the d^{(t)}<Î´ where d^{(t)}is the difference between and at t th iteration. In this study, Î´ is 0.0001.
Note
Other papers from the meeting have been published as part of BMC Bioinformatics Volume 10 Supplement 15, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Bioinformatics, available online at http://www.biomedcentral.com/14712105/10?issue=S15.
References
Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. Journal of computational biology. 2000, 7 (34): 601620. 10.1089/106652700750050961.
Schafer J, Strimmer K: An empirical Bayes approach to inferring largescale gene association networks. Bioinformatics. 2005, 21 (6): 754764. 10.1093/bioinformatics/bti062.
Shmulevich I, Gluhovsky I, Hashimoto R, Dougherty E, Zhang W: Steadystate analysis of genetic regulatory networks modelled by probabilistic Boolean networks. Comparative and Functional Genomics. 2003, 4 (6): 601608. 10.1002/cfg.342.
Ching W, Zhang S, Ng M, Akutsu T: An approximation method for solving the steadystate probability distribution of probabilistic Boolean networks. Bioinformatics. 2007, 23 (12): 151110.1093/bioinformatics/btm142.
Gelenbe E: Steadystate solution of probabilistic gene regulatory networks. J Theor Biol Phys Rev E. 2007, 76: 03190310.1103/PhysRevE.76.031903.
Gelenbe E: Productform queueing networks with negative and positive customers. Journal of Applied Probability. 1991, 656663. 10.2307/3214499.
Gelenbe E: Gnetworks with triggered customer movement. Journal of Applied Probability. 1993, 742748. 10.2307/3214781.
Arazi A, BenJacob E, Yechiali U: Bridging genetic networks and queueing theory. Physica A: Statistical Mechanics and its Applications. 2004, 332: 585616. 10.1016/j.physa.2003.07.009.
Gelenbe E, Timotheou S: Random neural networks with synchronized interactions. Neural Computation. 2008, 20 (9): 23082324. 10.1162/neco.2008.0407509.
Gillespie D, et al: Exact stochastic simulation of coupled chemical reactions. The journal of physical chemistry. 1977, 81 (25): 23402361. 10.1021/j100540a008.
Bratsun D, Volfson D, Tsimring L, Hasty J: Delayinduced stochastic oscillations in gene regulation. Proc Natl Acad Sci U S A. 2005, 102 (41): 1459314598. 10.1073/pnas.0503858102.
Orlando D, Lin C, Bernard A, Wang J, Socolar J, Iversen E, Hartemink A, Haase S: Global control of cellcycle transcription by coupled CDK and network oscillators. Nature. 2008, 453 (7197): 944947. 10.1038/nature06955.
McAdams H, Arkin A: Stochastic mechanisms in gene expression. 1997
Paulsson J: Models of stochastic gene expression. Physics of life reviews. 2005, 2 (2): 157175. 10.1016/j.plrev.2005.03.003.
Ribeiro A, Zhu R, Kauffman S: A general modeling strategy for gene regulatory networks with stochastic dynamics. Journal of Computational Biology. 2006, 13 (9): 16301639. 10.1089/cmb.2006.13.1630.
Thattai M, van Oudenaarden A: Intrinsic noise in gene regulatory networks. Proc Natl Acad Sci U S A. 2001, 98 (15): 86148619. 10.1073/pnas.151588598.
Buchler N, Gerland U, Hwa T: Nonlinear protein degradation and the function of genetic circuits. Proc Natl Acad Sci U S A. 2005, 102 (27): 95599564. 10.1073/pnas.0409553102.
Goeddel D, Yansura D, Caruthers M: Binding of synthetic lactose operator DNAs to lactose repressors. Proc Natl Acad Sci U S A. 1977, 74 (8): 32923296. 10.1073/pnas.74.8.3292.
Bloom J, Cross F: Multiple levels of cyclin specificity in cellcycle control. Nature Reviews Molecular Cell Biology. 2007, 8 (2): 149160. 10.1038/nrm2105.
Fink G, Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycleregulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular biology of the cell. 1998, 9 (12): 32733297.
Kim H, Lee J, Park T: Inference of Large Scale Gene Regulatory Networks Using Regressionbased Approach. Journal of Bioinformatics and Computational Biology. 2009,
Golding I, Paulsson J, Zawilski S, Cox E: Realtime kinetics of gene activity in individual bacteria. Cell. 2005, 123 (6): 10251036. 10.1016/j.cell.2005.09.031.
Acknowledgements
Some of this research has been supported by the EU FP7 DIESIS Project.
This article has been published as part of BMC Genomics Volume 10 Supplement 3, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/14712164/10?issue=S3.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Haseong Kim developed the data analysis techniques including synthetic data generation and tested the models on the data. He wrote the first draft of the paper.
E. Gelenbe developed the GNetwork models and the specific application of these models to GRNs. He rewrote the paper for submission, and then finalised the accepted paper in preparation for its publication.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Kim, H., Gelenbe, E. Anomaly detection in gene expression via stochastic models of gene regulatory networks. BMC Genomics 10 (Suppl 3), S26 (2009). https://doi.org/10.1186/1471216410S3S26
Published:
DOI: https://doi.org/10.1186/1471216410S3S26