Volume 13 Supplement 1
Selected articles from the Tenth Asia Pacific Bioinformatics Conference (APBC 2012)
Identifying regulational alterations in gene regulatory networks by state space representation of vector autoregressive models and variational annealing
- Kaname Kojima^{1},
- Seiya Imoto^{1}Email author,
- Rui Yamaguchi^{1},
- André Fujita^{2},
- Mai Yamauchi^{1},
- Noriko Gotoh^{1} and
- Satoru Miyano^{1}
DOI: 10.1186/1471-2164-13-S1-S6
© Kojima et al.; licensee BioMed Central Ltd. 2012
Published: 17 January 2012
Abstract
Background
In the analysis of effects by cell treatment such as drug dosing, identifying changes on gene network structures between normal and treated cells is a key task. A possible way for identifying the changes is to compare structures of networks estimated from data on normal and treated cells separately. However, this approach usually fails to estimate accurate gene networks due to the limited length of time series data and measurement noise. Thus, approaches that identify changes on regulations by using time series data on both conditions in an efficient manner are demanded.
Methods
We propose a new statistical approach that is based on the state space representation of the vector autoregressive model and estimates gene networks on two different conditions in order to identify changes on regulations between the conditions. In the mathematical model of our approach, hidden binary variables are newly introduced to indicate the presence of regulations on each condition. The use of the hidden binary variables enables an efficient data usage; data on both conditions are used for commonly existing regulations, while for condition specific regulations corresponding data are only applied. Also, the similarity of networks on two conditions is automatically considered from the design of the potential function for the hidden binary variables. For the estimation of the hidden binary variables, we derive a new variational annealing method that searches the configuration of the binary variables maximizing the marginal likelihood.
Results
For the performance evaluation, we use time series data from two topologically similar synthetic networks, and confirm that our proposed approach estimates commonly existing regulations as well as changes on regulations with higher coverage and precision than other existing approaches in almost all the experimental settings. For a real data application, our proposed approach is applied to time series data from normal Human lung cells and Human lung cells treated by stimulating EGF-receptors and dosing an anticancer drug termed Gefitinib. In the treated lung cells, a cancer cell condition is simulated by the stimulation of EGF-receptors, but the effect would be counteracted due to the selective inhibition of EGF-receptors by Gefitinib. However, gene expression profiles are actually different between the conditions, and the genes related to the identified changes are considered as possible off-targets of Gefitinib.
Conclusions
From the synthetically generated time series data, our proposed approach can identify changes on regulations more accurately than existing methods. By applying the proposed approach to the time series data on normal and treated Human lung cells, candidates of off-target genes of Gefitinib are found. According to the published clinical information, one of the genes can be related to a factor of interstitial pneumonia, which is known as a side effect of Gefitinib.
Background
Gene network estimation from time series gene expression data is a key task for elucidating cellular systems. Thus far, wide variety of approaches have been proposed based on the vector autoregressive (VAR) model [1, 3], the state space model [4–6], and the dynamic Bayesian network [7, 8]. Recently, time series gene expression data on multiple conditions aiming at analyzing effects of cell treatment such as drug dosing and heat shock are available. We here assume that some gene regulations are disrupted but many of the gene regulations do not change due to some treatment of interest, and try to find a small number of changes on regulations as keys for elucidating effects by the treatment.
A possible way for finding changes on regulations is to estimate networks from two data sets separately and then compare their structures. However, due to the limited length of time series data (usually less than 10 time points) and unignorable measurement noise, networks are estimated with high error rates and the estimation errors cause the serious failure on identifying changes on regulations. Thus, approaches using two time series data in an efficient manner are strongly demanded. Also, widely used statistical methods such as the VAR model and dynamic Bayesian network assume equally spaced time points in time series data. However, observed time points on usually available time series data are not equally spaced [5, 6, 9], and approaches that can handle unequally spaced time series data in a theoretically correct way should be considered.
We propose a new statistical model that estimates gene networks on two different conditions in order to identify changes on regulations between the conditions. As the basis of the proposed model, we employ the state space representation for VAR model (VAR-SSM), in which observation noise is considered between the measured or observed gene expressions and the true gene expressions in observation model and gene regulations between true gene expressions are considered in the system model [10]. The VAR-SSM can handle unequally spaced time series data by ignoring observation model on the non-observed time points. For considering the changes on regulations, we introduce hidden variables to the VAR-SSM in order to indicate the presence of regulations in each condition. If hidden binary variables on two conditions indicating the presence of a regulation are both estimated as one, the regulation is considered as a commonly existing regulation. On the other hand, if only one of the hidden binary variables for the regulation is estimated as one, the regulation is considered as a condition specific regulation. We also introduce a potential function between the hidden binary variables that is designed to take high probability if the hidden binary variables on two conditions take the same value. From the design of the potential function, the similarity of networks on two conditions is automatically considered. Since the time series data on both conditions are used for estimating commonly existing regulation due to the use of the hidden binary variables, an efficient data assignment is achieved. In addition, from the more accurate estimation of commonly existing regulations by the efficient data assignment, accurate identification of changes on regulations is induced.
The hidden binary variables are estimated by searching the configuration of binary variables that maximizes the marginal likelihood of the model. However, searching the optimal configuration is computationally intractable. Thus, as an alternative approach, we derive a new variational annealing method based on [11] in order to estimate the hidden binary variables. We also give a proof for the effectiveness of the variational annealing compared to other candidate alternatives, the variational annealing and the EM algorithm, in order to show the validity of using the variational annealing.
For the performance evaluation, we generate two regulatory networks in such a way that most of the regulations commonly exist and some exist only on one of the networks. We then apply our proposed approach and existing var model based and dynamic Bayesian network based approaches to two equally spaced time series data drawn separately from the generated networks. From the comparisons of true positive rates and false positive rates of these approaches, we confirm the effectiveness of our approach. We also generate unequally spaced time series data from these networks, and show that our approach works correctly on unequally spaced time series data while the performance of the existing approaches assuming equally spaced time points is drastically worsened.
Our proposed approach is used to analyze changes on regulations in gene networks between normal Human lung cells and Human lung cells treated by stimulating EGF-receptors and dosing an anticancer drug termed Gefitinib. A lung cancer condition is simulated by the stimulation of EGF-receptors in the treated cells. Since Gefitinib is known as a selective inhibitor of EGF-receptors, the stimulation of EGF-receptors would be counteracted by Gefitinib, and hence the treated cells are expected to be the same condition as normal cells. However, gene expression profiles from normal and treated cells are actually different, and off-targets of Gefitinib causing unexpected positive or negative effects are implied. We focus on genes with changes on regulations between the networks estimated by our approach and find possible off-target genes of Gefitinib. According to the published clinical information, one of the possible off-target genes is suggested as one of factors of interstitial pneumonia, which is known as a side effect of Gefitinib.
Methods
Vector autoregressive model and its state space representation
Vector autoregressive model
where A is a p × p autoregressive coefficient matrix, and ε_{ t }is observation noise at time t and follows $\mathcal{N}\left(0,\mathsf{\text{diag}}\left[{\sigma}_{1}^{2},...,{\sigma}_{p}^{2}\right]\right)$, a normal distribution with mean 0 and variance $\mathsf{\text{diag}}\left[{\sigma}_{1}^{2},...,{\sigma}_{p}^{2}\right]$. The (i, j)th element of A, A_{ ij }, indicates a temporal regulation from the j th gene to the i th gene, and if A_{ ij }≠ 0, regulation from the j th gene to the i th gene is considered. By examining whether A_{ ij }is zero or not for all i and j, a gene network is constructed. Since equally space time points are assumed in the VAR model, it has difficulty on handling unequally spaced time series data.
State space representation of VAR model (VAR-SSM)
where ρ_{ t }is the observation noise normally distributed with mean 0 and variance R = diag[r_{1},..., r_{ p }]. Unequally spaced time series data are handled by ignoring observation model at non-observed time points.
Joint model of VAR-SSM for two time series data
where ∘ denotes the Hadamard product, E^{(c)}is a p × p binary matrix, and ${\mathit{\eta}}_{t}^{\left(c\right)}$ and ${\mathit{\rho}}_{t}^{\left(c\right)}$ are respectively system and observation noises from $\mathcal{N}\left(0,H\right)$ and $\mathcal{N}\left(0,R\right)$. In this model, the (i, j)th element ${E}_{ij}^{\left(c\right)}$ takes one if regulation from gene j to gene i exists on condition c and zero otherwise, i.e., the presence of regulations is controlled by E^{(c)}and the AR coefficient matrix A is commonly used in conditions 1 and 2. Changes on regulations are identified when regulations exist only in a condition.
Finding the optimal configuration of E is computationally intractable, and heuristics approaches such as the EM algorithm and the variational method are used in practice. Here, we use the variational annealing, an extension of the deterministic annealing for discrete variables [11]. In the next section, we give a small explanation of the variational annealing and show its effectiveness compared to the EM algorithm and the variational method.
Parameter estimation by variational annealing
In the deterministic annealing, optimization problem is solved while gradually changing temperature in a some schedule, and maximum likelihood estimator is obtained like the EM algorithm [12–14]. Yoshida and West proposed to use the deterministic annealing to find the configuration of the binary variables that maximizes the likelihood of factor models with sparseness priors [11]. We derive a new variational annealing method by extending Yoshida and West's approach to find the configuration of the binary variables on marginal likelihood function, which can be applied for searching E that maximizes Equation (1).
where τ is called temperature and the equality holds for τ → +0. Hereafter, the integral range of E is omitted if no confusion occurs. Let Q(E) be a normalized non-negative function, i.e., Q(E) ≥ 0 and ∫ Q(E)dE = 1.
Thus, as an approximation of E maximizing Equation (2), we try to find $\xca=arg\underset{E}{max}Q\left(E\right)$, where Q(E) is the function maximizing the lower bound in Equation (4). Since higher values on P(E) are weighed more in the approximation by Q(E) for τ < 1, the better approximation is expected for the higher values. This property on the limited case is shown in Proposition 1. As is in the variational method and the EM algorithm, the maximum of the lower bound in Equation (4) is searched by a hill climbing from high temperature τ > 1.
Gradually converging τ to 0, local optimum of the lower bound and corresponding Q(E), Q(X), and Q(Θ) are obtained.
Effectiveness of variational annealing
As alternatives of the variational annealing, we may consider the variational method and the EM algorithm where X is the set of hidden variables and E is handled as the set of parameters to be maximized. We show the effectiveness of variational annealing compared to the variational method and the EM algorithm under the following conditions: P(X, Θ, E) is factorized into P(X, E)P(Θ, E), and P(E_{ i }|X, Θ, E\{E_{ i }}) is given as a binomial distribution, where E_{ i }is the i th element of E. If factorization of P(X, Θ, E) = Q(X)Q(Θ)Q(E) is assumed in the calculation of the variational method, arg max_{ E }Q(E) is not the optimal solution of Equation (2) in general. In the EM algorithm, by allowing E to move around a p-dimensional continuous space ${\left(0,1\right)}^{p},{\xca}_{\mathsf{\text{EM}}}=arg\underset{E\in {\left(0,1\right)}^{m}}{max}P\left(E\right)$ can be calculated. Let ${\xca}_{\mathsf{\text{EM}},i}$ be the i th element of ${\xca}_{\mathsf{\text{EM}}}$. Usually, ${\xca}_{\mathsf{\text{EM}}}\mathsf{\text{,}}i$ is mapped to 1 if ${\xca}_{\mathsf{\text{EM,}}\phantom{\rule{2.77695pt}{0ex}}\mathsf{\text{i}}}>0.5$ and 0 otherwise for discretizing ${\xca}_{\mathsf{\text{EM}}}$ to the p-dimensional binary space {0,1}^{ p }, but such a mapping is not guaranteed to provide $arg\underset{E\in {\left\{0,1\right\}}^{m}}{max}P\left(E\right)$. Although other mappings can be considered, to the best of our knowledge, no mapping is guaranteed to provide the optimal solution in polynomial time of p.
In the following, we prove a proposition in order to show that the variational annealing possibly give the optimal solution of Equation (2) even if the factorization of $Q\left(E\right)={\prod}_{i=1}^{p}Q\left({E}_{i}\right)$ is additionally considered.
Then, the set of E_{ i }∈ {0,1} maximizing ${\xca}_{\mathsf{\text{EM}}}$is $arg\underset{E\in {\left\{0,1\right\}}^{p}}{max}P\left(E\right)$.
For the proof of the proposition, see Section 1 in Additional file 1. From Proposition 1, if the factorization P(X, Θ, E) = P(Θ, E)P(X, E) is satisfied and optimal Q functions are found, the variational annealing is guaranteed to provide the optimal solution of Equation (2) while the variational method and the EM algorithm are not. Although the factorization is not a generally satisfied property, the factorization is often assumed in approaches based on the variational method, and the assumption usually works as good approximations. Thus, the variational annealing is expected to provide the better performance than the variational method and the EM algorithm even if the factorization is not satisfied exactly.
Procedures of variational annealing on proposed model
In the variational annealing on the proposed model, we calculate Q functions for hidden variables X, parameters Θ, and binary variables E iteratively while cooling temperature τ to zero gradually at each iteration cycle. In the following, we show the calculation procedures of Q(X), Q(Θ), and Q(E) on the proposed model as variational E-step, variational M-step, and variational A-step, respectively. More details of the procedures are given in Additional file 1. For the notational brevity, we denote the expectation of a value x with a probability distribution Q(y) as 〈x〉_{Q(y)}.
Variational E-step
Parameters of Q(X) are mean of x_{ t }, variance of x_{ t }, and cross time variance of x_{t-1}and x_{ t }. These parameters can be calculated via variational Kalman filter by using following terms expected with Q(Θ)Q(E): 〈E^{(c)}〉_{Q(Θ)Q(E)}, 〈A〉_{Q(Θ)Q(E)}, 〈H^{-1} A ○ E^{(c)}〉_{Q(Θ)Q(E)}, and 〈(A ○ E^{(c)})'H^{-1}A ○ E^{(c)}〉_{Q(Θ)Q(E)}. For the details of variational Kalman filter, see [4]. From the parameters of Q(X), expectations of the following terms with Q(X) required in other steps are calculated: ${\u3008{\mathit{x}}_{t}^{\left(c\right)}\u3009}_{Q\left(X\right)},{\u3008{\mathit{x}}_{t}^{\left(c\right)}{\left({\mathit{x}}_{t}^{\left(c\right)}\right)}^{\prime}\u3009}_{Q\left(X\right)}$, and ${\u3008{\mathit{x}}_{t+1}^{\left(c\right)}{\left({\mathit{x}}_{t}^{\left(c\right)}\right)}^{\prime}\u3009}_{Q\left(X\right)}$.
Variational M-step
Parameters for the above functions are calculated by using ${\u3008{\mathit{x}}_{t}^{\left(c\right)}\u3009}_{Q\left(X\right)},{\u3008{\mathit{x}}_{t}^{\left(c\right)}{\left({x}_{t}^{\left(c\right)}\right)}^{\prime}\u3009}_{Q\left(X\right)},{\u3008{\mathit{x}}_{t+1}^{\left(c\right)}{\left({\mathit{x}}_{t}^{\left(c\right)}\right)}^{\prime}\u3009}_{Q\left(X\right)},$, and ${\u3008{E}_{ij}^{\left(c\right)}\u3009}_{Q\left(E\right)}$.
Variational A-step
For the calculation of Q(E), we assume the factorization of Q(E) to ${\prod}_{c}{\prod}_{ij}Q\left({E}_{ij}^{\left(c\right)}\right)$ in order to make the computation tractable. $Q\left({E}_{ij}^{\left(c\right)}\right)$ follows a binomial distribution that takes one with probability ${e}_{ij}^{\left(c\right)}$ and zero with probability $1-{e}_{ij}^{\left(c\right)}$, and thus the expectation of ${E}_{ij}^{\left(c\right)}$ with Q(E) is given by ${e}_{ij}^{\left(c\right)}$. For the preparation, we calculate ${\u3008A\u3009}_{Q\left(\mathrm{\Theta}\right)},{\u3008{H}^{-1}A\u3009}_{Q\left(\mathrm{\Theta}\right)},{\u3008{A}^{\prime}{H}^{-1}A\u3009}_{Q\left(\mathrm{\Theta}\right)},{\u3008{\mathit{x}}_{t}^{\left(c\right)}\u3009}_{Q\left(X\right)},{\u3008{\mathit{x}}_{t}^{\left(c\right)}{\left({\mathit{x}}_{t}^{\left(c\right)}\right)}^{\prime}\u3009}_{Q\left(X\right)}$, and ${\u3008{\mathit{x}}_{t+1}^{\left(c\right)}{\left({\mathit{x}}_{t}^{\left(c\right)}\right)}^{\prime}\u3009}_{Q\left(X\right)}$. $Q\left({E}_{ij}^{\left(c\right)}\right)$ is then iteratively calculated by using these expected terms as well as ${\u3008{E}_{ik}^{\left(c\right)}\u3009}_{Q\left(E\right)}$ for k ≠ j. A few iterations are enough for the convergence.
Update and selection of hyperparameters
The proposed model contains u_{0}, k_{0}, v_{0}, l_{0}, ζ_{i 0}, ζ_{i 1}, α_{0}, and α_{1} as hyperparameters. α_{0} and α_{1} should be α_{0} <α_{1} as in the model setting, but this condition can be violated in the update step of the variational method. Thus, we select α_{0} and α_{1} by cross validation, and update other hyperparameters as in the variational method.
${\widehat{\zeta}}_{i0}$ and ${\widehat{\zeta}}_{i1}$ can also be obtained by the Newton-Raphson method.
For the selection of α_{0} and α_{1}, we set α_{1} to some large value and select α_{0} by a leave one out cross validation procedure. For condition c ∈ {1, 2} and time point $t\in {\mathcal{T}}_{obs}^{\left(c\right)}$, we remove ${\mathit{y}}_{t}^{\left(c\right)}$ from data set y and use the data set to train the model. We calculate square sum of residues ${r}_{t,c}^{2}$ between ${\mathit{y}}_{t}^{\left(c\right)}$ and the prediction of ${\mathit{x}}_{t}^{\left(c\right)}$ estimated from the variational Kalman filter on the trained model given by ${r}_{t,c}^{2}={\left({\mathit{y}}_{t}^{\left(c\right)}-{\mathit{x}}_{t}^{\left(c\right)}\right)}^{\prime}\left({\mathit{y}}_{t}^{\left(c\right)}-{\mathit{x}}_{t}^{\left(c\right)}\right)$. By grid search on parameter space of α_{0}, we select α_{0} that minimizes ${\sum}_{c=1}^{2}{\sum}_{t\in {\mathcal{T}}_{obs}^{\left(c\right)}}{r}_{t,c}^{2}$.
Summary of procedures
- 1.
Set τ to some large value. Also set α_{0} to a small value and α_{1} to a large value satisfying that α_{0} <α_{1}.
- 2.
Initialize other hyperparameters and hidden variables.
- 3.Perform the following procedures:
- (a)
Calculate variational M-step.
- (b)
Update hyperparameters.
- (c)
Calculate variational E-step.
- (d)
Calculate variational A-step.
- (e)
Go back to step (a) until some convergence criterion is satisfied.
- (a)
- 4.
Divide τ by some value > 1 such as 1.05.
- 5.
Go back to step 3 if τ is larger than some very small value > 0.
In our setting, α_{1} is set to 1,000. For the initialization of τ and other hyperparameters, we use the following settings: $\tau =2.5,{E}_{ij}^{\left(c\right)}=0.5,{u}_{i}=1,{k}_{i}=1,{v}_{i}=1,{l}_{i}=1,{\zeta}_{i0}=10,\text{and}{\zeta}_{i1}=10$. If ${\mathit{y}}_{t}^{\left(c\right)}$ is observed, we initialize ${\mathit{x}}_{t}^{\left(c\right)}$ with ${\mathit{y}}_{t}^{\left(c\right)}$. Otherwise, we use the linearly interpolated one.
Results and discussion
Performance evaluation by Monte Carlo experiments
For the evaluation of the proposed approach, we generate two linear regulatory network models with similar topological structures G_{1} and G_{2} based on a linear regulatory network model G_{0}. G_{0} is prepared in the following manner: (i) a scale free network of 100 nodes and 150 edges is generated; (ii) edge directions are assigned randomly; (iii) autoloop edges are added to root nodes of the directed network; and (iv) AR coefficients for the directed edges are chosen randomly from {-0.9, -0.8, -0.7, -0.6, -0.5, 0.5, 0.6, 0.7, 0.8, 0.9}. We then generate G_{1} and G_{2} from G_{0} as follows: (i) autoloop edges and 70% of non-autoloop edges in G_{0} are used for commonly existing edges in G_{1} and G_{2}; and (ii) the other 30% of non-autoloop edges are randomly assigned as either G_{1} or G_{2} specific edges. Note that AR coefficients on edges of G_{1} and G_{2} are preserved, i.e., if a regulation from gene j to gene i exists in G_{1} or G_{2}, then its coefficient is the same as that of the regulation from gene j to gene i in G_{0}.
From each of G_{1} and G_{2}, we obtain two equally spaced time series data of 25 time points and 50 time points. For system noise and observation noise, normally distributed values with mean 0 and standard deviation 1 and mean 0 and standard deviation 0.1 and 1 are used, respectively. The signal-noise ratio for system noise with standard deviation 1 and observation noise with standard deviation 0.1 is 0.03dB, and the signal is bit stronger than the noise. On the other hand, for observation noise with standard deviation 1, the signal-noise ratio is -0.26dB. In the condition, the noise is stronger than the signal, and the noise level is quite high.
Comparison between variational annealing and EM algorithm
Comparison of the variation annealing (Proposed) and EM algorithm (EM) based on the proposed model
(a) | ||||||
---|---|---|---|---|---|---|
# of time points | 50 | 25 | ||||
# TP | # FP | PRE | # TP | # FP | PRE | |
Proposed | 295.9 | 41.7 | 0.88 | 238.4 | 71.6 | 0.77 |
EM | 294.9 | 119.2 | 0.71 | 196.8 | 66.9 | 0.75 |
(b) | ||||||
# of time points | 50 | 25 | ||||
# TP | # FP | PRE | # TP | # FP | PRE | |
Proposed | 39.8 | 13.2 | 0.75 | 23.4 | 20.8 | 0.53 |
EM | 39.9 | 39.4 | 0.5 | 11.5 | 10.0 | 0.53 |
From the comparison, the results of the proposed approach contain more true positives than those of the EM algorithm based approach except for identifying changes on regulations for time points 50. For identifying changes on regulations, the EM algorithm based approach estimates bit more true positives than the proposed approach, but the difference is so small that it can be ignored. On the other hand, the results of the EM algorithm based approach contain more false positives than those of the proposed approach, and hence the precision of the results by the EM algorithm is worse than that of the proposed approach. Therefore, the effectiveness of the variational annealing is confirmed in the computational experiment as well.
Comparison between proposed approach and existing approaches
We employ the elastic net based VAR model approach [2] and the dynamic Bayesian network based approach termed G1DBN [8] as existing approaches for estimating networks from time series data. For the experiments, two versions of these approaches are considered: ENet1 and ENet2 from the elastic net based VAR model approach, and G1DBN1 and G1DBN2 from G1DBN. These approaches are different in the following point: ENet1 and G1DBN1 estimate G_{1} and G_{2} independently, i.e., for the estimation of G_{1}, only time series data from G_{1} is used, while ENet2 and G1DBN2 assume that G_{1} and G_{2} have the same network structure and estimate a network by using two time series data. Thus, ENet2 and G1DBN2 are considered to more use data sample than ENet1 and G1DBN1 for network estimation, but changes on regulations between G_{1} and G_{2} are not considered. For selection of hyperparameters in ENet1 and ENet2, AICc is used [2], and for hyperparameters α_{1} and α_{2} of G1DBN1 and G1DBN2, a setting of α_{1} = 0.1 and α_{2} = 0.0059 considered in [8] is used.
A summary of results for system noise with standard deviation 1 and observation noise with standard deviation 0.1
(a) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Equally spaced | Unequally spaced | |||||||||||
# of time points | 50 | 25 | 50 | 25 | ||||||||
# TP | # FP | PRE | # TP | # FP | PRE | # TP | # FP | PRE | # TP | # FP | PRE | |
Proposed | 295.9 | 41.7 | 0.88 | 238.4 | 71.6 | 0.77 | 262.4 | 42.1 | 0.86 | 110.7 | 37.2 | 0.75 |
ENet1 | 246.3 | 119.7 | 0.67 | 109.2 | 67 | 0.62 | 84.7 | 140.6 | 0.38 | 20.3 | 70.4 | 0.22 |
ENet2 | 277.9 | 130.9 | 0.68 | 212.8 | 130 | 0.62 | 169.7 | 241.5 | 0.41 | 65.5 | 132.5 | 0.33 |
G1DBN1 | 223.7 | 48 | 0.82 | 99.9 | 46.2 | 0.68 | 65.1 | 83.1 | 0.44 | 19.3 | 72.7 | 0.21 |
G1DBN2 | 268.8 | 83.4 | 0.76 | 188.1 | 64.5 | 0.74 | 134.8 | 104.4 | 0.56 | 46.7 | 85.7 | 0.35 |
(b) | ||||||||||||
Equally spaced | Unequally spaced | |||||||||||
# of time points | 50 | 25 | 50 | 25 | ||||||||
# TP | # FP | PRE | # TP | # FP | PRE | # TP | # FP | PRE | # TP | # FP | PRE | |
Proposed | 39.8 | 13.2 | 0.75 | 23.4 | 20.8 | 0.53 | 31.2 | 16.5 | 0.65 | 5.5 | 15.9 | 0.26 |
ENet1 | 38.5 | 153.1 | 0.2 | 18.6 | 113 | 0.14 | 12.3 | 186.6 | 0.06 | 3.7 | 85 | 0.04 |
ENet2 | - | - | - | - | - | - | - | - | - | - | - | - |
G1DBN1 | 35.6 | 88.9 | 0.29 | 16.8 | 91.9 | 0.15 | 10.3 | 121.9 | 0.08 | 2.4 | 87.8 | 0.03 |
G1DBN2 | - | - | - | - | - | - | - | - | - | - | - | - |
are also provided. The results are averaged on ten data sets. The number of regulations in the true network models of G_{1} and G_{2} are in total 305. For the latter point, we consider the estimated regulations existing only in one of two estimated networks as changed regulations, and check if they correctly exist only in the corresponding true network. The numbers of true positives and false negatives of the estimated changes on regulations and the precisions are summarized in Table 2(b). The results are also averaged on ten data sets. The number of true changes on regulations between in G_{1} and G_{2} are 47, i.e., the number of true positives on this case is at most 47.
For the estimation of the regulations in Table 2(a), the proposed approach outperforms other approaches in terms of true positives. The proposed approach contains more false positives than ENet1, G1DBN1, and G1DBN2 on the data of 25 time points. We further consider these cases in terms of the precision. The precisions of the proposed approach, ENet1, G1DBN1, and G1DBN2 are given by 0.77, 0.62, 0.68, and 0.74, respectively. From this analysis, the proposed approach shows better performance than ENet2, G1DBN1, and G1DBN2 on the whole. More true positives are estimated by ENet2 than ENet1, and the precisions of ENet2 tend to be better than those of ENet1. This type of relationship is also observed between G1DBN1 and G1DBN2. Also, the elastic net based VAR model approaches estimate more true positives than the approaches from G1DBN. However, in the precision, the approaches from G1DBN are better than the elastic net based VAR model approaches. From the results in Table 2(b), we see that the proposed approach estimates more true changes on regulations than ENet1 and G1DBN1 on both data of 25 and 50 time points. In addition, the results of the proposed approach contain less false positives than those of ENet1 and G1DBN1. No changes on regulations are detected in ENet2 and G1DBN2 as topological differences are ignored in these approaches. Since the proposed approach considers differences on network structures as well as uses two time series data efficiently, it can provide better results than other approaches on both estimating regulations and identifying changes on regulations.
One may think it is strange that false positives in ENet1 and G1DBN1 in Table 2(b) is more than those in Table 2(a), but this case can occur from the following reason. If a regulation exists in both of the true network models of G_{1} and G_{2}, but is estimated only for G_{1}, then the case is not counted as a false positive in Table 2(a) while it is counted as a false positive in Table 2(b). Thus, the number of false positives in Table 2(b) can be greater than those in Table 2(a).
A summary of results for system noise with standard deviation 1 and observation noise with standard deviation 1
(a) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Equally spaced | Unequally spaced | |||||||||||
# of time points | 50 | 25 | 50 | 25 | ||||||||
# TP | # FP | PRE | # TP | # FP | PRE | # TP | # FP | PRE | # TP | # FP | PRE | |
Proposed | 190.2 | 122.8 | 0.61 | 88.1 | 121.0 | 0.42 | 132.1 | 675.5 | 0.66 | 52.1 | 108.9 | 0.33 |
ENet1 | 110.8 | 136.9 | 0.45 | 30.3 | 75.9 | 0.29 | 32.5 | 133.7 | 0.2 | 7.4 | 75.2 | 0.09 |
ENet2 | 189.8 | 218 | 0.47 | 85.8 | 136.2 | 0.39 | 75.9 | 180.7 | 0.3 | 23.5 | 123.3 | 0.16 |
GIDBN1 | 86.6 | 82.6 | 0.51 | 22.6 | 90.8 | 0.2 | 26.3 | 74 | 0.26 | 7.1 | 71.6 | 0.09 |
GIDBN2 | 163.9 | 105.7 | 0.61 | 54.4 | 99 | 0.35 | 66.2 | 91.2 | 0.42 | 17.4 | 92.8 | 0.16 |
(b) | ||||||||||||
Equally spaced | Unequally spaced | |||||||||||
# of time points | 50 | 25 | 50 | 25 | ||||||||
# TP | # FP | PRE | # TP | # FP | PRE | # TP | # FP | PRE | # TP | # FP | PRE | |
Proposed | 15.4 | 43.6 | 0.26 | 4.7 | 50.0 | 0.09 | 8.1 | 16.7 | 0.33 | 3.1 | 42.5 | 0.07 |
ENet1 | 16.9 | 184 | 0.08 | 3.8 | 95.4 | 0.04 | 5.2 | 155 | 0.03 | 1.2 | 81.4 | 0.01 |
ENet2 | - | - | - | - | - | - | - | - | - | - | - | - |
GIDBN1 | 14.5 | 125.1 | 0.1 | 3.9 | 105.7 | 0.04 | 4 | 91.9 | 0.04 | 1.7 | 76.8 | 0.02 |
GIDBN2 | - | - | - | - | - | - | - | - | - | - | - | - |
Analysis of time series microarray data from Human small airway epithelial cells
We apply the proposed approach to two time series microarray gene expression data from normal Human small airway epithelial cells (SAECs) and SAECs treated by stimulating EGF-receptors and dosing an anticancer drug termed Gefitinib. EGF-receptors are often overexpressed in lung cancer cells such as tumoral SAECs, and a lung cancer condition is simulated in the treated SAECs by stimulating EGF-receptors. Since Gefitinib is known as a selective inhibiter of EGF-receptors, the stimulation of EGF-receptors would be counteracted by Gefitinib, and the condition of treated SACEs should be the same as that of normal SAECs in theory. However, since some gene expression patterns are different between the two conditions in practice, some unknown effects by Gefitinib may be involved in the phenomenon. Thus, we focus on changed regulations between the gene networks estimated from gene expression data in these two conditions in order to find some insights on the unknown effects of Gefitinib.
For gene set selection, we first screen 500 genes from the ranking of the gene list sorted by coefficient variation [5]. We then select 100 genes with highly varied expression profiles between normal SAECs and treated SAECs. The time series gene expression data from both types of cells are comprised of spaced 14 time points in 48 hours. The time schedule of 14 time points are {0 h, 6 h, 9 h, 12 h, 15 h, 18 h, 21 h, 24 h, 27 h, 30 h, 33 h, 36 h, 39 h, 48 h}. For the analysis on the proposed approach, we set interval on system model to three hours.
Changes on regulations between normal and treated SAECs
Normal SAECs | Treated SAECs |
---|---|
ZC3HAV1L → FOXA2 | Prss22 → foxn2 |
LIF → foxn2 | Prss22 → cdk14 |
Cdc42ep2 → Spink6 | Prss22 → Camk2n1 |
Siglec15 → NTN1 | Prss22 → cttn |
HAS3 → HAS3 | Prss22 → Sfrs6 |
HAS3 → Enc1 | Prss22 → ITGA2 |
HAS3 → LEPREL1 | Prss22 → pkn2 |
Prss22 → Hs2st1 | |
Prss22 → FILIP1L | |
Prss22 → Hcn2 | |
Prss22 → KLF16 | |
Ktelc1 → NTN1 | |
Tm6sf1 → Siglec15 |
From Table 4, we see that Prss22 is involved in most of the regulations only in treated SAECs. Prss22 is a tryptase, one of serine proteases, and its relationship with the airways is suggested by a report about its expression in the airways in a developmentally regulated manner. Tryptase is a potent mitogen of fibroblast [15], and it is reported that the increase and activation of fibroblast are promoted in lung cells under the condition of interstitial pneumonia. Interstitial pneumonia is known as a side effect of Gefitinib, and these findings suggest that Prss22 is an off-target of Gefitinib and is possibly related to the side effect of Gefitinib. Also, FILIP1L, a gene estimated as one of targets of Prss22 in treated SAECs, is reported as an inhibitor of cell proliferation of fibroblast [16]. This observation supports the relation between Prss22 and fibroblast as well.
We also focus on other several genes related to changes on regulations in normal and treated SAECs. LIF, leukemia inhibitory factor, is known to affect cell growth and development. Gefitinib is also known to be effective for acute myelogenous leukemia via Sky, which is an off-target gene of Gefitinib [17]. In addition, Wang et al. reported that LIF prolongs the cell cycle of stem cells on acute myelogenous leukemia lines [18]. Although, to the best of our knowledge, no direct influence from Gefitinib to LIF is reported, the above facts suggest some relation between Gefitinib and LIF, and support changes on regulations of LIF between normal and treated SAECs.
Heikema et al. reported that Human Siglecs, Siglecs-14, -15, and -16 interact with transmembrain adaptor proteins containing the immunoreceptor tyrosine-based activation motif such as DAP12 [19], and therefore they potentially mediate the activation of intracellular signaling. Gefitinib is a selective inhibitor of tyrosine kinase, and the inhibition of tyrosine kinase is considered to affect the regulations around Siglec-15.
Although the stimulation of EGF-receptors in the treated SAECs is considered to be counteracted by Gefitinib, the expressions of some genes may be affected by the stimulation in practical conditions. HAS3 is related to synthesis of the unbranched glycosaminoglycan hyaluronic acid and is reported to be up-regulated by EGF [20]. The stimulation of EGF-receptors affects the amount of EGF taken into cells, and hence the stimulation is expected to cause the changes on the regulations around HAS3. Foxn2 is a member of family of Fox proteins. Fox proteins are known to play important roles on controlling the expressions of genes related to cell growth, proliferation, and differentiation. Some members of Fox family are related to EFG-receptors, e.g., the expression of Foxn1 is suppressed by EGF-receptor signaling [21] although no direct relation between EGF-receptors and Foxn2 is found. FOXA2 is also a member of family of Fox proteins. EGF-receptor signaling is known to decrease the expression of FOXA, which prevents the mucus production [22]. [23] reported the relation between EGF-receptors and FOXA2 in the airways of asthmatic patients. Thus, it appears that the change on the regulation related to FOXA2 is caused by the stimulation of EGF-receptors.
Conclusions
We proposed the new computational model that is based on VAR-SSM and estimates gene networks from time series data on normal and treated conditions as well as identifies changes regulations by the treatment. Unlike many of existing gene network estimation approaches assuming equally spaced time points, our approach can handle unequally spaced time series data. The efficient use of time series data is achieved by representing the presence of regulations on each condition with hidden binary variables. Since finding the optimal configuration of the hidden binary variables on the proposed model is computationally in tractable, we derive the extended variational annealing method in order to address the problem as the alternative method.
In the Monte Carlo experiments, we use equally and spaced time series data from synthetically generated two regulatory networks whose structures are different in several regulations, and verified the effectiveness of the proposed model in both estimation of regulations and changes on regulations between the two conditions, compared to existing methods.
As the real data application, we use the proposed approach to analyze two time series data from normal SAECs and SAECs treated by stimulating EGF-receptors and dosing Gefitinib. From genes related to changes on regulations by the treatment, we find possible off-target genes of Gefitinib, and one of these genes is suggested to be related to a factor of interstitial pneumonia, which is known as a side effect of Gefitinib. In this study, we consider changes on regulations in two conditions, but the proposed approach can be extended to identifying changes among more than two conditions.
Declarations
Acknowledgements
We thank the anonymous reviewers for their constructive comments and suggestions, which improved the quality of this publication.
This article has been published as part of BMC Genomics Volume 13 Supplement 1, 2012: Selected articles from the Tenth Asia Pacific Bioinformatics Conference (APBC 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/13?issue=S1.
Authors’ Affiliations
References
- Opgen-Rhein R, Strimmer K: From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC Syst Biol. 2007, 1: 37-10.1186/1752-0509-1-37.PubMed CentralView ArticlePubMed
- Shimamura T, Imoto S, Yamaguchi R, Fujita A, Nagasaki M, Miyano S: Recursive regularization for inferring gene networks from time-course gene expression profiles. BMC Syst Biol. 2009, 3: 41-10.1186/1752-0509-3-41.PubMed CentralView ArticlePubMed
- Fujita A, Sato JR, Kojima K, Gomes LR, Nagasaki M, Sogayar MC, Miyano S: Identification of Granger causality between gene sets. Journal of Bioinformatics and Computational Biology. 2010, 8: 679-701. 10.1142/S0219720010004860.View Article
- Beal MJ: Variational algorithms for approximate Bayesian inference. PhD thesis. 2003, University College London, Gatsby Computational Neuroscience Unit
- Yamaguchi R, Imoto S, Yamauchi M, Nagasaki M, Shimamura T, Hatanaka Y, Ueno K, Higuchi T, Gotoh N, Miyano S: Predicting differences in gene regulatory systems by state space models. Genome Inform. 2008, 21: 101-113.PubMed
- Shiraishi Y, Kimura S, Okada M: Inferring cluster-based networks from differently stimulated multiple time-course gene expression data. Bioinformatics. 2010, 26 (8): 1073-1081. 10.1093/bioinformatics/btq094.PubMed CentralView ArticlePubMed
- Perrin BE, Ralaivola L, Mazurie A, Bottani S, Mallet J, d'Alché Buc F: Gene networks inference using dynamic Bayesian networks. Bioinformatics. 2003, 19 (Suppl 2): ii138-ii148. 10.1093/bioinformatics/btg1071.View ArticlePubMed
- Lèbre S: Inferring dynamic genetic networks with low order independencies. Stat Appl Genet Mol Biol. 2009, 8: Article 9-PubMed
- Bar-Joseph Z, Gerber G, Simon I, Gifford DK, Jaakkola TS: Comparing the continuous representation of time-series expression profiles to identify differentially expressed genes. Proc Natl Acad Sci U S A. 2003, 100 (18): 10146-10151. 10.1073/pnas.1732547100.PubMed CentralView ArticlePubMed
- Kojima K, Yamaguchi R, Imoto S, Yamauchi M, Nagasaki M, Shimamura T, Ueno K, Higuchi T, Gotoh N, Miyano S: A state space representation of VAR models with sparse learning for dynamic gene networks. Genome Inform. 2010, 22: 56-68.PubMed
- Yoshida R, West M: Bayesian learning in sparse graphical factor models via variational mean-field annealing. J Mach Learn Res. 2010, 11: 1771-1798.PubMed CentralPubMed
- Rose K: Deterministic annealing for clustering, compression, classication, regression, and related optimization problems. Proceedings of the IEEE. 1998, 86 (11): 2210-2239. 10.1109/5.726788.View Article
- Ueda N: Deterministic annealing EM algorithm. Neural Networks. 1998, 11 (2): 271-282. 10.1016/S0893-6080(97)00133-0.View ArticlePubMed
- Itaya Y, Zen H, Nankaku Y, Miyajima C, Tokuda K, Kitamura T: Deterministic annealing EM algorithm in parameter estimation for acoustic model. The 8th International Conference on Spoken Language Processing. 2004, 433-436.
- Ruoss SJ, Hartmann T, Caughey GH: Mast cell tryptase is a mitogen for cultured fibroblasts. J Clin Invest. 1991, 88 (2): 493-499. 10.1172/JCI115330.PubMed CentralView ArticlePubMed
- Kwon M, Hanna E, Lorang D, Quick JS, He M, Adem A, Stevenson C, Chung J, Hewitt SM, Zudaire E, Esposito D, Cuttitta F, Libutti SK: Functional characterization of filamin A interacting protein 1-like, a novel candidate for antivascular cancer therapy. Cancer Res. 2008, 68 (18): 7332-7341. 10.1158/0008-5472.CAN-08-1087.PubMed CentralView ArticlePubMed
- Downing JR: Can treating the SYK cell cure leukemia?. Cancer Cell. 2009, 16 (4): 270-271. 10.1016/j.ccr.2009.09.020.View ArticlePubMed
- Wang C, Lishner M, Minden MD, McCulloch EA: The effects of leukemia inhibitory factor (LIF) on the blast stem cells of acute myeloblastic leukemia. Leukemia. 1990, 4 (8): 548-552.PubMed
- Heikema AP, Bergman MP, Crocker PR, Gilbert M, Samsom JN, van Wamel WJ, Endtz HP, van Belkum A: Characterization of the specific interaction between sialoadhesin and sialylated Campylobacter jejuni lipooligosaccharides. Infect Immun. 2010, 78 (7): 3237-3246. 10.1128/IAI.01273-09.PubMed CentralView ArticlePubMed
- Pasonen-Seppänen S, Karvinen S, Törrönen K, Hyttinen JM, Jokela T, Lammi MJ, Tammi MI, Tammi R: EGF upregulates, whereas TGF-beta downregulates, the hyaluronan synthases Has2 and Has3 in organotypic keratinocyte cultures: correlations with epidermal proliferation and differentiation. J Invest Dermatol. 2003, 120 (6): 1038-1044. 10.1046/j.1523-1747.2003.12249.x.View ArticlePubMed
- Mandinova A, Kolev V, Neel V, Hu B, Stonely W, Lieb J, Wu X, Colli C, Han R, Pazin MJ, Ostano P, Dummer R, Brissette JL, Dotto GP: A positive FGFR3/FOXN1 feedback loop underlies benign skin keratosis versus squamous cell carcinoma formation in humans. J Clin Invest. 2009, 119 (10): 3127-3137. 10.1172/JCI38543.PubMed CentralView ArticlePubMed
- Zhen G, Park SW, Nguyenvu LT, Rodriguez MW, Barbeau R, Paquet AC, Erle DJ: IL-13 and epidermal growth factor receptor have critical but distinct roles in epithelial cell mucin production. Am J Respir Cell Mol Biol. 2007, 36 (2): 244-253.PubMed CentralView ArticlePubMed
- Lai HY, Rogers DF: Mucus hypersecretion in asthma: intracellular signaling pathways as targets for pharmacotherapy. Curr Opin Allergy Clin Immunol. 2010, 10: 67-76. 10.1097/ACI.0b013e328334643a.View ArticlePubMed
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.