Vector autoregressive model and its state space representation
Vector autoregressive model
Given gene expression profile vectors of p genes during T time points {y_1, ..., y_T}, the first order vector autoregressive (VAR(1)) model at time point t is given by

y_t = A y_{t-1} + ε_t,

where A is a p × p autoregressive coefficient matrix and ε_t is the observation noise at time t, following a normal distribution with mean 0. The (i, j)th element of A, A_ij, indicates a temporal regulation from the jth gene to the ith gene; if A_ij ≠ 0, a regulation from the jth gene to the ith gene is considered to exist. By examining whether A_ij is zero or not for all i and j, a gene network is constructed. Since equally spaced time points are assumed in the VAR model, it has difficulty handling unequally spaced time series data.
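As a minimal illustration of the VAR(1) model (a Python sketch; the sizes, sparsity, and noise scale are illustrative and not taken from the paper):

```python
import numpy as np

# Simulate y_t = A y_{t-1} + eps_t for p genes over T time points.
rng = np.random.default_rng(0)
p, T = 5, 50
A = np.where(rng.random((p, p)) < 0.2, rng.normal(0.0, 0.5, (p, p)), 0.0)

y = np.zeros((T, p))
y[0] = rng.normal(size=p)
for t in range(1, T):
    y[t] = A @ y[t - 1] + rng.normal(scale=0.1, size=p)

# The gene network is read off from the nonzero pattern of A:
# A[i, j] != 0 means a temporal regulation from gene j to gene i.
edges = [(j, i) for i in range(p) for j in range(p) if A[i, j] != 0.0]
print(len(edges), "regulatory edges (j -> i)")
```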
State space representation of VAR model (VAR-SSM)
Consider the set of all T equally spaced time points and the set of time points at which gene expressions are actually observed; the latter is a subset of the former. VAR-SSM is comprised of two models: a system model and an observation model. Let x_t be the hidden variable vector representing the true gene expression at time t. The system model is given as the VAR model of x_t:

x_t = A x_{t-1} + η_t,

where η_t is the system noise, normally distributed with mean 0 and variance H = diag[h_1, ..., h_p]. The observation model represents the measurement error between the observed gene expression y_t and the true gene expression x_t at each observed time point:

y_t = x_t + ρ_t,

where ρ_t is the observation noise, normally distributed with mean 0 and variance R = diag[r_1, ..., r_p]. Unequally spaced time series data are handled by ignoring the observation model at non-observed time points.
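The handling of unobserved time points can be illustrated with a standard (non-variational) Kalman filter that simply skips the observation update when y_t is missing; this is a sketch under the diagonal-noise assumptions above, not the variational Kalman filter used later for the proposed model:

```python
import numpy as np

def kalman_filter_var_ssm(y, observed, A, h, r):
    """Standard Kalman filter for the VAR-SSM
        x_t = A x_{t-1} + eta_t,   eta_t ~ N(0, diag(h))
        y_t = x_t + rho_t,         rho_t ~ N(0, diag(r)),
    skipping the observation update at non-observed time points.
    y: (T, p) array (rows at unobserved t are ignored);
    observed: length-T boolean array; h, r: length-p diagonal variances."""
    T, p = y.shape
    m, P = np.zeros(p), np.eye(p)          # initial state (illustrative)
    H, R = np.diag(h), np.diag(r)
    filtered = np.zeros((T, p))
    for t in range(T):
        m, P = A @ m, A @ P @ A.T + H      # prediction via the system model
        if observed[t]:                    # observation model used only here
            S = P + R
            K = np.linalg.solve(S.T, P.T).T    # Kalman gain K = P S^{-1}
            m = m + K @ (y[t] - m)
            P = P - K @ P
        filtered[t] = m
    return filtered
```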
Joint model of VAR-SSM for two time series data
Let {y_t^(c)} be the time series gene expression data on cell condition c, observed at a set of time points that is a subset of the equally spaced time points 1, ..., T^(c). Given time series data on two types of cell conditions, c = 1 and 2, we propose a new VAR-SSM model to estimate gene networks in the two conditions as well as to identify changes in regulations between them. The model is comprised of the following two equations:

x_t^(c) = (A ∘ E^(c)) x_{t-1}^(c) + η_t^(c),
y_t^(c) = x_t^(c) + ρ_t^(c),

where ∘ denotes the Hadamard product, E^(c) is a p × p binary matrix, and η_t^(c) and ρ_t^(c) are respectively the normally distributed system and observation noises with mean 0. In this model, the (i, j)th element E_ij^(c) takes one if a regulation from gene j to gene i exists in condition c and zero otherwise; that is, the presence of regulations is controlled by E^(c), while the AR coefficient matrix A is shared between conditions 1 and 2. A change in regulation is identified when a regulation exists in only one of the conditions.
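As an illustration of the Hadamard-masked system equation (a Python sketch; the sizes, edge density, and noise scale are illustrative and not taken from the paper), the following simulates the two conditions with a shared A and condition-specific binary masks E^(c):

```python
import numpy as np

# Sketch of x_t^(c) = (A o E^(c)) x_{t-1}^(c) + eta_t^(c) for c = 1, 2.
rng = np.random.default_rng(1)
p, T = 4, 30
A = rng.normal(0.0, 0.4, (p, p))                       # shared AR coefficients
E = {c: (rng.random((p, p)) < 0.5).astype(int) for c in (1, 2)}

x = {c: np.zeros((T, p)) for c in (1, 2)}
for c in (1, 2):
    for t in range(1, T):
        x[c][t] = (A * E[c]) @ x[c][t - 1] + rng.normal(scale=0.1, size=p)

# Regulation changes are the entries where the binary masks differ.
changed = np.argwhere(E[1] != E[2])                    # rows are (i, j) pairs
print("changed regulations (i, j):", changed.tolist())
```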
The complete likelihood of our model, P(Y, X, Θ, E), where Y, X, Θ, and E are respectively the sets of observed expressions y_t^(c), hidden variables x_t^(c), parameters, and binary matrices E^(c), is given by the following equation:
where the prior distribution P(Θ, E) is assumed to be factorized as
Here, z_ij is a parameter of a potential function of E_ij^(1) and E_ij^(2), defined later. The prior distribution of A_ij is controlled by α0 and α1, parameters governing the shrinkage of the coefficients A, through F_ij, a binary variable that takes 1 if E_ij^(1) or E_ij^(2) takes 1 and 0 otherwise. α1 is set to a large value, while α0 is set smaller than α1. From the design of the prior for A_ij, if E_ij^(1) or E_ij^(2) takes one, i.e., there exists a regulation from gene j to gene i in condition 1 or 2, the weaker prior is selected and the shrinkage of the coefficient is avoided. Otherwise, the stronger prior is selected and the sparsity of the network structure is promoted. The prior distributions of h_i and r_i are given by inverse gamma distributions, where u0 and v0 are the shape parameters and k0 and l0 are the inverse scale parameters. Under the assumption that only a small number of regulations change between the two conditions, we design the prior distributions of E_ij^(1) and E_ij^(2) by using the following potential function between E_ij^(c) for c = 1 and 2:
In this setting, if z_ij is small, E_ij^(1) and E_ij^(2) tend to take the same value, and thus most regulations exist in both conditions. We also introduce a prior distribution for z_ij given by a beta distribution with parameters ζ_i0 and ζ_i1, where B(·) is the beta function. The prior distribution of E_ij^(1), E_ij^(2), and z_ij is then obtained by combining the potential function with this beta prior.
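The display equations for the potential function and the resulting prior are not reproduced above. One common construction that is consistent with the description, stated here only as an assumption about the form, treats z_ij as the probability that the two conditions disagree:

```latex
\psi\bigl(E^{(1)}_{ij}, E^{(2)}_{ij} \mid z_{ij}\bigr)
  \;=\; z_{ij}^{\,\mathbb{1}[E^{(1)}_{ij} \neq E^{(2)}_{ij}]}
        \,(1 - z_{ij})^{\,\mathbb{1}[E^{(1)}_{ij} = E^{(2)}_{ij}]},
\qquad
P(z_{ij}) \;=\; \frac{z_{ij}^{\,\zeta_{i0}-1}\,(1-z_{ij})^{\,\zeta_{i1}-1}}{B(\zeta_{i0}, \zeta_{i1})}.
```

Under such a form, a small z_ij favors E_ij^(1) = E_ij^(2), matching the behavior described above, and the beta prior on z_ij is conjugate to the Bernoulli-type potential.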
Figure 1 shows a graphical representation of the proposed model, where the dependencies of the parameters and variables are indicated. The hyperparameters are omitted, and the observed data y_t are represented by gray nodes. Information in the observed data of the two conditions is propagated mainly via the hidden variables x_t^(1) and x_t^(2). Due to this propagation, more accurate estimation is expected in the proposed model than in approaches that consider the data on the two conditions independently.
For the parameter estimation, we search the configuration of E maximizing the following marginal likelihood:
(1)
Finding the optimal configuration of E is computationally intractable, and heuristic approaches such as the EM algorithm and the variational method are used in practice. Here, we use variational annealing, an extension of deterministic annealing to discrete variables [11]. In the next section, we give a brief explanation of variational annealing and show its effectiveness compared to the EM algorithm and the variational method.
Parameter estimation by variational annealing
In deterministic annealing, an optimization problem is solved while gradually changing a temperature according to some schedule, and a maximum likelihood estimator is obtained as in the EM algorithm [12–14]. Yoshida and West proposed to use deterministic annealing to find the configuration of binary variables that maximizes the likelihood of factor models with sparseness priors [11]. We derive a new variational annealing method by extending Yoshida and West's approach to find the configuration of binary variables that maximizes a marginal likelihood function, which can be applied to search for the E maximizing Equation (1).
Let E, X, and Θ be p-dimensional binary variables, unobserved variables, and parameters, respectively, and consider searching for the E that maximizes the following marginal likelihood:
(2)
The maximum of the marginal likelihood on E is bounded by the following formula:
(3)
where τ is called the temperature and the equality holds for τ → +0. Hereafter, the integration range over E is omitted when no confusion arises. Let Q(E) be a normalized non-negative function, i.e., Q(E) ≥ 0 and ∫ Q(E)dE = 1.
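The display equation of (3) is not reproduced above; it is presumably the standard annealing (zero-temperature) bound, which in this notation would read:

```latex
\max_{E} P(E) \;\le\; \left( \int P(E)^{1/\tau}\, dE \right)^{\!\tau},
\qquad
\lim_{\tau \to +0} \left( \int P(E)^{1/\tau}\, dE \right)^{\!\tau} \;=\; \max_{E} P(E),
```

where P(E) = ∫∫ P(X, Θ, E) dX dΘ is the marginal likelihood of Equation (2) and the integral over the binary E denotes a sum.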
From the Gibbs inequality, the right side of Equation (3) is also bounded:
Here, Q(E) is considered as an approximation of the normalized P(E)^{1/τ}. Under the assumption that P(X, Θ, E) ∝ Q(X)Q(Θ), where Q(X) and Q(Θ) are normalized non-negative functions, we have the following inequality:
(4)
Thus, as an approximation of the E maximizing Equation (2), we try to find arg max_E Q(E), where Q(E) is the function maximizing the lower bound in Equation (4). Since higher values of P(E) are weighed more in the approximation by Q(E) for τ < 1, a better approximation is expected around the higher values. This property in the limiting case is shown in Proposition 1. As in the variational method and the EM algorithm, the maximum of the lower bound in Equation (4) is searched for by hill climbing, starting from a high temperature τ > 1.
In the hill climbing, Q(X), Q(Θ), and Q(E) are alternately updated from the following equations:
By gradually decreasing τ toward 0, a local optimum of the lower bound and the corresponding Q(E), Q(X), and Q(Θ) are obtained.
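The overall computation can be summarized by the following schematic Python loop; the update functions are placeholders for the model-specific Q updates derived below, and the schedule constants are illustrative:

```python
def variational_annealing(update_q_theta, update_q_x, update_q_e, init,
                          tau=10.0, cooling=1.05, tau_min=1e-3, n_inner=100):
    """Schematic variational annealing loop: alternately update Q(Theta),
    Q(X), and Q(E) at a fixed temperature tau (hill climbing), then cool
    tau and repeat. The update_* callables are placeholders for the
    model-specific Q updates; schedule constants are illustrative."""
    q_x, q_theta, q_e = init
    while tau > tau_min:
        for _ in range(n_inner):               # inner loop at fixed temperature
            q_theta = update_q_theta(q_x, q_e, tau)
            q_x = update_q_x(q_theta, q_e, tau)
            q_e = update_q_e(q_x, q_theta, tau)
            # in practice, stop early once the lower bound stops improving
        tau /= cooling                         # cooling schedule, e.g. tau / 1.05
    return q_x, q_theta, q_e
```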
Effectiveness of variational annealing
As alternatives to variational annealing, we may consider the variational method and the EM algorithm, where X is the set of hidden variables and E is handled as a set of parameters to be maximized. We show the effectiveness of variational annealing compared to the variational method and the EM algorithm under the following conditions: P(X, Θ, E) is factorized into P(X, E)P(Θ, E), and P(E_i | X, Θ, E\{E_i}) is given as a binomial distribution, where E_i is the ith element of E. If the factorization P(X, Θ, E) = Q(X)Q(Θ)Q(E) is assumed in the calculation of the variational method, arg max_E Q(E) is not the optimal solution of Equation (2) in general. In the EM algorithm, a continuous estimate of E can be calculated by allowing E to move around a p-dimensional continuous space. Usually, each element of this estimate is mapped to 1 if it exceeds a threshold and to 0 otherwise in order to discretize it to the p-dimensional binary space {0,1}^p, but such a mapping is not guaranteed to provide the optimal solution of Equation (2). Although other mappings can be considered, to the best of our knowledge, no mapping is guaranteed to provide the optimal solution in time polynomial in p.
In the following, we prove a proposition showing that variational annealing can give the optimal solution of Equation (2) even if the factorization of Q(E) into the product of its element-wise factors is additionally assumed.
Proposition 1. Suppose P(X, Θ, E) is factorized into P(Θ, E)P(X, E), and P(E_i | X, Θ, E\{E_i}) is given as a binomial distribution. Let Q(E_i) be the factor maximizing the lower bound of the variational annealing for τ → +0, given by
(5)
Then, the set of E_i ∈ {0,1} maximizing the product of these factors is the optimal solution of Equation (2).
For the proof of the proposition, see Section 1 in Additional file 1. From Proposition 1, if the factorization P(X, Θ, E) = P(Θ, E)P(X, E) is satisfied and the optimal Q functions are found, variational annealing is guaranteed to provide the optimal solution of Equation (2), while the variational method and the EM algorithm are not. Although this factorization is not a generally satisfied property, it is often assumed in approaches based on the variational method, and the assumption usually works as a good approximation. Thus, variational annealing is expected to provide better performance than the variational method and the EM algorithm even if the factorization is not satisfied exactly.
Procedures of variational annealing on proposed model
In the variational annealing on the proposed model, we calculate the Q functions for the hidden variables X, the parameters Θ, and the binary variables E iteratively while gradually cooling the temperature τ toward zero at each iteration cycle. In the following, we show the calculation procedures of Q(X), Q(Θ), and Q(E) on the proposed model as the variational E-step, variational M-step, and variational A-step, respectively. More details of the procedures are given in Additional file 1. For notational brevity, we denote the expectation of a value x with respect to a probability distribution Q(y) as 〈x〉_{Q(y)}.
Variational E-step
The parameters of Q(X) are the mean of x_t, the variance of x_t, and the cross-time covariance of x_{t-1} and x_t. These parameters can be calculated via the variational Kalman filter by using the following terms expected with respect to Q(Θ)Q(E): 〈E^(c)〉_{Q(Θ)Q(E)}, 〈A〉_{Q(Θ)Q(E)}, 〈H^{-1}(A ∘ E^(c))〉_{Q(Θ)Q(E)}, and 〈(A ∘ E^(c))' H^{-1}(A ∘ E^(c))〉_{Q(Θ)Q(E)}. For the details of the variational Kalman filter, see [4]. From the parameters of Q(X), the expectations with respect to Q(X) of the terms required in the other steps are calculated.
Variational M-step
Q(Θ) is factorized into ∏_i Q(A_i | h_i) Q(h_i) Q(r_i) ∏_j Q(z_ij), where A_i is the vector given by (A_i1, ..., A_ip)'. From the design of the proposed model, Q(A_i | h_i), Q(h_i), Q(r_i), and Q(z_ij) are given in the following forms:
The parameters of the above functions are calculated by using the expected terms computed in the variational E-step and the variational A-step.
Variational A-step
For the calculation of Q(E), we assume that Q(E) factorizes over the elements E_ij^(c) in order to make the computation tractable. Each factor Q(E_ij^(c)) follows a binomial distribution that takes one with some probability and zero with the complementary probability, and thus the expectation of E_ij^(c) with respect to Q(E) is given by that probability. In preparation, we calculate the required expected terms with respect to Q(X) and Q(Θ). The probability of each E_ij^(c) is then iteratively calculated by using these expected terms as well as the expectations of E_ik^(c) for k ≠ j. A few iterations are enough for convergence.
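As a generic illustration of this step (an assumption about the form: the annealed update takes Q(E) ∝ exp(〈log P〉/τ), so each Bernoulli probability is a tempered sigmoid of an expected log-joint gap; the model-specific statistics are abstracted into a placeholder callable):

```python
import numpy as np

def variational_a_step(delta_log_joint, q_init, tau, n_sweeps=5):
    """Generic annealed mean-field update for factorized Bernoulli factors
    Q(E_ij^(c)). delta_log_joint(q, c, i, j) is a placeholder for the
    model-specific expected log-joint difference between E_ij^(c)=1 and
    E_ij^(c)=0 given the current probabilities q of the other elements."""
    q = {c: q_init[c].copy() for c in q_init}
    for _ in range(n_sweeps):                  # a few sweeps usually suffice
        for c in q:
            p = q[c].shape[0]
            for i in range(p):
                for j in range(p):
                    gap = delta_log_joint(q, c, i, j)
                    q[c][i, j] = 1.0 / (1.0 + np.exp(-gap / tau))  # tempered sigmoid
    return q
```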
Update and selection of hyperparameters
The proposed model contains u0, k0, v0, l0, ζ_i0, ζ_i1, α0, and α1 as hyperparameters. α0 and α1 should satisfy α0 < α1 as in the model setting, but this condition can be violated in the update step of the variational method. Thus, we select α0 and α1 by cross validation and update the other hyperparameters as in the variational method.
We first consider updating the hyperparameters u0, k0, v0, l0, ζ_i0, and ζ_i1 so as to increase the lower bound of the marginal probability. u0 and k0 are updated by maximizing the following equation:
The updated values are obtained by numerical optimization methods such as the Newton-Raphson method. v0 and l0 are updated in a similar manner to u0 and k0. ζ_i0 and ζ_i1 are updated by solving the following equation:
The updated values can also be obtained by the Newton-Raphson method.
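The objective being maximized is not reproduced above. As an illustration of the kind of update involved, assuming the relevant term is the expected inverse gamma log density Σ_i 〈log IG(h_i; u0, k0)〉 with 〈1/h_i〉 and 〈log h_i〉 taken from Q(h_i), a Newton-Raphson update might look like the following sketch:

```python
import numpy as np
from scipy.special import digamma, polygamma

def update_ig_hyperparams(e_inv_h, e_log_h, u=1.0, n_iter=20):
    """Newton-Raphson sketch for the shape u0 and inverse scale k0 of an
    inverse gamma prior, assuming the objective is sum_i <log IG(h_i; u0, k0)>.
    e_inv_h, e_log_h: arrays of <1/h_i> and <log h_i> over i = 1, ..., p."""
    p = len(e_inv_h)
    for _ in range(n_iter):
        k = p * u / np.sum(e_inv_h)                        # stationary point in k0
        grad = np.log(k) - digamma(u) - np.mean(e_log_h)   # gradient in u0 (per gene)
        hess = -polygamma(1, u)                            # second derivative (trigamma)
        u = float(max(u - grad / hess, 1e-6))              # Newton step, keep u0 > 0
    return u, k
```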
For the selection of α0 and α1, we set α1 to some large value and select α0 by a leave-one-out cross validation procedure. For each condition c ∈ {1, 2} and each observed time point t, we remove y_t^(c) from the data set and use the remaining data to train the model. We then calculate the squared sum of residuals between y_t^(c) and its prediction estimated by the variational Kalman filter on the trained model. By a grid search over the parameter space of α0, we select the α0 that minimizes the total squared error.
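A sketch of the grid search is given below; train_and_predict is a placeholder for fitting the proposed model with one observation held out and predicting it via the variational Kalman filter, and the candidate grid is supplied by the user:

```python
import numpy as np

def select_alpha0(data, observed, alpha0_grid, alpha1, train_and_predict):
    """Leave-one-out grid search sketch for alpha0. For each candidate,
    hold out each observed y_t^(c), train on the rest, and accumulate the
    squared prediction error; return the candidate with the smallest error."""
    best_alpha0, best_err = None, np.inf
    for alpha0 in alpha0_grid:
        err = 0.0
        for c in data:                                # conditions 1 and 2
            for t in observed[c]:                     # observed time points
                y_true = data[c][t]
                y_pred = train_and_predict(data, observed, hold_out=(c, t),
                                           alpha0=alpha0, alpha1=alpha1)
                err += np.sum((y_true - y_pred) ** 2)
        if err < best_err:
            best_alpha0, best_err = alpha0, err
    return best_alpha0
```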
Summary of procedures
The procedures for estimating parameters in the proposed model are summarized as follows:
1. Set τ to some large value. Also set α0 to a small value and α1 to a large value satisfying α0 < α1.
2. Initialize the other hyperparameters and the hidden variables.
3. Perform the following procedures:
(a) Calculate the variational M-step.
(b) Update the hyperparameters.
(c) Calculate the variational E-step.
(d) Calculate the variational A-step.
(e) Go back to step (a) until some convergence criterion is satisfied.
4. Divide τ by some value > 1, such as 1.05.
5. Go back to step 3 if τ is larger than some very small value > 0.
In our setting, α1 is set to 1,000. For the initialization of τ and the other hyperparameters, we use fixed initial settings. If y_t^(c) is observed, we initialize x_t^(c) with y_t^(c); otherwise, we use a linearly interpolated value.
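The stated initialization of the hidden variables can be sketched as follows, assuming the observed time points are given in increasing order; np.interp performs the linear interpolation at non-observed time points:

```python
import numpy as np

def initialize_hidden_states(y, observed_times, T):
    """Initialize x_t with y_t at observed time points and with linear
    interpolation at non-observed time points (a sketch of the stated
    initialization). y has one row per observed time point, in order;
    observed_times are the corresponding 1-based, increasing time indices."""
    p = y.shape[1]
    x0 = np.empty((T, p))
    t_obs = np.asarray(observed_times, dtype=float)
    for gene in range(p):
        # np.interp interpolates linearly and holds the edge values outside
        # the observed range.
        x0[:, gene] = np.interp(np.arange(1, T + 1), t_obs, y[:, gene])
    return x0
```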