We now present three algorithms for learning various kinds of generative models from MD data.

**Input** The input to all three algorithms is a time-series of vectors \mathbf{D}=\left({\overrightarrow{\mathbf{d}}}_{1},\ldots,{\overrightarrow{\mathbf{d}}}_{t}\right), where {\overrightarrow{\mathbf{d}}}_{i} is an *n* × 1 vector of covariates (e.g., positional and/or angular deviations) and *t* is the number of snapshots in the MD trajectory.

### Algorithm 1

**Output** The first algorithm produces a Gaussian Graphical Model \mathcal{M}=\left(\overrightarrow{h},{\Sigma}^{-1}\right). The first step is to compute the sample mean \overrightarrow{\mu}=\frac{1}{t}{\sum}_{i=1}^{t}{\overrightarrow{\mathbf{d}}}_{i}. Then it computes the regularized precision matrix {\Sigma}^{-1} (see below). Finally, \overrightarrow{h} is computed as \overrightarrow{h}={\Sigma}^{-1}\overrightarrow{\mu}.
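The outer steps of the algorithm are simple to sketch. The following NumPy fragment is our illustration, not the paper's implementation; the `precision_solver` argument is a hypothetical stand-in for the regularized estimator described in this section:

```python
import numpy as np

def learn_ggm(D, precision_solver):
    """Outer steps of Algorithm 1. D is an n-by-t matrix of snapshots.

    `precision_solver` stands in for the regularized estimator described
    in the text; any routine returning a positive-definite n-by-n
    precision matrix fits here.
    """
    mu = D.mean(axis=1)                 # sample mean over the t frames
    prec = precision_solver(np.cov(D))  # regularized precision matrix
    h = prec @ mu                       # h = Sigma^{-1} mu
    return h, prec

# Toy usage; the plain matrix inverse stands in for the solver only so
# that the fragment runs end-to-end.
rng = np.random.default_rng(0)
D = rng.normal(size=(3, 500))
h, prec = learn_ggm(D, np.linalg.inv)
```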

The algorithm produces the sparsest precision matrix that still fits the data (see below). It also guarantees that Σ^{-1} is positive-definite, which means it can be inverted to produce the regularized covariance matrix (as opposed to the sample covariance, which is trivial to compute). This is important because Eqs 1-3 require the covariance matrix, Σ. We further note that a sparse precision matrix does not imply that the corresponding covariance matrix is sparse, nor does a sparse covariance imply that the corresponding precision matrix is sparse. That is, our algorithm is not equivalent to simply thresholding the sample covariance matrix and then inverting it.
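The non-equivalence of the two sparsity patterns is easy to verify numerically. In this small NumPy example (ours, for illustration), a tridiagonal (sparse) precision matrix is inverted, and the resulting covariance matrix has no zero entries at all:

```python
import numpy as np

# A tridiagonal (sparse) precision matrix: each variable interacts
# directly only with its two chain neighbors.
n = 6
prec = np.eye(n) * 2.0
for i in range(n - 1):
    prec[i, i + 1] = prec[i + 1, i] = -0.9

cov = np.linalg.inv(prec)

sparse_entries = np.sum(np.abs(prec) < 1e-12)  # zeros in the precision
dense_entries = np.sum(np.abs(cov) < 1e-12)    # zeros in the covariance
```

The precision matrix has 20 zero entries, while its inverse, the covariance, has none: indirect correlations propagate along the chain even though the direct interactions are sparse.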

#### Learning regularized precision matrices

A straightforward way of learning a GGM is to find the parameters \left(\overrightarrow{\mu},\Sigma \right) that maximize the likelihood of the data (i.e., by finding parameters that maximize {\prod}_{i=1}^{t}P\left({\overrightarrow{d}}_{i}\right)). It is known that a maximum likelihood model can be produced by setting \overrightarrow{\mu} and \Sigma to the sample mean and sample covariance matrix, respectively. Unfortunately, maximum likelihood estimates are prone to over-fitting. This is not surprising because the covariance matrix alone contains *m* = *O*(*n*^{2}) parameters, each of which must be estimated from the data. This is relevant because the number of *independent* samples needed to obtain a statistically robust estimate of Σ grows polynomially in *m*. We note that while modern MD simulations do produce large numbers of samples (i.e., frames), these samples are *not* independent (because they form a time-series), and so the effective sample size is much smaller than the number of frames in the trajectory.

Our algorithm addresses the problem of over-fitting by maximizing the following objective function:

ll\left({\Sigma}^{-1}\mid \mathbf{D}\right)={\sum}_{k=1}^{t}\log P\left({\overrightarrow{d}}_{k}\right)-\lambda \parallel {\Sigma}^{-1}{\parallel}_{1}.

(6)

Here, ||Σ^{-1}||_{1} is the *L*_{1} norm of the precision matrix, defined as the sum of the absolute values of the matrix elements. It can be interpreted as a measure of the complexity of the model: each non-zero element of Σ^{-1} corresponds to a parameter in the model and must be estimated from the data. Thus, Eq. 6 establishes a tradeoff between the log likelihood of the data (the first term) and the complexity of the model (the second term). The scalar λ controls this tradeoff; higher values produce sparser precision matrices. This is our algorithm's only parameter. Its value can be computed analytically [27] from the number of frames in the trajectory and the number of variables. Alternatively, users may elect to adjust λ to obtain precision matrices of a desired sparsity.

Our algorithm maximizes Eq. 6 indirectly, by defining and then solving a convex optimization problem. Using the functional form of P\left(\overrightarrow{d}\right) given by Eq. 1, the log-likelihood of Σ^{-1} can be rewritten as:

ll\left({\Sigma}^{-1}\mid \mathbf{D}\right)=-\log\left(\mid \Sigma \mid \right)-{\sum}_{k=1}^{t}{\left({\overrightarrow{d}}_{k}-\overrightarrow{\mu}\right)}^{T}{\Sigma}^{-1}\left({\overrightarrow{d}}_{k}-\overrightarrow{\mu}\right)-\lambda \parallel {\Sigma}^{-1}{\parallel}_{1}.

Noting that \mid \Sigma \mid =\frac{1}{\mid {\Sigma}^{-1}\mid} and that *trace*(**ABC**) = *trace*(**CAB**), the log-likelihood of Σ^{-1} can then be rewritten as:

ll\left({\Sigma}^{-1}\mid \mathbf{D}\right)=\log\left(\mid {\Sigma}^{-1}\mid \right)-trace\left({\left(\mathbf{D}-\overrightarrow{\mu}\right)}^{T}{\Sigma}^{-1}\left(\mathbf{D}-\overrightarrow{\mu}\right)\right)-\lambda \parallel {\Sigma}^{-1}{\parallel}_{1}.

Next, using the definition of the sample covariance matrix,

\mathbf{S}=\left\langle \left(\mathbf{D}-\left\langle \mathbf{D}\right\rangle \right){\left(\mathbf{D}-\left\langle \mathbf{D}\right\rangle \right)}^{T}\right\rangle ,

we can define the matrix Σ^{-1} that maximizes Eq. 6 as the solution to the following optimization problem:

arg\underset{{\Sigma}^{-1}\succ 0}{max}\;\log\mid {\Sigma}^{-1}\mid -trace\left(\mathbf{S}{\Sigma}^{-1}\right)-\lambda \parallel {\Sigma}^{-1}{\parallel}_{1}.

(7)

We note that *L*_{1} regularization is equivalent to maximizing the likelihood under a Laplace prior and so the solution to Eq. 7 is a *maximum a posteriori* (MAP) estimate of the true precision matrix, as opposed to a maximum likelihood estimate. That is, our algorithm is a Bayesian method. Moreover, the use of *L*_{1} regularization ensures additional desirable properties including *consistency* -- given enough data, the learning procedure learns the true model, and high statistical *efficiency* -- the number of samples needed to achieve this guarantee is small.

We now show that the optimization problem defined in Eq. 7 is smooth and convex and can therefore be solved optimally. First, we consider the dual form of the objective. To obtain the dual, we first rewrite the *L*_{1}-norm as:

\parallel \mathbf{X}{\parallel}_{1}=\underset{\parallel \mathbf{U}{\parallel}_{\infty}\le 1}{max}trace\left(\mathbf{X}\mathbf{U}\right)

where ||**U**||_{∞} denotes the maximum absolute value of the elements of the matrix **U**. Given this reformulation, the primal form of the optimization problem can be rewritten as:
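This variational identity is easy to check numerically. In the following NumPy snippet (our illustration), choosing **U** to carry the signs of the entries of **X** attains the bound, while any other feasible **U** falls below it:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 4))

# The maximizing U picks the sign of each entry of X (transposed so that
# trace(X @ U) pairs X[i, k] with U[k, i]); every entry of U lies in [-1, 1].
U = np.sign(X).T
assert np.max(np.abs(U)) <= 1
assert np.isclose(np.trace(X @ U), np.abs(X).sum())

# Any other matrix with entries in [-1, 1] achieves a smaller trace.
V = rng.uniform(-1, 1, size=(4, 4))
assert np.trace(X @ V) <= np.abs(X).sum() + 1e-12
```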

\underset{{\Sigma}^{-1}\succ 0}{max}\;\underset{\parallel \mathbf{U}{\parallel}_{\infty}\le \lambda}{min}\;\log\mid {\Sigma}^{-1}\mid -trace\left({\Sigma}^{-1}\left(\mathbf{S}+\mathbf{U}\right)\right).

(8)

That is, the optimal Σ^{-1} is the one that maximizes the worst case log likelihood over all additive perturbations of the covariance matrix.

Next, we exchange the *min* and *max* in Eq. 8. The inner *max* in the resulting function can now be solved analytically by calculating the gradient and setting it to zero, yielding:

{\mathbf{U}}^{*}=arg\underset{\parallel \mathbf{U}{\parallel}_{\infty}\le \lambda}{min}\;-\log\mid \mathbf{S}+\mathbf{U}\mid -n,

such that Σ^{-1} = (**S** + **U***)^{-1}.

After one last change of variables, **W** = **S** + **U**, the dual form of Eq. 7 can now be defined as:

{\Sigma}^{*}=arg\underset{\mathbf{W}}{max}\left\{\log\mid \mathbf{W}\mid :\parallel \mathbf{W}-\mathbf{S}{\parallel}_{\infty}\le \lambda \right\}

(9)

Eq. 9 is smooth and convex, and for small values of *n* it can be solved by standard convex multivariate optimization techniques, such as the interior point method. For larger values of *n* we use Block Coordinate Descent [27] instead.
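To illustrate the structure of Eq. 9 for small *n*, the following sketch (ours; a simple projected-gradient ascent, not the interior point or Block Coordinate Descent methods discussed in the text) repeatedly ascends the gradient of log|**W**|, which is **W**^{-1}, and clips each entry back into the box ||**W** − **S**||_{∞} ≤ λ:

```python
import numpy as np

def solve_dual(S, lam, eta=0.01, iters=2000):
    """Projected gradient ascent on Eq. 9 (illustrative only)."""
    W = S + lam * np.eye(S.shape[0])      # feasible, positive-definite start
    for _ in range(iters):
        grad = np.linalg.inv(W)           # gradient of log|W| is W^{-1}
        W = W + eta * grad                # ascent step
        W = np.clip(W, S - lam, S + lam)  # project onto the box constraint
        W = (W + W.T) / 2                 # keep the iterate symmetric
    return W

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 3))
S = np.cov(A.T)       # 3-by-3 sample covariance
W = solve_dual(S, lam=0.1)
```

The returned **W** is a feasible, positive-definite estimate of the regularized covariance Σ; its inverse is the corresponding precision matrix.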

#### Block Coordinate Descent

Given a matrix **A**, let **A**_{\k\j} denote the matrix produced by removing column *k* and row *j* of the matrix, and let **A**_{j} denote column *j* with the diagonal element **A**_{jj} removed. The Block Coordinate Descent algorithm [27] (Algorithm 1) proceeds by optimizing one row and one column of the variable matrix **W** at a time, iterating over all columns until a convergence criterion is met. The **W** produced in each iteration is strictly positive definite, and so the regularized covariance matrix Σ = **W** is invertible.

**Algorithm 1** Block Coordinate Descent

**Require**: Tolerance parameter ε, sample covariance **S**, and regularization parameter λ.

Initialize **W**^{(0)}:= **S** + λ**I** where **I** is the identity matrix.

**repeat**

**for** *j* = 1,... *n* **do**

{y}^{*}=arg\underset{y}{min}\left\{{y}^{T}{\left({\mathbf{W}}_{\backslash j\backslash j}^{\left(j-1\right)}\right)}^{-1}y:\parallel y-{\mathbf{S}}_{j}{\parallel}_{\infty}\le \lambda \right\} {// Here, **W**^{(j-1)} denotes the current iterate.}

Set **W**^{(j)} to **W**^{(j-1)} with **W**_{j} replaced by *y*^{*}.

**end for**

Set **W**^{(0)} = **W**^{(n)}

**until** *trace*((**W**^{(0)})^{-1}**S**) - *n* + λ||(**W**^{(0)})^{-1}||_{1} ≤ ε.

**return W**^{(0)}

The time complexity of this algorithm is *O*(*n*^{4.5}/ε) [27] when converging to a solution within ε > 0 of the optimal. This is better than the O\left({n}^{6}/\log\left(\frac{1}{\epsilon}\right)\right) complexity that would be achieved by applying the interior point method to the dual form [28].
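A compact sketch of the Block Coordinate Descent loop is given below (our illustration, following the formulation of [27] in which each column subproblem is a box-constrained quadratic program over the inverse of **W**_{\j\j}); SciPy's L-BFGS-B solver stands in for a dedicated QP solver:

```python
import numpy as np
from scipy.optimize import minimize

def block_coordinate_descent(S, lam, eps=1e-4, max_sweeps=50):
    """Sketch of Algorithm 1: optimize one row/column of W at a time."""
    n = S.shape[0]
    W = S + lam * np.eye(n)              # initialization from the pseudocode
    for _ in range(max_sweeps):
        for j in range(n):
            idx = [i for i in range(n) if i != j]
            Q = np.linalg.inv(W[np.ix_(idx, idx)])  # quadratic term
            s_j = S[idx, j]
            # Box-constrained QP: min y^T Q y  s.t.  |y - S_j|_inf <= lam.
            res = minimize(lambda y: y @ Q @ y, x0=W[idx, j],
                           jac=lambda y: 2 * Q @ y,
                           bounds=[(s - lam, s + lam) for s in s_j],
                           method="L-BFGS-B")
            W[idx, j] = W[j, idx] = res.x           # update row and column j
        P = np.linalg.inv(W)                        # current precision estimate
        gap = np.trace(P @ S) - n + lam * np.abs(P).sum()
        if gap <= eps:                              # duality-gap stopping rule
            break
    return W

rng = np.random.default_rng(4)
A = rng.normal(size=(80, 4))
S = np.cov(A.T)
W = block_coordinate_descent(S, lam=0.1)
```

Each sweep leaves **W** symmetric and within the box ||**W** − **S**||_{∞} ≤ λ, and the loop exits once the duality gap from the pseudocode drops below ε.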

In summary, the algorithm produces a time-averaged model of the data by computing the sample mean and then constructing the optimal regularized Σ by solving Eq. 9 using Block Coordinate Descent. The regularized covariance matrix Σ is guaranteed to be invertible, which means we can always compute the precision matrix, Σ^{-1}, which can be interpreted as a graph over the variables revealing the direct and indirect correlations between them.

### Algorithm 2

The second algorithm is a straightforward extension of the first. Instead of producing a time-averaged model, it produces a time-varying model: \mathcal{M}\left(\tau \right)=\left(\overrightarrow{h}\left(\tau \right),{\Sigma}^{-1}\left(\tau \right)\right). Here, 1 ≤ *τ* ≤ *T* indexes over sequentially ordered windows of frames in the trajectory. The width of each window, *w*, is a parameter and may be adjusted to learn time-varying models at a particular time-scale. Naturally, a separate time-averaged model could be learned for each window. Instead, the second algorithm applies a simple smoothing kernel so that the parameters of the *τ*th window include information from neighboring windows too. In this way, the algorithm ensures that the parameters of the time-varying model evolve as smoothly as possible, subject to fitting the data.

Let **D**^{(τ)}⊆ **D** denote the subset of frames in the MD trajectory that correspond to the τth window, 1 ≤ *τ* ≤ *T*. The second algorithm solves the following optimization problem for each 1 ≤ *τ* ≤ *T*:

{\Sigma}^{-1}\left(\tau \right)=arg\underset{\mathbf{X}\succ 0}{max}\;\log\mid \mathbf{X}\mid -trace\left(\mathbf{S}\left(\tau \right)\mathbf{X}\right)-\lambda \parallel \mathbf{X}{\parallel}_{1}

Here, **S**(*τ*) is the *weighted covariance matrix*, calculated as follows:

\mathbf{S}\left(\tau \right)=\frac{{\sum}_{k=\tau -\kappa}^{\tau +\kappa}{w}_{k}\left\langle \left({\mathbf{D}}^{\left(k\right)}-\left\langle {\mathbf{D}}^{\left(k\right)}\right\rangle \right){\left({\mathbf{D}}^{\left(k\right)}-\left\langle {\mathbf{D}}^{\left(k\right)}\right\rangle \right)}^{T}\right\rangle}{{\sum}_{k=\tau -\kappa}^{\tau +\kappa}{w}_{k}}

where *k* indexes over windows *τ* − *κ* to *τ* + *κ*, *κ* is a user-specified kernel width, and the weights *w*_{k} are defined by a nonnegative kernel function, which is also specified by the user. In our experiments the kernel mixed the current window and the previous window, with the current window given twice the weight of the previous one. The time-varying model is then constructed by solving Eq. 9 for each **S**(*τ*). That is, the primary difference between the time-averaged and time-varying versions of the algorithm is the kernel function.
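As a concrete sketch (ours) of the weighted covariance computation, the following NumPy fragment implements **S**(*τ*) for the kernel described above, with the current window weighted twice as heavily as the previous one:

```python
import numpy as np

def weighted_cov(windows, tau, weights):
    """S(tau) from a list of window blocks.

    `weights` maps window offsets to kernel weights; {-1: 1.0, 0: 2.0}
    reproduces the kernel in the text (previous window weight 1,
    current window weight 2).
    """
    num, den = 0.0, 0.0
    for offset, w in weights.items():
        Dk = windows[tau + offset]                    # n-by-w block of frames
        Dk_c = Dk - Dk.mean(axis=1, keepdims=True)    # center the window
        num = num + w * (Dk_c @ Dk_c.T) / Dk.shape[1]
        den += w
    return num / den

rng = np.random.default_rng(3)
windows = [rng.normal(size=(4, 100)) for _ in range(5)]
S2 = weighted_cov(windows, tau=2, weights={-1: 1.0, 0: 2.0})
```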

### Algorithm 3

The final algorithm builds on the second algorithm. Recall that the second algorithm learns *T* sequentially ordered models over windows of the trajectory. Moreover, recall that each model encodes a multivariate Gaussian (Eq. 1) and that the KL-divergence between multivariate Gaussians can be computed analytically via Eq. 3. The KL-divergence (also known as information gain or relative entropy) is a non-negative measure of the difference between two probability distributions. It is zero if and only if the two distributions are identical. It is not, however, a distance metric because it is not symmetric; that is, *D*(*P*||*Q*) ≠ *D*(*Q*||*P*) in general. However, it is common to define a symmetric KL-divergence by simply summing: *KL*_{sym} = *D*(*P*||*Q*) + *D*(*Q*||*P*). We can thus cluster the models using any standard clustering algorithm, such as k-means or a hierarchical approach. In our experiments we used complete linkage clustering, an agglomerative method that minimizes the maximum distance between elements when merging clusters.
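A minimal sketch (ours) of this clustering step: the symmetrized KL-divergence between each pair of window models is computed with the standard closed form for multivariate Gaussians, and the resulting distance matrix is fed to SciPy's complete linkage routine. The model parameters below are hypothetical toy values, not MD-derived ones:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def kl_gauss(mu_p, cov_p, mu_q, cov_q):
    """Standard closed-form KL divergence D(P||Q) between Gaussians."""
    n = len(mu_p)
    iq = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(iq @ cov_p) + diff @ iq @ diff - n
                  + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))

# Toy window models (mean, covariance): two tight groups of Gaussians.
models = [(np.zeros(2), np.eye(2)),
          (np.full(2, 0.1), np.eye(2)),
          (np.full(2, 5.0), 2 * np.eye(2)),
          (np.full(2, 5.1), 2 * np.eye(2))]

T = len(models)
dist = np.zeros((T, T))
for i in range(T):
    for j in range(i + 1, T):
        d = (kl_gauss(*models[i], *models[j]) +
             kl_gauss(*models[j], *models[i]))   # symmetrized KL
        dist[i, j] = dist[j, i] = d

# Complete linkage clustering on the pairwise symmetric KL "distances".
Z = linkage(squareform(dist), method="complete")
labels = fcluster(Z, 2, criterion="maxclust")
```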

Let *S* be the set of clusters returned by a clustering algorithm. Our final algorithm treats those clusters as states in a Markov Chain. The prior probability of being in each state can be estimated using free energy calculations [29, 30] for each cluster, or according to the relative sizes of each cluster. It then estimates the transition probabilities between states *i* and *j* by counting the number of times a model assigned to cluster *i* is followed by a model assigned to cluster *j*. This simple approach creates a model that can be used to generate new trajectories by first sampling states from the Markov Chain and then sampling conformations from the models associated with that state.
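The counting scheme can be sketched as follows (our illustration): given the sequence of cluster assignments of the window models, it estimates the prior from the relative cluster sizes and the transition matrix from adjacent pairs in the sequence:

```python
import numpy as np

def markov_chain(labels, n_states):
    """Estimate state priors and transition probabilities from the
    sequence of cluster assignments of the T window models."""
    labels = np.asarray(labels)
    # Prior: relative size of each cluster.
    prior = np.bincount(labels, minlength=n_states) / len(labels)
    # Transitions: count how often state a is immediately followed by b.
    trans = np.zeros((n_states, n_states))
    for a, b in zip(labels[:-1], labels[1:]):
        trans[a, b] += 1
    trans /= np.maximum(trans.sum(axis=1, keepdims=True), 1)  # row-normalize
    return prior, trans

# Toy sequence of cluster assignments for 8 windows.
prior, trans = markov_chain([0, 0, 1, 1, 0, 0, 0, 1], n_states=2)
```

New trajectories can then be generated by sampling a state sequence from `trans` and drawing conformations from the Gaussian model attached to each state.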