The algorithm produces the sparsest precision matrix that still fits the data (see below). It also guarantees that Σ^{-1} is positive-definite, and can therefore be inverted to produce the regularized covariance matrix (unlike the sample covariance, which is trivial to compute but may not be invertible). This is important because Eqs 1-3 require the covariance matrix, Σ. We further note that a sparse precision matrix does not imply that the corresponding covariance matrix is sparse, nor does a sparse covariance imply that the corresponding precision matrix is sparse. That is, our algorithm is not equivalent to simply thresholding the sample covariance matrix and then inverting it.
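The asymmetry between the two sparsity patterns is easy to demonstrate numerically; a minimal sketch (the tridiagonal matrix below is an arbitrary illustrative example, not drawn from our data):

```python
import numpy as np

# A sparse (tridiagonal) precision matrix: each variable interacts
# directly only with its neighbors in a chain.
P = np.diag([2.0] * 5) + np.diag([-0.9] * 4, k=1) + np.diag([-0.9] * 4, k=-1)

# Its inverse -- the covariance matrix -- is dense: indirect
# correlations propagate along the whole chain.
Sigma = np.linalg.inv(P)

print(np.count_nonzero(np.abs(P) > 1e-10))      # 13 nonzeros (sparse)
print(np.count_nonzero(np.abs(Sigma) > 1e-10))  # 25 nonzeros (dense)
```

Thresholding the dense Σ and inverting would not, in general, recover the sparse P, which is why the regularization is applied to the precision matrix directly.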

#### Learning regularized precision matrices

A straightforward way of learning a GGM is to find the parameters (μ, Σ) that maximize the likelihood of the data (i.e., by finding the parameters that maximize *P*(*D* | μ, Σ)). It is known that a maximum likelihood model can be produced by setting the pair (μ, Σ) to the sample mean and covariance matrices, respectively. Unfortunately, maximum likelihood estimates can be prone to over-fitting. This is not surprising because the covariance matrix alone contains *m* = *O*(*n*^{2}) parameters, each of which must be estimated from the data. This is relevant because the number of *independent* samples needed to obtain a statistically robust estimate of Σ grows polynomially in *m*. We note that while modern MD simulations do produce large numbers of samples (i.e., frames), these samples are *not* independent (because they form a time-series), and so the effective sample size is much smaller than the number of frames in the trajectory.
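The gap between the frame count and the effective sample size can be made concrete with a toy autocorrelated series; the sketch below uses an AR(1) process as a stand-in for a correlated MD observable and the standard AR(1) effective-sample-size formula (the process and its parameters are illustrative assumptions, not derived from any trajectory):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative AR(1) series standing in for one correlated MD observable;
# rho sets how strongly successive frames are correlated.
rho, n_frames = 0.95, 10_000
x = np.zeros(n_frames)
for t in range(1, n_frames):
    x[t] = rho * x[t - 1] + rng.normal()

# Estimate the lag-1 autocorrelation and apply the standard AR(1)
# effective-sample-size formula: ESS = N * (1 - rho) / (1 + rho).
rho_hat = np.corrcoef(x[:-1], x[1:])[0, 1]
ess = n_frames * (1 - rho_hat) / (1 + rho_hat)
print(f"{n_frames} frames, but only ~{ess:.0f} effectively independent samples")
```

Even though the trajectory has 10,000 frames, only a few hundred of them carry independent information about Σ.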

Our algorithm addresses the problem of over-fitting by maximizing the following objective function:

log *P*(*D* | μ, Σ) - λ||Σ^{-1}||_{1} (6)

Here, ||Σ^{-1}||_{1} is the *L*_{1} norm of the precision matrix, defined as the sum of the absolute values of the matrix elements. It can be interpreted as a measure of the complexity of the model. In particular, each non-zero element of Σ^{-1} corresponds to a parameter in the model and must be estimated from the data. Thus, Eq. 6 establishes a tradeoff between the log likelihood of the data (the first term) and the complexity of the model (the second term). The scalar value λ controls this tradeoff such that higher values produce sparser precision matrices. This is our algorithm's only parameter. Its value can be computed analytically [27] from the number of frames in the trajectory and the number of variables. Alternatively, users may elect to adjust λ to obtain precision matrices of the desired sparsity.
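The effect of λ on sparsity can be observed with any off-the-shelf solver of the same *L*_{1}-penalized objective; the sketch below uses scikit-learn's `GraphicalLasso` (whose `alpha` parameter plays the role of λ) on synthetic data, and is not our own implementation:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))  # synthetic stand-in for trajectory features

# Larger alpha (the lambda of Eq. 6) -> sparser estimated precision matrix.
nonzeros = []
for alpha in (0.01, 0.2, 0.6):
    model = GraphicalLasso(alpha=alpha).fit(X)
    nonzeros.append(np.count_nonzero(np.abs(model.precision_) > 1e-8))
print(nonzeros)  # non-increasing as alpha grows
```

At the largest penalty every off-diagonal entry is driven exactly to zero, leaving only the diagonal.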

Algorithmically, our algorithm maximizes Eq. 6 in an indirect fashion, by defining and then solving a convex optimization problem. Using the functional form of *P*(**x** | μ, Σ) according to Eq. 1, the log-likelihood of Σ^{-1} can be rewritten as:

log *P*(*D* | μ, Σ) = (1/2) Σ_{i} [ log det Σ^{-1} - (**x**_{i} - μ)^{T} Σ^{-1} (**x**_{i} - μ) ] + *C*

where the sum runs over the frames of the trajectory and *C* is a constant that does not depend on Σ^{-1}.

Noting that a scalar equals its own trace, so that (**x**_{i} - μ)^{T} Σ^{-1} (**x**_{i} - μ) = *trace*((**x**_{i} - μ)^{T} Σ^{-1} (**x**_{i} - μ)), and that *trace*(**ABC**) = *trace*(**CAB**), the log-likelihood of Σ^{-1} can then be rewritten as:

log *P*(*D* | μ, Σ) = (1/2) [ Σ_{i} log det Σ^{-1} - *trace*(Σ^{-1} Σ_{i} (**x**_{i} - μ)(**x**_{i} - μ)^{T}) ] + *C*

Next, using the definition of the sample covariance matrix,

**S** = (1/*T*) Σ_{i=1}^{*T*} (**x**_{i} - μ)(**x**_{i} - μ)^{T}

where *T* is the number of frames, and dropping constants and positive scaling factors (which can be absorbed into λ), we can define the matrix Σ^{-1} that maximizes Eq. 6 as the solution to the following optimization problem:

Σ^{-1*} = argmax_{Σ^{-1} ≻ 0} { log det Σ^{-1} - *trace*(**S**Σ^{-1}) - λ||Σ^{-1}||_{1} } (7)
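The trace manipulations used above can be sanity-checked numerically; a minimal sketch with arbitrary dimensions (the matrices stand in for Σ^{-1} and the frame data):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m_frames = 4, 50
X = rng.normal(size=(m_frames, n))
mu = X.mean(axis=0)
P = np.linalg.inv(np.cov(X.T))  # a stand-in for Sigma^{-1}

# Sum of quadratic forms over frames...
quad = sum((x - mu) @ P @ (x - mu) for x in X)

# ...equals trace(Sigma^{-1} @ scatter matrix), via "a scalar is its
# own trace" plus the cyclic identity trace(ABC) = trace(CAB).
scatter = (X - mu).T @ (X - mu)
print(np.isclose(quad, np.trace(P @ scatter)))  # True
```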
We note that *L*_{1} regularization is equivalent to maximizing the likelihood under a Laplace prior, and so the solution to Eq. 7 is a *maximum a posteriori* (MAP) estimate of the true precision matrix, as opposed to a maximum likelihood estimate. That is, our algorithm is a Bayesian method. Moreover, the use of *L*_{1} regularization ensures additional desirable properties, including *consistency* (given enough data, the learning procedure learns the true model) and high statistical *efficiency* (the number of samples needed to achieve this guarantee is small).
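The equivalence can be made explicit; a brief sketch in standard notation, where p(Σ^{-1}) denotes the Laplace prior:

```latex
% Laplace prior over the entries of the precision matrix:
%   p(\Sigma^{-1}) \propto \exp\left(-\lambda \|\Sigma^{-1}\|_1\right)
% The MAP estimate maximizes the log posterior:
\hat{\Sigma}^{-1}_{\text{MAP}}
  = \arg\max_{\Sigma^{-1}} \Big[ \log P(D \mid \mu, \Sigma)
      + \log p(\Sigma^{-1}) \Big]
  = \arg\max_{\Sigma^{-1}} \Big[ \log P(D \mid \mu, \Sigma)
      - \lambda \|\Sigma^{-1}\|_1 \Big]
```

since additive constants from the prior's normalizer do not affect the maximizer; this is exactly the form of Eq. 6.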

We now show that the optimization problem defined in Eq. 7 is convex and can therefore be solved optimally. First, we consider the dual form of the objective. To obtain the dual, we first rewrite the *L*_{1}-norm as:

λ||Σ^{-1}||_{1} = max_{||**U**||_{∞} ≤ λ} *trace*(Σ^{-1}**U**)

where ||**U**||_{∞} denotes the maximum absolute value element of the matrix **U**. Given this change of formulation, the primal form of the optimization problem can be rewritten as:

Σ^{-1*} = argmax_{Σ^{-1} ≻ 0} min_{||**U**||_{∞} ≤ λ} { log det Σ^{-1} - *trace*(Σ^{-1}(**S** + **U**)) } (8)

That is, the optimal Σ^{-1} is the one that maximizes the worst-case log likelihood over all additive perturbations **U** of the sample covariance matrix.
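The variational rewrite of the *L*_{1} norm can itself be verified numerically: the inner maximum is attained by saturating each entry of **U** at ±λ with the sign of the corresponding entry of Σ^{-1}. A small sketch with an arbitrary symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 0.3
A = rng.normal(size=(4, 4))
X = A + A.T  # symmetric, playing the role of Sigma^{-1}

l1_term = lam * np.abs(X).sum()  # lambda * ||X||_1

# The inner max is attained by saturating U at +/- lambda * sign(X):
U_star = lam * np.sign(X)
print(np.isclose(np.trace(X @ U_star), l1_term))  # True

# No other feasible perturbation (||U||_inf <= lambda) does better:
for _ in range(200):
    U = lam * rng.uniform(-1.0, 1.0, size=(4, 4))
    assert np.trace(X @ U) <= l1_term + 1e-12
```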

Next, we exchange the *min* and *max* in Eq. 8. The inner *max* in the resulting function can now be solved analytically by calculating the gradient with respect to Σ^{-1}, namely Σ - (**S** + **U**), and setting it to zero, which gives Σ^{-1} = (**S** + **U**)^{-1}. The objective can thus be written as:

**U*** = argmin_{||**U**||_{∞} ≤ λ} { - log det(**S** + **U**) - *n* }

such that Σ^{-1} = (**S** + **U***)^{-1}.
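That the inner maximum is attained at Σ^{-1} = (**S** + **U**)^{-1} follows from concavity of the objective; it can be checked numerically with an arbitrary positive-definite stand-in for **S** + **U**:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
B = rng.normal(size=(n, n))
A = B @ B.T + n * np.eye(n)  # positive-definite stand-in for S + U

def f(X):
    # Inner objective: log det X - trace(X (S + U))
    return np.linalg.slogdet(X)[1] - np.trace(X @ A)

X_star = np.linalg.inv(A)  # stationary point: gradient X^{-1} - (S + U) = 0

# No small symmetric perturbation improves on X_star (f is concave):
for _ in range(100):
    D = 1e-3 * rng.normal(size=(n, n))
    D = (D + D.T) / 2
    assert f(X_star + D) <= f(X_star) + 1e-9
```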

After one last change of variables, **W** = **S** + **U**, the dual form of Eq. 7 can now be defined as:

**W*** = argmax_{**W**} { log det **W** : ||**W** - **S**||_{∞} ≤ λ } (9)
Eq. 9 is smooth and convex, and for small values of *n* it can be solved by standard convex multivariate optimization techniques, such as the interior point method. For larger values of *n* we use Block Coordinate Descent [27] instead.

#### Block Coordinate Descent

Given a matrix **A**, let **A**_{\k\j} denote the matrix produced by removing column *k* and row *j* of the matrix. Let **A**_{j} also denote column *j* with the diagonal element **A**_{jj} removed. The Block Coordinate Descent algorithm [27] (Algorithm 1) proceeds by optimizing one row and one column of the variable matrix **W** at a time. The algorithm iteratively optimizes all columns until a convergence criterion is met. The **W**s produced in each iteration are strictly positive definite, and so the regularized covariance matrix Σ = **W** is invertible.

**Algorithm 1** Block Coordinate Descent

**Require**: Tolerance parameter ε, sample covariance **S**, and regularization parameter λ.

Initialize **W**^{(0)} := **S** + λ**I**, where **I** is the identity matrix.

**repeat**

**for** *j* = 1, ..., *n* **do**

{// Here, **W**^{(j-1)} denotes the current iterate.}

Solve *y** := argmin_{*y*} { *y*^{T} (**W**^{(j-1)}_{\j\j})^{-1} *y* : ||*y* - **S**_{j}||_{∞} ≤ λ }

Set **W**^{(j)} to **W**^{(j-1)} such that **W**_{j} is replaced by *y**.

**end for**

Set **W**^{(0)} := **W**^{(n)}

**until** *trace*((**W**^{(0)})^{-1}**S**) - *n* + λ||(**W**^{(0)})^{-1}||_{1} ≤ ε.

**return W**^{(0)}
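For concreteness, the procedure can be sketched in a few lines of Python. This is an illustrative implementation, not our production code: it delegates the inner box-constrained quadratic program (the per-column update) to SciPy's general-purpose L-BFGS-B solver, and the data dimensions and λ in the example are arbitrary:

```python
import numpy as np
from scipy.optimize import minimize

def block_coordinate_descent(S, lam, eps=1e-4, max_sweeps=200):
    """Illustrative sketch of Algorithm 1; returns the regularized covariance W."""
    n = S.shape[0]
    W = S + lam * np.eye(n)  # W^(0) := S + lambda * I
    for _ in range(max_sweeps):
        for j in range(n):
            idx = np.array([i for i in range(n) if i != j])
            M_inv = np.linalg.inv(W[np.ix_(idx, idx)])  # (W_{\j\j})^{-1}
            s_j = S[idx, j]                              # S_j, diagonal removed
            # Inner problem: min_y y^T (W_{\j\j})^{-1} y
            #                s.t. ||y - S_j||_inf <= lambda
            res = minimize(
                lambda y: y @ M_inv @ y,
                x0=W[idx, j],
                jac=lambda y: 2.0 * (M_inv @ y),
                bounds=[(sij - lam, sij + lam) for sij in s_j],
                method="L-BFGS-B",
            )
            W[idx, j] = W[j, idx] = res.x  # replace column/row j with y*
        # Convergence test from Algorithm 1 (a duality gap):
        P = np.linalg.inv(W)
        gap = np.trace(P @ S) - n + lam * np.abs(P).sum()
        if gap <= eps:
            break
    return W

# Example on synthetic data:
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
S_hat = np.cov(X.T)
W_hat = block_coordinate_descent(S_hat, lam=0.1)
```

Note that the returned **W** satisfies the dual constraint ||**W** - **S**||_{∞} ≤ λ by construction, since the bounds of the inner solver enforce it for the off-diagonal entries and the diagonal is fixed at **S**_{jj} + λ.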

The time complexity of this algorithm is *O*(*n*^{4.5}/ε) [27] when converging to a solution within ε > 0 of the optimal. This complexity is better than the *O*(*n*^{6} log(1/ε)) that would have been achieved by applying the interior point method to the dual form [28].

In summary, the algorithm produces a time-averaged model of the data by computing the sample mean and then constructing the optimal regularized Σ by solving Eq. 9 using Block Coordinate Descent. The regularized covariance matrix Σ is guaranteed to be invertible, which means we can always compute the precision matrix, Σ^{-1}. The precision matrix, in turn, can be interpreted as a graph over the variables that distinguishes direct correlations from indirect ones.