### Bayesian Networks - BNs

Bayesian Networks (BNs) are a combination of probability theory and graph theory. A graphical structure \mathcal{M}, a family of conditional probability distributions \mathcal{F} and their parameters **q** fully specify a BN. The graphical structure \mathcal{M} of a BN consists of a set of nodes and a set of directed edges. The nodes represent random variables, while the edges indicate conditional dependence relations. The family of conditional probability distributions \mathcal{F} and their parameters **q** specify the functional form of the conditional probabilities associated with the edges, that is, they indicate the nature of the interactions between nodes and the intensity of these interactions. A BN is characterized by a simple and unique rule for expanding the joint probability in terms of simpler conditional probabilities. This follows the local Markov property: *A node is conditionally independent of its non-descendants given its parents*. Due to this property, the structure \mathcal{M} of a BN must be a directed acyclic graph (DAG), that is, a network without any directed cycles. Let *X*_{1}, *X*_{2}, ..., *X*_{N} be a set of random variables represented by the nodes *i* ∈ {1, ..., *N*} in the graph, define *π*_{i}[*G*] to be the parents of node *X*_{i} in graph *G*, and let {X}_{{\pi}_{i}\left[G\right]} represent the set of random variables associated with *π*_{i}[*G*]. Then we can write the expansion for the joint probability as P\left({X}_{1},...,{X}_{N}\right)={\prod}_{i=1}^{N}P\left({X}_{i}|{X}_{{\pi}_{i}\left[G\right]}\right).
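To make the factorization concrete, here is a minimal sketch for a toy three-node DAG A → C ← B with binary variables; the conditional probability tables are illustrative values of ours, not taken from the text.

```python
p_a = {0: 0.6, 1: 0.4}                       # P(A)
p_b = {0: 0.7, 1: 0.3}                       # P(B)
p_c = {(0, 0): {0: 0.9, 1: 0.1},             # P(C | A, B)
       (0, 1): {0: 0.5, 1: 0.5},
       (1, 0): {0: 0.4, 1: 0.6},
       (1, 1): {0: 0.1, 1: 0.9}}

def joint(a, b, c):
    """Local Markov property: each factor conditions only on its parents."""
    return p_a[a] * p_b[b] * p_c[(a, b)][c]

# The product of local factors defines a proper joint distribution:
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Note that storing three local tables requires far fewer parameters than the full joint table, which is the practical payoff of the factorization.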

The task of learning a BN structure in a score-based approach consists in devising a BN structure from a given set of training data \mathcal{D}. The main aim is to find the DAG structure that best explains the data available for learning. If we define M as the space of all models, the first goal is to find a model {\mathcal{M}}^{*}\in M that is most supported by the data \mathcal{D}, {\mathcal{M}}^{*}=\text{arg}\underset{\mathcal{M}}{\text{max}}\left\{P\left(\mathcal{M}|\mathcal{D}\right)\right\}. Having the best structure {\mathcal{M}}^{*} and the data \mathcal{D}, we can then find the best parameters, {q}^{*}=\text{arg}\underset{q}{\text{max}}\left\{P\left(q|{\mathcal{M}}^{*},\mathcal{D}\right)\right\}. If we apply Bayes' rule we get P\left(\mathcal{M}|\mathcal{D}\right)\propto P\left(\mathcal{D}|\mathcal{M}\right)P\left(\mathcal{M}\right), where the marginal likelihood implies an integration over the whole parameter space:

P\left(\mathcal{D}|\mathcal{M}\right)=\int P\left(\mathcal{D}|q,\mathcal{M}\right)P\left(q|\mathcal{M}\right)dq

(1)

The integral in Equation 1, our score, is analytically tractable when the data are complete and the prior P\left(q|\mathcal{M}\right) and the likelihood P\left(\mathcal{D}|q,\mathcal{M}\right) satisfy certain regularity conditions [17, 18].
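As a hedged one-node illustration of why conjugacy makes the integral in Equation 1 tractable: for a single binary variable with a Beta(α, β) prior on its Bernoulli parameter q, the integral over q has the closed form B(α + k, β + n − k) / B(α, β) for k successes in n observations (discrete-BN scores such as BDe multiply one such term per node and parent configuration; this toy is ours, not the paper's score).

```python
from math import lgamma, exp

def log_beta_fn(a, b):
    """Log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_likelihood(k, n, alpha=1.0, beta=1.0):
    """Closed-form log of  integral_0^1 q^k (1-q)^(n-k) * Beta prior dq."""
    return log_beta_fn(alpha + k, beta + n - k) - log_beta_fn(alpha, beta)

# With a uniform Beta(1, 1) prior, k = 3 successes in n = 5 trials give
# marginal likelihood 3! * 2! / 6! = 1/60.
ml = exp(log_marginal_likelihood(3, 5))
```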

According to Equation 1 we have a way to assign a score to a graphical structure given a data set. However, the search for high-scoring structures is not trivial. It is infeasible to enumerate the whole set of structures because their number increases super-exponentially with the number of nodes. Moreover, when considering a sparse data set, P\left(\mathcal{M}|\mathcal{D}\right) is diffuse, meaning that it will not be properly represented by a single structure {\mathcal{M}}^{*}. Hence, a Markov chain Monte Carlo (MCMC) scheme is adopted [19], which under fairly general regularity conditions is theoretically guaranteed to converge to the posterior distribution [20]. Given a network structure {\mathcal{M}}_{\mathsf{\text{old}}}, a network structure {\mathcal{M}}_{\mathsf{\text{new}}} is proposed from the proposal distribution \mathcal{Q}\left({\mathcal{M}}_{\mathsf{\text{new}}}|{\mathcal{M}}_{\mathsf{\text{old}}}\right), which is then accepted according to the standard Metropolis-Hastings [20] scheme with the following acceptance probability:

A=\text{min}\left\{\frac{P\left(\mathcal{D}|{\mathcal{M}}_{\mathsf{\text{new}}}\right)P\left({\mathcal{M}}_{\mathsf{\text{new}}}\right)Q\left({\mathcal{M}}_{\mathsf{\text{old}}}|{\mathcal{M}}_{\mathsf{\text{new}}}\right)}{P\left(\mathcal{D}|{\mathcal{M}}_{\mathsf{\text{old}}}\right)P\left({\mathcal{M}}_{\mathsf{\text{old}}}\right)Q\left({\mathcal{M}}_{\mathsf{\text{new}}}|{\mathcal{M}}_{\mathsf{\text{old}}}\right)},1\right\}

(2)

In this paper we use the standard MCMC proposal, which consists in proposing, at each iteration, one of the basic operations of adding, removing or reversing an edge. For more details about this scheme see [10].
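The single-edge proposal move can be sketched as follows (helper names are ours); a cycle check keeps the chain inside the space of DAGs by falling back to the old graph when a proposed move would create a directed cycle.

```python
import random

def has_cycle(edges, n):
    """Recursive DFS cycle check (colors: 0 unvisited, 1 on stack, 2 done)."""
    adj = {u: [v for (a, v) in edges if a == u] for u in range(n)}
    color = [0] * n
    def dfs(u):
        color[u] = 1
        for v in adj[u]:
            if color[v] == 1 or (color[v] == 0 and dfs(v)):
                return True
        color[u] = 2
        return False
    return any(color[u] == 0 and dfs(u) for u in range(n))

def propose(old_edges, n, rng):
    """One add / remove / reverse move; stay at the old DAG on a cycle."""
    new = set(old_edges)
    move = rng.choice(["add", "remove", "reverse"])
    if move == "add":
        pool = [(u, v) for u in range(n) for v in range(n)
                if u != v and (u, v) not in new]
    else:
        pool = list(new)
    if not pool:                          # no applicable move of this kind
        return new
    u, v = rng.choice(pool)
    new.discard((u, v))
    if move == "add":
        new.add((u, v))
    elif move == "reverse":
        new.add((v, u))
    return new if not has_cycle(new, n) else set(old_edges)
```

In a full sampler the proposal-neighbourhood sizes of the old and new graphs would also enter the Hastings ratio of Equation 2; they are omitted here for brevity.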

### Bayesian Networks with Interventions - BN-I

Nowadays molecular biology has different techniques for producing interventions in biological systems, for instance, knocking genes down with RNA interference or transposon mutagenesis. The consequence is that the components of the system which are targeted by the interventions are no longer subject to the internal dynamics of the system under investigation. The intervened components can be either activated (up-regulated) or inhibited (down-regulated), and under this external intervention their values are no longer stochastic but deterministic. The other components, which are not intervened, are nevertheless influenced by these deterministic values. Therefore, interventions are very useful to break the symmetries within the equivalence classes of BNs and consequently to discover putative causal relationships. For a discussion about equivalence classes see [21] and for a discussion about putative causal relationships see [12, 22].

In order to incorporate the interventions under the BN framework two small modifications are necessary. First, the calculation of the score for observational data P\left(\mathcal{D}|\mathcal{M}\right) as defined in Equation (1) is modified: the measurements of a node *X*_{i} under intervention are removed from the computation of the score.
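A minimal sketch of this masking (helper names ours): when scoring node *i*, exactly those samples in which node *i* itself was clamped are dropped, while samples where only other nodes were intervened on are kept, since node *i* still responds to their fixed values.

```python
def rows_for_node(intervention_targets, node):
    """intervention_targets[r] is the set of nodes clamped in sample r.
    Return the sample indices usable when scoring `node`."""
    return [r for r, targets in enumerate(intervention_targets)
            if node not in targets]

# Five toy samples: two purely observational, three with interventions.
targets = [set(), {0}, {1}, {0, 2}, set()]
rows_node0 = rows_for_node(targets, 0)    # node 0 clamped in samples 1, 3
rows_node1 = rows_for_node(targets, 1)    # node 1 clamped in sample 2
```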

The second necessary modification is related to the definition of equivalence classes. In [23] Transition Sequence equivalent networks (TS-equivalent) are defined: two networks {\mathcal{M}}_{1} and {\mathcal{M}}_{2} are TS-equivalent if and only if they have the same skeleton, the same set of v-structures and the same set of parents for all manipulated variables. All edges connected with an intervened node become directed when the concept of TS-equivalence is applied; therefore, new v-structures are formed and further edges become directed. In order to obtain the TS-equivalent DAG the procedure presented by [24] is applied. For each intervened node in the network two dummy nodes are added, each with one directed edge pointing from the dummy node to the intervened node. The new DAG, now with the dummy nodes added, is converted to a CPDAG (for a discussion about CPDAGs see [25]). Finally the dummy nodes are removed and we have the TS-equivalent DAG.

The sampling scheme of BN-I is the same as for BNs and is given by Equation 2.

### Bayesian Networks with addition of Extra knowledge - BN-E

In order to be able to incorporate extra knowledge in the inference of networks it is necessary to define a function that measures the agreement between a given network structure \mathcal{M} and the extra knowledge that we have at our disposal, \mathcal{B}. We call this agreement measure 'energy' following the approach proposed in [26].

A network structure \mathcal{M} is represented by an adjacency matrix where each entry {\mathcal{M}}_{ij} can be either 0 or 1, representing respectively the absence and the presence of an edge between node-*i* and node-*j*. The prior knowledge matrix, or belief matrix, \mathcal{B}, is defined to have entries {\mathcal{B}}_{ij} ∈ [0, 1] representing our knowledge about the node interactions. An entry {\mathcal{B}}_{ij}=0.5 denotes that we do not have any information about the presence or absence of an edge between node-*i* and node-*j*. If 0\le {\mathcal{B}}_{ij}<0.5 we have prior evidence that the edge between node-*i* and node-*j* is absent, and the evidence is stronger as {\mathcal{B}}_{ij} gets closer to 0. Finally, if 0.5<{\mathcal{B}}_{ij}\le 1 we have prior evidence that there is a directed edge pointing from node-*i* to node-*j*, and the evidence is stronger as {\mathcal{B}}_{ij} gets closer to 1. It is important to note that the entries in our belief matrix are not proper probabilities; they only express our belief, or knowledge obtained from other sources, about the relationships among nodes.

Having defined how to represent a BN structure, \mathcal{M}, and the extra belief, \mathcal{B}, the energy of a network is defined as:

E\left(\mathcal{M}\right)=\sum _{i,j=1}^{N}|{\mathcal{B}}_{i,j}-{\mathcal{M}}_{i,j}|

(3)

where *N* is the total number of nodes in the studied domain. From Equation 3 it is clear that the energy *E* is zero for a perfect match between the prior knowledge \mathcal{B} and the actual network structure \mathcal{M}, while increasing values of *E* indicate an increasing mismatch between \mathcal{B} and \mathcal{M}.
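A minimal sketch of Equation 3 on a two-node toy example (the matrices are illustrative): a belief matrix identical to the adjacency matrix yields zero energy, while a flat 0.5 belief contributes 0.5 per entry.

```python
def energy(B, M):
    """Equation 3: elementwise L1 mismatch between belief and adjacency."""
    n = len(M)
    return sum(abs(B[i][j] - M[i][j]) for i in range(n) for j in range(n))

M = [[0, 1],
     [0, 0]]                      # a single edge from node 0 to node 1
B_exact = [[0.0, 1.0],
           [0.0, 0.0]]            # perfect prior knowledge: E = 0
B_flat = [[0.5, 0.5],
          [0.5, 0.5]]             # uninformative prior: E = 4 * 0.5 = 2
e_exact = energy(B_exact, M)
e_flat = energy(B_flat, M)
```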

Following the work of [26] we integrate the prior knowledge expressed by Equation 3 into the inference procedure, and define the prior distribution over network structures, P\left(\mathcal{M}|\beta \right), to take the form of a Gibbs distribution:

P\left(\mathcal{M}|\beta \right)=\frac{{e}^{-\beta E\left(\mathcal{M}\right)}}{\sum _{\mathcal{M}\in M}{e}^{-\beta E\left(\mathcal{M}\right)}}

(4)

where the energy E\left(\mathcal{M}\right) was defined in Equation 3, *β* is a hyper-parameter that corresponds to an inverse temperature in statistical physics, and the denominator is a normalizing constant that is usually referred to as the partition function. Note that the summation in the denominator extends over the set of all possible network structures. The hyper-parameter *β* can be interpreted as a factor that indicates the strength of the influence of the prior knowledge relative to the data. For *β* → 0, the prior distribution defined in Equation 4 becomes flat and uninformative about the network structure. Conversely, for *β* → ∞, the prior distribution becomes sharply peaked at the network structure with the lowest energy.
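The two limits of *β* can be checked numerically on a hypothetical model space of three structures with energies 0, 1 and 2 (made-up values, purely for illustration):

```python
from math import exp

def gibbs_prior(energies, beta):
    """Equation 4 over an explicit list of structure energies."""
    weights = [exp(-beta * e) for e in energies]
    z = sum(weights)                      # the partition function
    return [w / z for w in weights]

uniform = gibbs_prior([0.0, 1.0, 2.0], 0.0)    # beta -> 0: flat prior
peaked = gibbs_prior([0.0, 1.0, 2.0], 10.0)    # large beta: mass on E = 0
```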

For Dynamic Bayesian Networks the summation in the denominator of Equation 4 can be computed exactly and efficiently as discussed in [16] with

Z\left(\beta \right)=\prod _{n}\sum _{{\pi}_{\mathcal{M}}\left(n\right)}{e}^{-\beta \epsilon \left(n,{\pi}_{\mathcal{M}}\left(n\right)\right)}.

(5)

In this paper we apply the method only to static BNs and thus the summation in the denominator of Equation 4 is in fact an upper bound to the true value. This happens because this summation includes all possible structures and we are only interested in the DAG structures. Furthermore, throughout this paper we use a fan-in restriction of three as has been proposed in several other applications, for instance see [27–29]. This fan-in restriction makes the summation over all structures closer to the summation of only the DAGs as it reduces the number of densely connected networks. The partition function approximation has been investigated elsewhere [16, 30] and was not found to pose a problem to the proposed method.

### BN-E MCMC sampling scheme

At this point, having defined the prior probability distribution over network structures, we propose an MCMC scheme to sample both the hyper-parameter and the network structures from the posterior distribution P\left(\mathcal{M},\beta |\mathcal{D}\right).

A new network structure {\mathcal{M}}_{\mathsf{\text{new}}} and a new hyper-parameter {\beta}_{\mathsf{\text{new}}} are proposed respectively from the proposal distributions \mathcal{Q}\left({\mathcal{M}}_{\mathsf{\text{new}}}|{\mathcal{M}}_{\mathsf{\text{old}}}\right) and \mathcal{R}\left({\beta}_{\mathsf{\text{new}}}|{\beta}_{\mathsf{\text{old}}}\right). This proposed move is then accepted according to the Metropolis-Hastings update rule [20] with the following acceptance probability:

A=\text{min}\left\{\frac{P\left(\mathcal{D}|{\mathcal{M}}_{\mathsf{\text{new}}}\right)}{P\left(\mathcal{D}|{\mathcal{M}}_{\mathsf{\text{old}}}\right)}\times \frac{P\left({\mathcal{M}}_{\mathsf{\text{new}}}|{\beta}_{\mathsf{\text{new}}}\right)}{P\left({\mathcal{M}}_{\mathsf{\text{old}}}|{\beta}_{\mathsf{\text{old}}}\right)}\times \frac{P\left({\beta}_{\mathsf{\text{new}}}\right)}{P\left({\beta}_{\mathsf{\text{old}}}\right)}\times \frac{Q\left({\mathcal{M}}_{\mathsf{\text{old}}}|{\mathcal{M}}_{\mathsf{\text{new}}}\right)}{Q\left({\mathcal{M}}_{\mathsf{\text{new}}}|{\mathcal{M}}_{\mathsf{\text{old}}}\right)}\times \frac{R\left({\beta}_{\mathsf{\text{old}}}|{\beta}_{\mathsf{\text{new}}}\right)}{R\left({\beta}_{\mathsf{\text{new}}}|{\beta}_{\mathsf{\text{old}}}\right)},1\right\}

(6)

which was expanded following the conditional independence relationships depicted in Figure 3.

In order to increase the acceptance probability, which in turn can improve the mixing and convergence of the Markov chain, the move is separated into two sub-moves. In the first sub-move a new network structure {\mathcal{M}}_{\mathsf{\text{new}}} is sampled from the proposal distribution Q\left({\mathcal{M}}_{\mathsf{\text{new}}}|{\mathcal{M}}_{\mathsf{\text{old}}}\right) while keeping the hyper-parameter *β* fixed. Next, we sample a new hyper-parameter *β* from the proposal distribution R\left({\beta}_{\mathsf{\text{new}}}|{\beta}_{\mathsf{\text{old}}}\right) for a fixed network structure \mathcal{M}. The two sub-moves are iterated until a convergence criterion is satisfied.
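The first sub-move has a convenient property worth noting: with *β* held fixed, the partition function of Equation 4 cancels in the prior ratio, which reduces to exp(−*β*(E_new − E_old)). A hedged sketch (our helper names; `log_lik_*` stands in for the log marginal likelihood of Equation 1, and the proposal ratio is taken to be 1 for simplicity):

```python
import random
from math import exp

def accept_structure(log_lik_new, log_lik_old, e_new, e_old, beta, rng):
    """Metropolis-Hastings test for the structure sub-move at fixed beta."""
    log_ratio = (log_lik_new - log_lik_old) - beta * (e_new - e_old)
    a = exp(min(0.0, log_ratio))          # acceptance probability min(1, ratio)
    return rng.random() < a
```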

### Data sets

One very interesting aspect when comparing different methods for inferring network structures is how they perform when faced with real data sets. In our case a real data set means data obtained from real experiments on a real biological system. Moreover, the comparison among the methods with real data is only possible if the network from which the data were generated is known. We call this known network the *gold-standard* network. Taking these considerations into account, we use data from flow cytometry experiments obtained by [31], where the Raf signalling pathway, see Figure 4, was studied. This particular data set is very interesting as it provides high-quality measurements, large amounts of data, intervened data and a *gold-standard* network.

Because the interest is to compare the BN approaches in the context of network inference, where the data available are usually sparse, we downsampled the original data to 100 data points. Furthermore, we average the results over five data sets. The observational data are obtained from the original data where no interventions were realized. The interventional data are sampled from all the interventions realized in the original data and comprise: 16 data points without intervention; 14 data points for each of the inhibited proteins (*AKT, PKC, PIP2, MEK*); and 14 data points for each of the activated proteins (*PKC, PKA*), giving a total of 100 data points.

In order to further investigate how the methods compare, synthetic data sets were also prepared. These data are obtained from two different sources: a linear Gaussian distribution and a simulation tool named Netbuilder [32, 33]. In both cases, the data are obtained from the known structure of Figure 4.

Considering \mathcal{N}(.) to denote the Normal distribution, a random variable *x*_{i} is sampled from a linear Gaussian distribution with value distributed according to {x}_{i}~\mathcal{N}\left({\sum}_{k}{w}_{ik}{x}_{k},{\sigma}^{2}\right), where the sum extends over all parents of node *i* and *x*_{k} represents the value of node *k*. The standard deviation is set to *σ* = 0.1 and the interaction strengths |*w*_{ik}| are sampled from the uniform distribution over the interval [0.5, 2], with the sign of *w*_{ik} randomly varied.
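This generator can be sketched by sampling nodes in topological order; here a toy chain 0 → 1 → 2 stands in for the Raf pathway, with illustrative weight draws (|w| in [0.5, 2], random sign) and σ = 0.1 as in the text.

```python
import random

parents = {0: [], 1: [0], 2: [1]}         # toy DAG: 0 -> 1 -> 2
w = {(1, 0): 1.2, (2, 1): -0.8}           # illustrative interaction strengths
sigma = 0.1

def sample(rng):
    """Draw one joint sample x_i ~ N(sum_k w_ik * x_k, sigma^2)."""
    x = {}
    for i in (0, 1, 2):                   # visit nodes in topological order
        mean = sum(w[(i, k)] * x[k] for k in parents[i])
        x[i] = rng.gauss(mean, sigma)
    return x

rng = random.Random(0)
data = [sample(rng) for _ in range(100)]  # 100 points, as in the down-sampling
```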

In Netbuilder a sigma-pi formalism is implemented as an approximation to the solution of a set of Ordinary Differential Equations that model enzyme-substrate reactions, allowing the generation of data typical of the signals measured in molecular biology. The data sets simulated with Netbuilder are thus closer to real data than the Gaussian data. For more details about the data generation see [11].