Bayesian networks
Let D = (V, E) be a directed acyclic graph (DAG), where V is a finite set of nodes and E is a finite set of directed edges between the nodes [19]. The DAG defines the structure of the BN.
Each node v ∈ V in the graph corresponds to a random variable x
v
. The set of variables associated with the graph D is then X = {x
v
}. Often we do not distinguish between a variable x
v
and the corresponding node v. To each node v with parents pa(v), a local probability distribution, p(x
v
|x
pa
(
v
)), is attached. The set of local probability distributions for all variables in the network is P. A BN for a set of random variables X is the pair (D,P). Directed edges in D encode conditional dependencies between the random variables X through the factorization of the joint probability distribution.
(1)
As a measure of how well a DAG D represents the conditional dependencies between the random variables, we use the relative probability
(2)
and refer to it as a network score, where d is data and p(d|D) is called the likelihood of D.
The log network score contribution of a node is evaluated whenever the node is learned. The log network score N(D) is given by
(3)
The number of possible DAGs grows exponentially with the number of nodes, and the problem of identifying the network with the highest score is NP-hard. If the number of random variables in a network is large, it is not computationally possible to calculate the network score for all possible DAGs. For these situations, the search strategy GHC method is implemented.
The GHC method is as follows.
-
1.
Select an initial DAG D0 randomly from which to start the search.
-
2.
Calculate the Bayes scores of D0 and all possible networks that differ by only one directed edge, that is, an edge is added to D0, an edge in D0 is deleted, or the direction of an edge in D0 is reversed.
-
3.
Among all these networks, select the one that increases the Bayes score the most.
-
4.
If the Bayes score was not improved, stop the search. Otherwise, make the select network D0 and repeat from step 2.
In the GHC method, we can limit the maximum number of these steps in the search algorithm. Also, the search algorithm can restart an arbitrary number of times. More details on the parameter setting will be described later in this paper.
Methods
We propose a new method to estimate a gene regulatory network with reduced computational time. The proposed method is composed of three steps: dividing the whole problem into partial problems, estimating gene regulatory networks of partial problems, and uniting the estimated networks. In this section, we describe our BN-based method using the analysis of a set of expression data as an example. This example includes five genes V = {v
i
|1 ≤ i ≤ 5}. A conceptual representation of our approach is presented in Figure 1. We call a search of all possible networks an exhaustive search to distinguish it from the GHC method.
Step 1: Dividing the whole problem into partial problems
Our approach first divides the set of all genes V into all the combinational subset with three genes (triplets) t = {v
i
, v
j
, v
k
∈ V|1 ≤ i <j <k ≤ 5}. For example, our approach obtains 5C3 = 10 partial problems {v1, v2, v3}.{v1, v2, v4}, ..., {v3, v4, v5}.
Step 2: Estimating gene regulatory networks
After making partial problems, we next calculate independently the scores of all the possible networks of each partial problem by exhaustive search and obtain estimated DAGs G. The number of possible alternative networks for a triplet {v1, v2, v3} is 33 = 27 because there are three cases for each potential edge (v
i
, v
j
) (1 ≤ i <j ≤ 3): a directed edge from v
i
to v
j
, a directed edge from v
j
to v
i
, and no edge.
Let c = (D, S
D
, R
D
) be a tuple, where D ∈ G is a DAG, S
D
= p(D, d) is a score of D, where p(D, d) is given by Equation 2, and R
D
is a rank of D.
We add tuples of all the partial problems to Z, where Z is a set of c. For example, when we have 10 partial problems {v1, v2, v3}.{v1, v2, v4}, ... , {v3, v4, v5}, we add 270 tuples of networks to Z.
Step 3: Uniting estimated partial problems
To solve the original problem, this step unites three-gene networks into a single gene regulatory network. The policy of the step is to classify relationships between genes, i.e., determine (v
i
, v
j
) (1 ≤ i <j ≤ 3) into one of the three edge types (a directed edge from v
i
to v
j
, a directed edge from v
j
to v
i
, or no edge between v
i
and v
j
) according to the score calculated in Step 2.
To select an edge type between genes v
i
and v
j
, we calculate an edge (v
i
, v
j
) value for each of the three types t using the following:
(4)
where D has edge (v
i
, v
j
). Then we select one edge type that has the highest total value.
When two or more edge types have the highest total value, we use edge scores of the partial problems whose ranks are 2 or more.
Algorithm
Input: V = V1, ..., Vn: a set of genes, GEP: gene expression profiles of V
Output: G
V
: DAG including genes V
Variable: Z: a set of tuples (graph, score, rank)
1: Make a collection of set V that includes all the subsets of V with three elements
2-1: for each U in V do
2-2: Make a collection of set D
u
that includes all the DAGs of U
2-3: for each D in D
u
do
2-4: calculate rank R
D
and score S
D
with GEP
2-5: add (D, S
D
, R
D
) to Z
2-6: end for
2-7: end for
3-1: i ← 1
3-2: repeat
3-3: for each edge between genes (x, y) in D of (D, S
D
, i) do
3-4: add all S
D
of (D, S
D
, i) for each of the three edge types
3-5: if one edge type has the highest total S
D
then
3-6: add an edge between genes (x, y) to G
V
3-7: end if
3-8: if two or more edge types have the highest total S
D
then
3-9: for each edge between genes (x or y, w) in G
V
, where w is a gene ≠ x, y do
3-10: select edge between genes (x, y) from D of (D, S
D
, i), where D includes genes x, y, and w.
3-11: end for
3-12: add edge (x, y) selected in (3-10) with the highest S
D
to G
V
3-13: end if
3-14: end for
3-15: i←i+1
3-16: until directions of all edges in G
V
are assigned
3-17: return G
V
A flowchart of the algorithm can be found in Figure 2.
Computational experiments
To verify the effectiveness of the proposed method, we performed three experiments. The first experiment determines computational time for different numbers of genes. The purpose of this experiment is to verify that the proposed method is able to estimate gene regulatory networks that are as large-scale as those estimated by the GHC method. The second experiment demonstrates that the proposed method is more accurate than the GHC method. The third experiment shows, through an example, that our algorithm works well for inferring real gene regulatory networks. We estimate the networks, including the known gene regulatory network, and compare the network estimated by the proposed method and that by the GHC method.
Implementation, system, and materials
Steps 1 and 2 are implemented using the deal package version 1.2-33 written in R. We use R 2.10.1. Step 3 is implemented using Perl 5.10.1.
The GHC method is implemented in the deal package version 1.2-33. In these experiments, the maximum number of actions, i.e., adding, deleting, or reversing a directed edge, is set at 50 and the number of restarts is set at 0. We call these parameters the default parameter set.
We performed all the experiments on a computer with Intel Core2 Duo 6600 CPU 2.40 GHz processors with 3.0 GB memory. The operation system is Ubuntu 10.04.
We used a dataset of two time-series gene expression profiles including 45102 genes from a mouse adipocyte and osteoblast. The number of time points is 62.
Experiment 1
We verified that the proposed method can estimate gene regulatory networks as large-scale as those estimated by the GHC method. We used the proposed method, an exhaustive search, and the GHC method, and compared the estimation time for from 3 to 70 genes. In this experiment, we selected genes from the gene expression profile from a mouse adipocyte by random sampling. We ran this process 50 times and calculated the mean estimation time. The results are summarized in Figure 3.
In Figure 3, the horizontal axis corresponds to the number of genes and the vertical axis corresponds to the logarithm of the estimation time. The proposed method was able to estimate the network including 70 genes, and the estimation times were almost the same as those of the GHC method. The estimation time of the proposed method was shorter than that of the GHC method for 40 or more genes. The estimation time of the proposed method was longer than that of the GHC method for 15 or fewer genes. The estimation time of the exhaustive search was very large by 5 genes.
Experiment 2
We verified that the estimation accuracy of the proposed method is higher than that of the GHC method for nearly identical estimation times. We compared the estimation results of the exhaustive search with the results of the proposed method and the GHC method. In this experiment, we selected five genes randomly from the gene expression profile 100 times from a mouse adipocyte and osteoblast. We estimated the network of these five genes by the proposed method and the GHC method. There are 59049 DAGs for five genes, and all the DAGs are ranked by the scores of the exhaustive search. The ranking was used to evaluate the networks estimated by the proposed method and the GHC method. The results are listed in Figure 4.
The two bar charts in Figure 4 show the ranks of 100 networks estimated by the proposed method and the GHC method. The left bar chart is the results for adipocyte, and the right are those for osteoblast. The correspondence count is the number of times that the network estimated by the proposed method or the GHC method corresponded with the network of the exhaustive search. The ranking in the exhaustive search is the ranking of the networks estimated by the exhaustive search. The networks are ranked by the scores of the exhaustive search. As there are 59049 DAGs for five nodes, the ranks are from 1st to 59049th.
The correspondence count of the proposed method from the 1st to 10th networks of the exhaustive search exceeded 50. For the correspondence count from the 30001th to the 59049th network of the exhaustive search, the GHC method exceeded 50 and the proposed method was less than 10.
Experiment 3
We used a known gene regulatory network and verified that the proposed method can estimate more accurately than the GHC method with the same or less computational time. We compared the regulations estimated by the proposed method with those of the GHC method. In this experiment, we used 40 genes from the gene expression profile from a mouse adipocyte. Of these, 7 genes are Pparγ and the genes that regulate or are regulated by Pparγ in adipocyte. These are shown in Figure 5(a). The remaining 33 genes were selected by random sampling. The results and known networks are shown in Figure 5. In this experiment, we used two parameter sets for the GHC method. One is the default parameter set. In the other parameter set, the maximum number of actions is 100 and the number of restarts is 10, which will return a better network but requires about 20-fold longer computational time than the default.
In Figure 5, results of the default and other parameter set are shown as networks (b) and (c), respectively. We call (c) the network estimated by the highly accurate GHC method in this experiment. Network (d) is estimated by the proposed method. The edges in networks (b), (c), and (d) are categorized according to the edges of network (a). The red edges are also in network (a), the blue edges have a different direction from those in network (a), and the black edges have no relationship in network (a).
Figure 5 shows that the proposed method was able to estimate more correctly than the GHC method. The sensitivity and selectivity of the proposed method were 33% and 30%, those of the GHC method were 0% and 0%, and those of the high accurate GHC method were 11% and 14%. Networks (b), (c), and (d) have many edges that the known gene regulatory network does not have, but these edges describe indirect regulations. For example, in Figure 5(d), there is a black edge from C/EBPα to Stat 1. The edge describes the indirect regulation from C/EBPα to Stat 1 via Pparγ because there are edges from C/EBPα to Pparγ and from Pparγ to Stat 1 in Figure 5(a).