An uncertain model-based approach for identifying dynamic protein complexes in uncertain protein-protein interaction networks

Background Recently, researchers have tried to integrate various dynamic information with static protein-protein interaction (PPI) networks to construct dynamic PPI networks. The shift from static PPI networks to dynamic PPI networks is essential to reveal the cellular function and organization. However, it is still impossible to construct an absolutely reliable dynamic PPI networks due to the noise and incompletion of high-throughput experimental data. Results To deal with uncertain data, some uncertain graph models and theories have been proposed to analyze social networks, electrical networks and biological networks. In this paper, we construct the dynamic uncertain PPI networks to integrate the dynamic information of gene expression and the topology information of high-throughput PPI data. The dynamic uncertain PPI networks can not only provide the dynamic properties of PPI, which are neglected by static PPI networks, but also distinguish the reliability of each protein and PPI by the existence probability. Then, we use the uncertain model to identify dynamic protein complexes in the dynamic uncertain PPI networks. Conclusion We use gene expression data and different high-throughput PPI data to construct three dynamic uncertain PPI networks. Our approach can achieve the state-of-the-art performance in all three dynamic uncertain PPI networks. The experimental results show that our approach can effectively deal with the uncertain data in dynamic uncertain PPI networks, and improve the performance for protein complex identification. Electronic supplementary material The online version of this article (10.1186/s12864-017-4131-6) contains supplementary material, which is available to authorized users.


Background
Over the past decade, yeast two-hybrid, mass spectrometry and other high-throughput experimental have generated a mass of protein-protein interaction (PPI) data. Such PPI data construct the large-scale PPI networks for many organisms. Great efforts have been made to understand organizational principles underlying PPI networks. Many cellular principles have been uncovered by analysis of these networks, such as the scale-free topology [1], disassortativeness [2] and modularity [3].
A protein complex consists of a group of proteins and multiple PPIs at the same time and place, forming single multi-molecular machinery [4]. Since most proteins are only functional after assembly into protein complexes, protein complexes are critical in many biological processes [5]. Over the past decade, great effort has been made to detect complexes on the PPI networks. The Molecular Complex Detection (MCODE) algorithm proposed by Bader and Hogue is the first time to exploit computational methods to identify complexes based on PPI networks [6]. Markov Clustering (MCL) [7] can use random walks to identify based on PPI networks. Liu et al. [8] propose Maximal Cliques Clustering (CMC) to predict complexes from large PPI networks. Based on the core-attachment structural feature [9], Leung et al. [10] propose CORE algorithm to identify proteincomplex cores by calculating the p-values for all pairs of proteins. Similarly, Wu et al. [11] present COACH algorithm to identify protein complexes, which detects the core structure and attachments of complex respectively. Nepusz et al. [12] propose ClusterONE algorithm which effectively improves the performance to identify the overlapping complexes. Zhang et al. [13] propose CSO algorithm to predict complexes by integrating GO data and PPI networks.
A protein complex is formed by a group of proteins at the same time, which interacted with each other by associated polypeptide chains. However, modeling biology systems as static PPI networks will lose the temporal information. It is necessary to construct dynamic PPI networks for both identifing protein complexes and further understanding molecular systems. Since gene expression data is helpful to analyze the temporal information of proteins, some studies [14][15][16][17][18] have used gene expression data to construct dynamic PPI networks and reveal the dynamic character of PPI networks. For example, Faisal et al. [14] predict human aging-related genes by integrating aging-related gene expression data with human PPI data. Wang et al. [15] construct dynamic PPI networks and detect complex by exploiting gene expression data and PPI data.
Another issue in complexes identification is PPI networks contain much noise data including false positive and false negative rates [16]. Some studies have been proposed to improve the reliability of PPI networks [17]. Using uncertain graph model to deal with such PPI networks is more reasonable than traditional graph model. Uncertain model have been applied to analyze social networks, electrical networks and biological networks. Recently, Zhao et al. [18] use uncertain model to detect protein complexes in static PPI networks. Nonetheless, few studies apply uncertain model to analyze dynamic PPI networks.
In this study, we firstly construct dynamic uncertain PPI networks (DUPN) by integrating gene expression and PPI data. The active time point and the existence probability of each protein is calculated based on gene expression data. The existence probability of each PPI is calculated based on the topological property of high-through PPI data. We then attempt to use uncertain graph model to identify the protein complexes in DUPN, and propose a clustering algorithm named CDUN. Finally, we evaluate our method in different datasets and the experimental results show that our method achieves the state-of-the-art performance for complex identification.

Methods
In this section, we introduce how to integrate the gene expression data with the PPI data to construct the DUPN, and then describe the clustering algorithm CDUN for identify protein complexes based on the DUPN in details.

Active time points and probability of proteins
In a living cell, proteins and PPIs are not static but changing over time [19]. The gene expression is useful to analyze the temporal information of the proteins. In recent years, some studies [15,20,21] have use gene expresstion data to calculate the active time points of proteins.
The gene expression data consist of n time point profiles. Let G i (p) denote the gene p expression value at i time point. Let α(p)and σ(p)be the arithmetic mean and the standard deviation (SD) of Gi(p), respectively.
In this study, we use the Eqs. (3) and (4) to calculate protein active probability at the different time points.
We use the Eq. (3) to calculate the k-sigma (k = 1,2,3) threshold for the gene p. Ge_thres k is determined by the values of α(p),σ 2 (p)and k (the times of sigma). If σ 2 (p)is very low, it indicates that the fluctuation of the expression curve of gene p is also very small and the value of G i (p) tends to be very close to α(p). In this case, the value of Ge_thresh k is close to α(p). If σ 2 (p)is very high, it indicates much noise in the gene expression data of the gene p. In this case, the value of Ge_thresh k is close to α(p) + k · σ(p). In the Eq. (3), the range of k (the times of sigma) is in (0, 3), and 3 is the maximum times of sigma. The larger k is, the higher Ge_thresh k gets. A higher value of Ge_thresh k indicates that using more strict rules to identify the active time point of a protein [20].
We use the Eq. (4) to calculate the active probability of a protein in the i time point. Thus, the protein active probability contains four levels (0.99, 0.95, 0.68 and 0) based on the sigma rules (P{|X-α| < σ} ≈ 0.6827, P{|X-α| < 2σ} ≈ 0.9545 and P{|X-α| < 3σ} ≈ 0.9973) [15,20]. Figure 1 shows an illustration example of the DUPN construction. Firstly, we use the PPI data to construct the static PPI networks in Fig. 1a. Secondly, we use gene expression data to calculate the active time points and the probability of each in Fig. 1b. In this study, the active probability only include three values P1 = 0.99, P2 = 0.95 and P3 = 0.68 based on the Eq. (4). Although a PPI imply physical contact between two proteins, it does not mean that the interaction occur in a cell at any time [22]. The real PPI networks are changing during the lifetime of a cell, because the active time points of proteins are different. Thirdly, we can inject the static PPI networks into a series of PPI subnetworks based on the dynamic information of the proteins in Fig. 1c. These PPI subnetworks associated with the different active time points construct a dynamic PPI network. All proteins in the PPI subnetworks Ti are active with an active probability at Ti time point. Finally, we assign an uncertain value to each protein and PPI in the dynamic PPI networks to construct the DUPN in Fig. 1d. In this way, we can distinguish the uncertain level of both protein and PPI in the DUPN. The existence probability of each protein is the active probability calculated based on Eq. (4). Zhao et al. [18] proposed a method to calculate the existence probability of PPI based on the topology structure of the PPI networks. In this study, we use the same method to calculate the existence probability of each PPI on the  Fig. 1d based on the topology structure of the PPI subnetworks in the Fig. 1c. The existence probability between the two proteins v j and v k is defined as follows:

Construction of DUPN
where N j and N k are the sets consisting of all neighbors of v j and v k at Ti time point in Fig. 1c, respectively. Our method to construct DUPN is different from the work [18]. In the DUPN, we assign an uncertain value to each protein and PPI, which can distinguish the uncertain level of each protein and PPI in the dynamic PPI networks.

Uncertain graph model
A static PPI network generally can be modeled as G is the function that assigns a probability of existence to each protein and P Ti E : E Ti → [0,1] is the function that assigns a probability of existence to each PPI at T i time point.
To deal with uncertain data, some uncertain graph models and theories [18,23,24] have been proposed to analyze social networks, electrical networks and biological networks and so on. In this study, we assume the probabilities of proteins and PPIs are independent. Let . The instantiation is a deterministic network with an observing probability. We denote the relationship between G ' j and UG Ti as UG Ti G ' j . The probability of Pr(G ' j ) is given as follows: The Eq. (6) gives a probability distribution over all instantiations of the uncertain PPI network UG Ti at T i time point. Based on the Eq. (6), if an uncertain PPI network UG Ti consists of n instantiations {G ' In an uncertain PPI network, identifying protein complexes has to take into account all possible instantiations {G ' 1 , G ' 2 ,…, G ' n } that are associated with the probabilities defined in Eq. (6).
is the probability associated with instantiation G ' j ∈PG Ti . Given a set of protein vertices in UG Ti , V S V Ti , the expected density of V S is defined as follow: where h j is the number of PPIs among the proteins of . Given a set of protein vertices V S ⊂ V Ti , a protein vertex v a ∈V Ti and v a ∉V S , the attached score between v a and V S in the UG Ti , is given as follows: where m j is the number of PPIs between v a and V S in the instantiation G ' j . As the uncertain graph model, an uncertain PPI network can generate a large amount of different possible instantiation. According to the Eqs. (7) and (8), the computational complexity is very high in an uncertain PPI network. Based on the studies [18,24], the Eqs. (7) and (8) can be efficiently calculated by the Eqs. (9) and (10), respectively.
Thus, based on the uncertain graph model, we can use the Eqs. (9) and (10) to efficiently calculate the expected density and the attached score for protein complex identification in an uncertain PPI network, respectively.

The CDUN algorithm
Some studies has revealed the complex core-attachment organization [25]. A protein complex generally contains of a core structure and some attachment proteins. In the core structure, the proteins share high functional similarity, which are highly co-expressed [9]. The attachment proteins assist the core proteins to perform subordinate functions. Based on the core-attachment structure of protein complexes, the CDUN algorithm identifies protein complexes from all the uncertain PPI networks of a DUPN in turn. Algorithm 1 shows the pseudo-codes of the DUPN algorithm.
CDUN algorithm consists of two phases. CDUN firstly detects candidate protein complexes from all UG Ti ∈DG in turn at line 1-31. The candidate complexes are added into Candidate_complex set. Then, CDUN removes the highly overlapped protein complexes from Candidate_complex at line 32-44, based on their ED value.
In the first phase, CDUN firstly calculates the expected density of all edges in UG Ti based on Eq. (9) at line 4-5.
ED ({u,v}, UG Ti ) denotes the expected density of the edge between u and v. The edge will be added into Seed_set, if its expected density is not less than Core_thresh that is a predefined threshold parameter. The effect of Core_thresh is discussed in The effect of Core_thresh section. Average expected density of all edges is calculated at line 10. Secondly, CDUN augments each seed to generate the core structure at line 11-20. If the ED value of the core structure is not less than Core_thresh, CDUN will add the neighbor protein p into the core structure at line 25-28. We use the same parameter (Core_thresh) in lines 7 and 16 to keep the expected density of both the seeds and the core structures are not less than the Core_thresh. Finally, CDUN detects the attachment proteins for each core structure based on the AS score that is calculated by Eq. (10), and adds the attachment proteins into each core structure to form the candidate complex set Candidate_all at line 22-30.
The candidate protein complexes in Candidate_all are identified from all UG Ti ∈DG, which generally overlap with each other. In the second phase, CDUN calculates the ED value of all candidate protein complexes in line 32-34. We rank the candidate complexes in descending order of the ED value (Candidate_list = (cc 1 , cc 2 ,…, cc n )) at line 35. The candidate complex with highest ED value in will be removed from Candidate_list and added into Complex_set. CDUN checks the overlapped degree between cc i ∈Candidate_list and cc 1 . CDUN will removes cc i from Candidate_list at line 39-42, if the overlapped degree is larger than the Overlap_thresh. In our experiments, we set the Overlap_thresh as 2/3. The above steps will be repeated until Candidate_list is empty and the final complex set Complex_set is generated.

Datasets
The PPI datasets used in our experiments are the DIP [26], MIPS [27] and STRING [28] datasets, respectively. The PPI data of STRING dataset are from biomedical literature data, high-throughput data, genomic context data and co-expression data. Table 1 lists the statistics of the dataset in our experiments.
We download the gene expression data GSE3431 [29] from Gene Expression Omnibus, which involves 36 different time intervals. The GSE3431 consists of 3 cycles and each cycle is 12 time intervals. We calculate the average value at 12 active time points for each gene To evaluate the protein complexes identified by our method, the gold standard data are CYC2008 [30]. The CYC2008 benchmark consist of 408 protein complexes, which includes some complexes of size 2. In some cases, it is hard to evaluate the performance of the methods by using the complexes of size 2. Therefore, we use 236 complexes of size more than 2 in the CYC2008 to evaluate the complexes identified in the experiments.

Evaluation metrics
Overall, most of the complexes identification methods use two type of evaluation metrics to evaluate the performance of complexes prediction [19]. One type of evaluation metrics are precision, recall and F-score. The other type are sensitivity (Sn), positive predictive value (PPV) and accuracy.
Let P(V P , E P ) be an identified complex and B(V B , E B ) be a known complex. The neighborhood affinity score NA(P,B) between P(V P , E P ) and B(V B , E B ) is defined as follows: In most studies of complex prediction, the P(V P , E P ) is considered as matching the B(V B , E B ) if NA(P,B) is larger than 0.2 [16]. In our experiments, we use the same threshold of NA(P,B).
Precision, recall and F-score are used to evaluate of our experimental results, which are defined as follows: where N ci and N cb are the number of detected protein known complexes by our method, respectively. Identi-fied_Set and Benchmark_Set denote the set of complexes identified by our method and gold standard dataset, respectively. In additional, we also report Sn, PPV and accuracy in our experiments. The definitions of Sn, PPV and accuracy are described in the study [16].

The effect of Core_thresh
In this experiment, we evaluate the effect of the threshold parameter Core_thresh on the performance of CDUN.
The Core_thresh determines not only the number of the seeds in the Seed_set, but also the expected density of the core structures generated from the seeds. We use DUPN_DIP to evaluate the effect of Core_thresh. Table 2 shows the detailed experimental results of Core_thresh ranged from 0 to 1. It can be seen that when Core_thresh takes from 0 to 1, the number of complexes identified by our method decreases constantly. When Core_thresh = 0, CDUN can identify 763 protein complexes on the DUPN_DIP. It indicates that too many seeds are generated due to the value Core_thresh is too small. When Core_thresh = 1.0, CDUN cannot identify any complexes on the DUPN_DIP. It indicates that no seeds can be generated due to the value Core_thresh is too large. Overall, with the increase of Core_thresh, the precision and PPV are increased, and the recall, Sn and Accuracy are. The F-score of CDUN ranges from 0.246 to 0.575. When Core_thresh is set as 0.4, the major metrics F-score achieves the highest value of 0.575.
The '#Complexes' refers to the number of identified complexes with different Core_thresh. The highest value in each row is in bold

Comparison with other methods
Then, we compare CDUN with other complex identification methods: CSO [13], Cluster ONE [12], COAN [17], CMC [8], COACH [11], HUNTER [31], MCODE [6], Transitivity Clustering method (TransClust) [32] and Spectral Clustering method (SpecClust) [33]. We test these methods on all three static PPI networks DIP, MIPS and STRING, respectively, and choose the optimal parameters. CDUN is performed on the DUPN_DIP, DUPN_-MIPS and DUPN_STRING, respectively. The Table 3 lists the comparison results using CYC2008 as the benchmark. Firstly, we use DIP dataset to compare the performance of complex detection methods. From Table 3, it can be seen that CDUN and CSO and COAN achieve the F-score of 0.575, 0.553 and 0.486, respectively, which significantly outperforms other methods. Both CSO and COAN exploit the GO data, which contain much valuable information related to protein complexes curated by experts. However, CDUN can achieve the highest Fscore of 0.575 without integrating GO annotation data. HUNTER achieves the highest precision of 0.852. Trans-Clust achieves the highest recall of 0.674, Sn of 0.622, PPV of 0.725 and accuracy of 0.672, respectively. But the precision of TransClust is only 0.13, which leads to a low F-score of 0.218.
Secondly, we use MIPS dataset to compare these methods. On MIPS dataset, CDUN achieves the highest F-score of 0.377, which are superior to other methods. HUNTER achieves the highest precision of 0.538. Trans-Clust achieves the highest recall of 0.623, Sn of 0.544, PPV of 0.71 and accuracy of 0.621, respectively.
Thirdly, we use STRING dataset to compare these methods. STRING dataset is much larger than other two datasets. This makes more difficult for protein complex identification on STRING dataset than other two datasets. From Table 3, we can see that CDUN achieve the For instance, Cluster ONE achieves a very low F-score of 0.188 on STRING dataset, which is much lower than on other datasets. This is mainly because the STRING PPI network is much more complex than the PPI networks constructed by other datasets. In addition, STRING dataset integrates PPIs not only from highthroughput experiments, but also from biomedical literatures, co-expression data, genomic context data. The multiple source data generally lead to more noise data in STRING dataset. These noise data also have impact on the performance of protein complex identification methods. Compared with other methods, CDUN integrates gene expression data and STRING dataset to construct DUPN_STRING which consists of 12 uncertain PPI subnetworks, {UG T1 , UG T2 , …, UG T12 . Then, CDUN identify the complexes from such uncertain PPI subnetworks. Eventually, CDUN achieve a high F-score of 0.537 on STRING dataset. We also note that CDUN does not achieve high recall and accuracy in some cases. For instance, CDUN only achieve accuracy of 0.526 and 0.387 on DIP and MIPS dataset, respectively. In the future work, we will try to improve the recall and accuracy of our method further.
In additional, we compare CDUN with DCU [21] on the DIP dataset. In the study [21], the DCU method was evaluated using all the 408 complexes in the CYC2008. Therefore, we also compare CDUN with DCU using all the 408 complexes of CYC2008. The comparison results are listed in the Table 4. It can be seen that CDUN achieves higher precision and F-score than DCU on DIP dataset.

The significance of the identified complexes
In this experiment, we use GO data to evaluate biological significance of the identified complexes. The GO classifies gene product functions along biological process, molecular function and cellular component. SGD's GO::TermFinder [34] is used to calculate the pvalue of an identified complex with respect to GO data in our experiment. If the p-value is less than 0.01, we consider the identified complex to be statistically significant. In Table 5, We calculate the proportion of identified protein complexes with p-value less than 0.01 on the three PPI datasets.
An study of cdc28-cyclin complexes identified by CDUN Our method can identify many protein complexes, as well as their active time points. The cellular systems are highly dynamic and responsive to cues from the environment. These dynamic complexes results are more valuable to reveal the cellular function and organization than the static complexes results. In Fig. 2, we present an example to illustrate this.
The PPI networks including the 10 proteins have extracted from MIPS dataset in Fig. 2. The PPI networks don't contain YGR109C and YPR120C, because there are no PPIs between the two proteins YGR109C and YPR120C with the other eight proteins in MIPS dataset. It is very difficult to identify the multiple Cdc28-cyclin complexes only based on the topology structure of PPI networks. Our method can use gene expression data to calculate the dynamic information of these proteins, which also have been listed in Fig. 2. From the protein dynamic information, we can see that these proteins manly are active at T2, T9, T10 and T11. For instance, YBR160W, YGR108W, YDL155W, YMR199W and YLR210W are active at T2 together. Then, our method constructs DUPN_MIPS based on PPI networks and protein dynamic information. Eventually, Cdc28-cyclin complex 1, 2 and 3 are identified from UG T2 , UG T9 , UG T10 andUG T11 by CDUN, which all matched in CYC2008 dataset.
From Fig. 2, we can see that the three different protein complexes are overlapped each other in the static PPI networks. Since our method constructs the DUPN, CDUN can effectively identify the three Cdc28-cyclin complexes. Furthermore, our method can identify the  active time points of the three Cdc28-cyclin complexes.
Cdc28-cyclin complex 1 and 2 are associated with T2 and T9, respectively. Cdc28-cyclin complex 3 is associated with T10 and T11. The experimental results reveal the dynamic property of Cdc28-cyclin complexes in the cellular systems. Firstly, the kinase catalytic subunit, YBR160W, associated with YGR108W, YDL155W, YMR199W and YLR210W to construct the Cdc28cyclin complex 1 at T2. Then, the kinase catalytic subunit, YBR160W, associated with YAL040C to construct the Cdc28-cyclin complex 2 at T9. Finally, YBR160W associated with YPL256C and YMR199W to construct the Cdc28-cyclin complex 3 at T10 and T11.