A seed-extended algorithm for detecting protein complexes based on density and modularity with topological structure and GO annotations

Wang, Rongquan; Wang, Caixia; Sun, Liyan; Liu, Guixia

doi:10.1186/s12864-019-5956-y

Methodology article
Open access
Published: 07 August 2019

BMC Genomics volume 20, Article number: 637 (2019)

Abstract

Background

The detection of protein complexes is of great significance for researching mechanisms underlying complex diseases and developing new drugs. Thus, various computational algorithms have been proposed for protein complex detection. However, most of these methods are based on only topological information and are sensitive to the reliability of interactions. As a result, their performance is affected by false-positive interactions in PPINs. Moreover, these methods consider only density and modularity and ignore protein complexes with various densities and modularities.

Results

To address these challenges, we propose SE-DMTG, a Seed-Extended algorithm based on Density and Modularity with Topological structure and GO annotations, to improve the accuracy of protein complex detection in PPINs. First, we use common neighbors and GO annotations to construct a weighted PPIN. Second, we define a new seed selection strategy to select seed nodes. Third, we design a new fitness function to detect protein complexes with various densities and modularities. We compare the performance of SE-DMTG with that of thirteen state-of-the-art algorithms on several real datasets.

Conclusion

The experimental results show that SE-DMTG not only outperforms some classical algorithms in yeast PPINs in terms of the F-measure and Jaccard but also achieves an ideal performance in terms of functional enrichment. Furthermore, we apply SE-DMTG to PPINs of several other species and demonstrate the outstanding accuracy and matching ratio in detecting protein complexes compared with other algorithms.

Background

A protein complex is a group of proteins that interact with each other to perform different cellular functions [1]. The detection of protein complexes from protein-protein interaction networks (PPINs) plays an important role in understanding cell function in the proteomics era. Specifically, protein complexes contribute to the study of protein interaction networks [2], protein function, diseases [3], etc. They help researchers to study the causes of various diseases and to develop new drugs, and research on protein complexes helps to analyze the different stages of diseases [4]. Current studies have shown that disease genes tend to be highly connected among themselves in disease networks. These highly connected subgraphs could be disease protein complexes, and investigating the cause and effect of these complexes in disease networks could help to define the search space for bioinformaticians, enhance the analysis process [5, 6] and help medical researchers to design new drugs. As a result, the detection of protein complexes plays an indispensable role in the study of complex diseases.

During the past decade, because of the development of high-throughput techniques such as yeast two-hybrid [7], mass spectrometry [8], and protein chip technologies [9], the number of available PPINs has rapidly increased, and these PPINs have been collected in different public databases. In general, a set of PPIs can be naturally represented in the form of a network, which not only provides a panoramic view of PPIs on a proteomic scale but also helps us to understand the basic organization of the cell machinery based on the whole network. How to use PPINs to analyze biological systems remains a meaningful task [10]. Although most PPINs are incomplete and inaccurate [11, 12], they reveal biological processes and inherent organizational structures within cells [13–15]. How to accurately discover protein complexes is a main subject in biology and bioinformatics. In biology, some experimental methods have been designed to detect protein complexes, including TAP-MS [16], Co-IP [17–19] and the two-hybrid system [13, 20]. However, experimental methods have their own shortcomings; for example, they are time-consuming, relatively expensive and inefficient. Thus, the use of computational algorithms to improve the effectiveness of protein complex detection in PPINs is appealing.

To overcome these experimental constraints, various computational methods have been developed to improve the effectiveness of protein complex detection in PPINs. Some researchers have shown that a protein complex in a PPIN is a molecular structure defined by both function and structure [21]. Furthermore, some related empirical studies on PPINs also support this point and indicate that modular components do exist in these networks [22]. These results have two implications: one is that, from the perspective of network topology, these modules are composed of closely related proteins that could have many common neighbors; the other is that, in terms of biology, proteins in the same module perform similar functions together. Thus, many researchers believe that proteins in the same complex generally implement the same or similar functions and tend to interact with each other [23]. A PPIN is usually modeled as an undirected graph, where the nodes represent proteins and the edges correspond to protein-protein interactions. Therefore, protein complexes can be detected by mining the modular structures (i.e., dense subgraphs or subnetworks) from PPINs [24]. Based on this idea, the problem of detecting protein complexes in PPINs can be computationally addressed via graph clustering methods, where the resulting subgraphs or clusters are considered to be protein complexes. Herein, clustering consists of grouping nodes into groups (also called clusters or communities) such that the nodes in the same cluster are more similar to each other than to the nodes in other clusters [25]. Therefore, to overcome the disadvantages of the experimental methods, a series of graph clustering algorithms based on machine learning and data mining have been developed as a complementary choice for detecting protein complexes.

Related work

Up to now, a variety of computational algorithms for detecting protein complexes have been proposed. We first give a brief classification of related work. The methods mainly include approaches based on cliques or dense subgraphs, approaches based on core-attachment structure, approaches based on hierarchical clustering, approaches based on models, and approaches based on supervised learning. We discuss these methods in the following sections.

Approaches based on cliques or dense subgraphs

A large number of existing algorithms assume that protein complexes correspond to k-cliques or highly dense subgraphs. Thus, in the past decade a series of algorithms based on cliques or dense subgraphs have been proposed for detecting protein complexes from PPINs, and many protein complex detection algorithms still belong to this category. For example, Adamcsek et al. [26] provide an application called CFinder that finds k-clique percolation clusters as protein complexes in PPINs. Another example is CMC [27], which first mines the maximal cliques from a weighted PPIN and then removes or merges highly overlapping maximal cliques. However, methods of this kind require a protein complex to be a k-clique or a clique. Consequently, some researchers try to discover dense subgraphs in a PPIN by using a heuristic search strategy. For instance, MCODE [28], one of the earliest methods of this kind, detects protein complexes as high-density subgraphs in a PPIN by using a seed-extension method. Several years later, Altaf-Ul-Amin et al. [29] propose DPClus; unlike MCODE, DPClus detects dense subgraphs as protein complexes based on the concepts of density and periphery. Following DPClus, Li et al. [30] present an improved clustering algorithm called IPCA, which is based on diameter and density. Several years later, a fast, memory-efficient clustering algorithm, SPICi [31], is presented. It uses density and a support function to cluster large networks.

In fact, approaches based on cliques or dense subgraphs are effective at detecting k-cliques or high-density protein complexes, but they fail to detect either sparse subgraphs or relatively peripheral proteins. How to tackle these challenges will be an emphasis of further study.

Approaches based on core-attachment structure

Most approaches based on cliques or dense subgraphs rest on the assumption that highly connected subgraphs may be protein complexes, but these methods ignore the inherent organization of protein complexes. Gavin et al. [14] have demonstrated that protein complexes consist of a core and some attachments: proteins in the core are highly interconnected, whereas attachments or protein modules often interact sparsely with their core and assist it in performing subordinate functions. Employing this core-attachment structure, some outstanding detection algorithms have been developed. They mainly have two stages: the first stage identifies all dense subgraphs and takes them as protein complex cores, and the second stage extends each complex core by adding peripheral proteins to it. For example, Wu et al. [32] develop the algorithm named COACH, which first mines dense subgraphs as protein complex cores and then identifies peripheral proteins; the peripheral proteins are then combined with their protein complex core to form a protein complex. Recently, Peng et al. [33] propose another algorithm called WPNCA, which uses the PageRank-Nibble algorithm and the core-attachment structure. Experimental results show that WPNCA is superior to other state-of-the-art algorithms in detecting complexes.

Generally speaking, complexes identified with core-attachment structures tend to be large, whereas real protein complexes are smaller. This is a direction for further research.

Approaches based on model

Up to now, approaches based on models have been very popular in protein complex detection because they show excellent performance. Unlike most of the algorithms mentioned above, these approaches focus predominantly on finding a relational model or graph pattern to predict protein complexes, which offers a new way to discover protein complexes. Markov clustering (MCL) [34] is one of the most popular models; it uses a random walk strategy in a PPIN and has two basic operators called expansion and inflation. MCL can tolerate more noise than other types of algorithms. However, its result depends on the inflation parameter, and it does not detect overlapping protein complexes. In fact, overlapping protein complexes make up a large proportion of protein complexes. Based on this fact, Nepusz et al. [35] introduce a novel method called ClusterONE to predict overlapping protein complexes. ClusterONE introduces, for the first time, a cohesiveness measure (also called graph modularity) to assess the quality of protein complexes. On the basis of ClusterONE, we introduced CALM [36], an improved method for detecting protein complexes. CALM first identifies overlapping nodes and seed nodes by calculating node degree and betweenness, and then uses a greedy local search approach based on core-attachment and local modularity structure to produce the detected protein complexes.

Although the algorithms based on models perform well in the detection of protein complexes, their accuracy needs to be improved by employing network topological features. For example, they could take multiple network topological properties or biological information into account.

Approaches based on hierarchical clustering

Recently, due to the tree-like form [37] of PPINs and the modular nature [38] of biological networks, some traditional hierarchical clustering algorithms have been used to detect protein complexes in PPINs. The major difference among them is how the hierarchical structure is constructed. More specifically, the key is how to measure the similarity of nodes. Next we introduce some representative algorithms.

Generally, traditional hierarchical clustering algorithms cannot be used directly in PPINs with false positives. To overcome this challenge, Li et al. [39, 40] propose a new fast hierarchical algorithm for identifying protein complexes, named FAG-EC, which is based on edge clustering coefficients and the λ-module. Wang et al. modify FAG-EC and propose HC-PIN [41] to identify overlapping and hierarchical functional modules in a PPIN.

In summary, approaches based on hierarchical clustering provide a global perspective on the hierarchical modular organization of a PPIN. Moreover, they are easy to implement and understand. However, most of them cannot identify overlapping clusters and are sensitive to noise in the PPINs [42]. Thus, their accuracy is limited, and in practice their performance is deficient in some cases.

Approaches based on supervised learning

The aforementioned computational clustering algorithms for finding protein complexes are unsupervised. Each of these unsupervised algorithms considers only one of the multiple topological structures of protein complexes and does not use the known complexes; thus, they may ignore complexes with other types of topological structure.

To tackle this defect, and with the development of supervised learning algorithms, some researchers utilize the information of known complexes to detect protein complexes from PPINs. Supervised learning algorithms generally contain three main steps: (1) extract useful features from the known complexes; (2) train a supervised model to distinguish real complexes from random subgraphs based on the extracted features; (3) detect protein complexes from the PPINs by using the trained model as the fitness evaluation function. So far, ClusterEPs [43] is the best among them. It uses emerging patterns to measure the likelihood that a subgraph is a complex.

Unfortunately, there is no appropriate feature selection method, and PPINs always contain a considerable amount of noise. Moreover, the number of known protein complexes available for training is too small. These disadvantages make the trained model imprecise [44]. Meanwhile, some features are often tied to the specific PPIN from which they are extracted, so they may not be universal. As a result, performance could decrease [45]. Therefore, overcoming these issues is critical for further improving the accuracy of protein complex detection.

Our work

The above algorithms have been shown to detect protein complexes effectively. Our method builds on the strengths and weaknesses of the related work and on the fact that high-throughput PPINs are noisy and incomplete. Furthermore, proteins in the same protein complex generally possess high functional similarity and share more common neighbors. In this paper, we therefore first integrate both common neighbors and GO annotations to construct a weighted PPIN. According to some evidence and research [30, 35, 46], density-based and modularity-based algorithms have outstanding performance in PPINs. Thus, we define a new model to quantitatively assess candidate protein complexes by considering both the density and the modularity of a subgraph, and we propose a new graph clustering method based on the seed-extension approach, named SE-DMTG, to detect protein complexes of various densities and modularities. In this process, we grow each seed node into a subgraph until the subgraph is a locally optimal cluster. We then remove redundant detected complexes and treat the remaining complexes as the final identified protein complexes. Finally, to validate the performance of SE-DMTG, we apply it to PPINs of three different species and compare the results, in terms of the F-measure and Jaccard, with those of some representative state-of-the-art algorithms by using several known protein complex datasets that are widely used in biological experiments. The experimental results demonstrate that SE-DMTG outperforms the other competing algorithms in terms of accuracy and matching with known complexes. In addition, the identified protein complexes are subjected to functional enrichment analysis to ascertain their biological significance.
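The idea of growing a seed node into a locally optimal cluster can be illustrated with a generic greedy expansion. This is only a sketch under stated assumptions: the fitness used here (edge density multiplied by the fraction of incident edges that stay inside the cluster) is a hypothetical stand-in and not the SE-DMTG fitness function, the graph is unweighted, and the names `fitness` and `grow_seed` are illustrative.

```python
def fitness(cluster, adj):
    """Hypothetical stand-in fitness: density times an inside-edge ratio."""
    n = len(cluster)
    if n < 2:
        return 0.0
    # Each internal edge is seen from both endpoints, hence the division by 2.
    inside = sum(1 for u in cluster for v in adj[u] if v in cluster) / 2
    boundary = sum(1 for u in cluster for v in adj[u] if v not in cluster)
    density = 2 * inside / (n * (n - 1))
    ratio = inside / (inside + boundary) if inside + boundary else 0.0
    return density * ratio


def grow_seed(seed, adj):
    """Greedily add the neighbor that most improves the fitness.

    Stops when no single addition improves the score, i.e. when the
    cluster is locally optimal with respect to node additions.
    """
    cluster = {seed}
    current = fitness(cluster, adj)
    while True:
        frontier = {v for u in cluster for v in adj[u]} - cluster
        best_node, best_score = None, current
        for v in sorted(frontier):  # sorted for deterministic tie-breaking
            score = fitness(cluster | {v}, adj)
            if score > best_score:
                best_node, best_score = v, score
        if best_node is None:
            return cluster
        cluster.add(best_node)
        current = best_score


# Toy graph: a triangle A-B-C with a pendant node D attached to C.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
grown = grow_seed("A", adj)
```

On this toy graph the expansion keeps the triangle and leaves out the pendant node, because adding it lowers the density more than it raises the inside-edge ratio. A full method would also consider removing nodes and would use edge weights.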

Results

Protein-protein interaction dataset selection

For performance testing, we carry out all the experiments on PPINs of three species: S. cerevisiae (Yeast), Homo sapiens (Human) and Mus musculus (Mouse). For yeast, we mainly test three real yeast PPINs: Krogan core [15], DIP [55] and combined6, where combined6 [27] is generated from six individual experiments, including interactions characterized by mass spectrometry technique (2002) [56], Gavin et al. (2002, 2006) [14, 57] and Krogan et al. (2006) [15], and interactions produced using two-hybrid techniques [7, 13]. For human, we use two PPINs: DIP (version Hsapi20170205 on 9/5/2019) [58] and a combined dataset from HPRD (Human Protein Reference Database, 7/2010) [59] and BioGRID (version 3.2.109) [60], namely HPRD+BioGRID, which is downloaded from Ref [61]. For mouse, the PPIN of Mus musculus is also obtained from BioGRID (version 3.5.172) [62]: we download the BioGRID Mus musculus file (BIOGRID-ORGANISM-Mus_musculus-3.5.172.tab.txt), and then we extract the related mouse file (Biogrid UNIPROT.tab.txt, 14/5/2019). Note that we use the unweighted PPINs to test all algorithms, and we remove all self-connecting interactions and repeated interactions. Detailed information on these datasets is listed in Table 2.
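The cleaning step mentioned above (dropping self-connecting and repeated interactions from an unweighted PPIN) can be sketched as follows. This is a minimal illustration rather than the authors' code; the function name `clean_ppin` and the toy protein identifiers are hypothetical.

```python
def clean_ppin(interactions):
    """Return a sorted list of unique undirected edges without self-loops.

    `interactions` is an iterable of (protein_a, protein_b) pairs.
    """
    edges = set()
    for a, b in interactions:
        if a == b:
            continue  # drop self-connecting interactions
        # An undirected edge (a, b) equals (b, a); keep one canonical order
        # so that repeated interactions collapse into a single entry.
        edges.add((a, b) if a < b else (b, a))
    return sorted(edges)


raw = [
    ("YAL001C", "YBR123C"),
    ("YBR123C", "YAL001C"),  # repeated interaction, reversed order
    ("YAL001C", "YAL001C"),  # self-connecting interaction
    ("YBR123C", "YDR362C"),
]
cleaned = clean_ppin(raw)
```

Storing each edge in a canonical order is a design choice that makes duplicate detection independent of the order in which the two proteins are listed in the source file.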

Table 1 Summary of metrics or scores

Table 2 Statistics on the used datasets of PPINs

Protein complexes selection

To evaluate the performance of different protein complex detection algorithms, we need sets of known complexes as references. For yeast, we employ two known protein complex sets, namely CYC2008 [63] and SGD [64], as standard complexes to evaluate the quality of the protein complexes identified by various algorithms in yeast PPINs. In particular, CYC2008 is constructed from three sources, i.e., 1) MIPS [65], 2) Aloy et al. [66], and 3) the SGD database [67]. For human, we use two sets of standard complexes: the CORUM complexes [68] and the CGPK complexes [61]. The latter are constructed from four sources, i.e., (1) the Comprehensive Resource of Mammalian protein complexes (CORUM) [68]; (2) protein complexes annotated by GO [69]; (3) the Proteins Interacting in the Nucleus database (PINdb) [70] and (4) KEGG modules [71]. For mouse, we use the CORUM complexes [68]. Following the work done by Nepusz et al. [35], we further eliminate those protein complexes that are made up of fewer than three proteins and discard some redundant protein complexes. Finally, the remaining known protein complexes in these databases are used for performance evaluation. A summary of these standard protein complexes is presented in Table 3.
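The filtering of reference complexes can be sketched as below. This is an illustration under assumptions: the text does not state how redundancy is decided, so the sketch treats only complexes with identical membership as redundant, and the function name `filter_reference_complexes` is hypothetical.

```python
def filter_reference_complexes(complexes, min_size=3):
    """Keep complexes with at least `min_size` proteins, dropping duplicates.

    Assumption: two complexes are redundant only when they contain exactly
    the same proteins. The original criterion may be broader.
    """
    seen = set()
    kept = []
    for members in complexes:
        proteins = frozenset(members)
        if len(proteins) < min_size:
            continue  # fewer than three proteins: not used for evaluation
        if proteins in seen:
            continue  # identical membership already kept
        seen.add(proteins)
        kept.append(proteins)
    return kept


reference = [["A", "B"], ["A", "B", "C"], ["C", "B", "A"], ["D", "E", "F", "G"]]
kept = filter_reference_complexes(reference)
```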

Table 3 Statistics of the gold standard complexes we use

Preprocessing

For yeast, we directly use the protein name to represent the proteins in the PPIN and protein complexes. For human and mouse, the PPINs and standard protein complexes come from different sources and are heterogeneous in many aspects. Therefore, we use the UniProt ID [72] to represent each protein in this study. As a result, we have a uniform way to represent proteins for both the PPINs and the standard protein complexes. In the process, we remove all duplicate interactions and all proteins that have no associated UniProt accession ID.

Gene Ontology (GO) selection

As for the Gene Ontology (GO) file, for yeast we use the GO slims, a cut-down version of GO that contains a subset of the terms in the whole yeast GO. Since the GO slims of CC include some protein complex information, we use only the GO slims of BP and MF as GO annotations. The GO slim information is downloaded from the website (https://www.yeastgenome.org/). Similarly, for human and mouse, we annotate each protein with its associated Biological Process (BP) and Molecular Function (MF) GO annotations based on UniProt [72] (available at https://www.uniprot.org/), from which we download the mapping files.

Evaluation metrics

This section introduces the evaluation metrics used in this paper. These metrics calculate the degree of matching between the complexes identified by different algorithms and the standard complexes. Generally, the values of these metrics fall in the interval between 0.0 and 1.0. The higher the value, the better the quality of the clustering results and the better the performance of the detection algorithm.

1) Precision, Recall, and F-Measure: To evaluate the performance of all algorithms, we match generated complexes with known complexes. First, we introduce the overlap score (OS) between the identified protein complexes and known complexes, which is presented as follows [73]:

$$ OS(p,g)=\frac{|N_{p}\cap N_{g}|^{2}}{|N_{p}|\cdot |N_{g}|} $$

(1)

Here, $|N_{p}|$ is the size of the detected complex, $|N_{g}|$ is the size of the known complex, and $|N_{p}\cap N_{g}|$ is the number of proteins shared by the detected and known complexes. If OS(p,g)≥ω, we consider p and g to match each other. In our experiment, we set ω=0.2, which is consistent with previous studies [28,29].

After the overlap score (OS) has been defined, we can give the definitions of Precision, Recall, and F-measure as follows [74]:

$$ \text{F-measure}=\frac{2\times \text{Precision}\times \text{Recall}}{\text{Precision}+ \text{Recall}} $$

(2)

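where Precision = $\frac{N_{cp}}{|P|}$ and Recall = $\frac{N_{cg}}{|G|}$. Here, P and G denote the sets of detected and known complexes, $N_{cp}$ is the number of detected complexes that match at least one known complex, and $N_{cg}$ is the number of known complexes that match at least one detected complex. The F-measure is the harmonic mean of Precision and Recall and assesses the overall performance of the detection algorithms.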
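The overlap score and the resulting Precision, Recall and F-measure can be sketched in a few lines. This is a minimal illustration, not the evaluation code used in the paper. It assumes that complexes are non-empty Python sets of protein identifiers, that a detected complex counts towards $N_{cp}$ when it matches at least one known complex, and that a known complex counts towards $N_{cg}$ when it matches at least one detected complex. The function names are hypothetical.

```python
def overlap_score(p, g):
    """OS(p, g) = |p ∩ g|^2 / (|p| * |g|) for two non-empty sets."""
    common = len(p & g)
    return common * common / (len(p) * len(g))


def precision_recall_f(predicted, gold, omega=0.2):
    """Return (precision, recall, f_measure) at overlap threshold omega."""
    # Detected complexes that match at least one known complex.
    n_cp = sum(
        1 for p in predicted if any(overlap_score(p, g) >= omega for g in gold)
    )
    # Known complexes that match at least one detected complex.
    n_cg = sum(
        1 for g in gold if any(overlap_score(p, g) >= omega for p in predicted)
    )
    precision = n_cp / len(predicted) if predicted else 0.0
    recall = n_cg / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


predicted = [{"A", "B", "C"}, {"X", "Y", "Z"}]
gold = [{"A", "B", "C", "D"}, {"E", "F", "G"}, {"H", "I", "J"}]
precision, recall, f_measure = precision_recall_f(predicted, gold)
```

In this toy example only the first detected complex matches a known one (overlap score 9/12 = 0.75), so Precision is 1/2, Recall is 1/3 and the F-measure is 0.4.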

2) JaccardI, JaccardS and Jaccard: As is well known, Precision, Recall and F-measure rely on a threshold to judge whether a standard complex and an identified complex match. This has limitations because it does not consider the extent of the overlap between an identified complex and the corresponding standard complex [75]. Therefore, we utilize the Jaccard measure for evaluating clustering results [76,77]. It considers the proportion of the overlap in the union of an identified complex and a standard complex [75]. For more details, please refer to Song et al. [76].

Before we give these metrics, we first introduce some notation. Let I be the set of complexes identified by a specific detection algorithm, and let S be the set of standard complexes. Moreover, let $S_{i}\in S$ be a standard complex and $I_{j}\in I$ an identified complex; the Jaccard coefficient between them is defined as $Jac(S_{i},I_{j})=\frac{|S_{i}\cap I_{j}|}{|S_{i}\cup I_{j}|}$ [77]. For each identified complex $I_{j}$, its Jaccard measure is the maximum Jaccard coefficient over all standard complexes, i.e., $Jac(I_{j}) = \max_{S_{i}\in S} Jac(I_{j},S_{i})$. Taking an average over the identified complexes, weighted by complex size, we compute the weighted average Jaccard measure for all identified complexes in I:

$$ JaccardI=\frac{\sum\nolimits_{I_{j}\in I} |I_{j}|Jac(I_{j})}{\sum\nolimits_{I_{j}\in I}|I_{j}|}, $$

(3)

Similarly, for a standard complex $S_{i}$, its Jaccard measure is $Jac(S_{i}) = \max_{I_{j}\in I} Jac(S_{i},I_{j})$ and

$$ JaccardS=\frac{\sum\nolimits_{S_{i}\in S} |S_{i}|Jac(S_{i})}{\sum\nolimits_{S_{i}\in S}|S_{i}|}, $$

(4)

Finally, the Jaccard measure between identified complexes and standard complexes is defined as the harmonic mean of JaccardI and JaccardS.

$$ Jaccard=\frac{2\times JaccardI\times JaccardS}{JaccardI + JaccardS}. $$

(5)

According to its definition, the Jaccard measure can evaluate the performance of detection algorithms better than the F-measure, especially when comparing the matching ratios of different algorithms.
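The three Jaccard metrics of Eqs. (3) to (5) can be sketched as follows. This is a minimal illustration that assumes complexes are non-empty Python sets; the helper names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two non-empty sets."""
    return len(a & b) / len(a | b)


def weighted_best_jaccard(sources, targets):
    """Size-weighted average of each source's best Jaccard over the targets."""
    total_size = sum(len(s) for s in sources)
    if total_size == 0 or not targets:
        return 0.0
    weighted = sum(len(s) * max(jaccard(s, t) for t in targets) for s in sources)
    return weighted / total_size


def jaccard_measure(identified, standard):
    """Return (JaccardI, JaccardS, Jaccard), following Eqs. (3) to (5)."""
    jac_i = weighted_best_jaccard(identified, standard)
    jac_s = weighted_best_jaccard(standard, identified)
    if jac_i + jac_s == 0:
        return jac_i, jac_s, 0.0
    return jac_i, jac_s, 2 * jac_i * jac_s / (jac_i + jac_s)


identified = [{"A", "B", "C"}, {"X", "Y"}]
standard = [{"A", "B", "C", "D"}, {"X", "Y"}]
jac_i, jac_s, jac = jaccard_measure(identified, standard)
```

Here JaccardI is (3 × 0.75 + 2 × 1) / 5 = 0.85 and JaccardS is (4 × 0.75 + 2 × 1) / 6 = 5/6. Unlike the threshold-based F-measure, a partial match lowers the score in proportion to how much of the union is missed.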

3) p-value: To evaluate the statistical significance of the detected protein complexes, many researchers annotate their main biological functions by using the p-value [23,78]. We perform a functional enrichment test to demonstrate the biological significance of the protein complexes detected by different algorithms. In this paper, we use LAGO [78] to perform the functional enrichment test with different thresholds. LAGO is a fast tool that finds significant GO terms among a list of gene names; it computes the significance (p-value) via the hypergeometric distribution and applies (by default) the Bonferroni correction. For the details of calculating the p-value, please refer to [78]. The p-value measures the biological relevance of detected protein complexes and is defined as follows.

$$ p\text{-value}=1-\sum\limits_{i=0}^{k-1} \frac{{{F}\choose{i}}{{N-F}\choose{C-i}}}{{{N}\choose{C}}} $$

(6)

where N is the number of proteins in the PPIN, C is the number of proteins in the detected protein complex, F is the size of a functional group in the PPIN, and k is the number of proteins of that functional group in the protein complex. Generally, the lower the p-value, the stronger the biological significance of the protein complex. A detected protein complex with a p-value less than 0.01 is deemed to be meaningful. In addition, larger protein complexes tend to have smaller p-values.
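Equation (6) is the upper tail of a hypergeometric distribution and can be evaluated directly with exact integer binomials. The sketch below is an illustration, not LAGO's implementation, and it omits the Bonferroni correction that LAGO applies by default. It sums the tail from k upwards, which is mathematically equal to one minus the lower sum in Eq. (6) but avoids subtracting two nearly equal numbers when the p-value is very small. The function name is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def enrichment_p_value(N, F, C, k):
    """P(at least k of the C complex proteins fall in a functional group).

    N: proteins in the PPIN, F: size of the functional group,
    C: size of the detected complex, k: group members inside the complex.
    """
    total = comb(N, C)
    # Upper tail i = k .. min(F, C); comb returns 0 when the lower
    # argument exceeds the upper one, so impossible terms vanish.
    tail = sum(comb(F, i) * comb(N - F, C - i) for i in range(k, min(F, C) + 1))
    return tail / total


p = enrichment_p_value(N=10, F=4, C=3, k=2)
```

For this toy case the tail is (36 + 4) / 120 = 1/3, which agrees with 1 − (20 + 60) / 120 computed from Eq. (6).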

Comparison with existing algorithms based on known protein complexes

We conduct experiments on six PPINs to compare our SE-DMTG algorithm with the following state-of-the-art protein complex detection algorithms: MCODE [28], MCL [34], CFinder [26], DPClus [29], IPCA [30], CMC [27], COACH [32], HC-PIN [41], SPICi [31], ClusterONE [35], WPNCA [33], CALM [36], and ClusterEPs [43]. All parameters are set as recommended by their authors and are listed in Table 4. To evaluate the performance of all algorithms more comprehensively, all the detection algorithms are tested on three different species: yeast, human and mouse. The three yeast PPINs are the Krogan-core, DIP and combined6 datasets. The human PPINs are DIP and a combined dataset (HPRD+BioGRID), and we use the BioGRID dataset as the mouse PPIN. All results are presented in Tables 5, 6, 7, 8 and 9. Because the results are similar, we analyze only the results on yeast in detail and briefly introduce the rest.

Table 4 Parameters of each algorithm on datasets

Table 5 Performance comparison on Krogan-core, DIP and combined6 datasets

Table 6 Performance comparison on Krogan-core, DIP and combined6 datasets

Table 7 Performance comparison on Homo sapiens (Human) DIP and HPRD+BioGRID datasets

Table 8 Performance comparison on DIP and HPRD+BioGRID datasets

Table 9 Performance comparison on Mouse BioGRID datasets

The F-measure results of the different algorithms on yeast PPINs are summarized in Table 5. As Table 5 shows, although SE-DMTG does not always obtain the best precision or recall, it stays in the top three in all cases. Furthermore, SE-DMTG obtains the best F-measure on all three yeast datasets, which means that it strikes a better balance between precision and recall than the other algorithms and therefore achieves better overall accuracy in the detected protein complexes. Generally, the performance of SE-DMTG in detecting protein complexes is very promising. The principal reason is that SE-DMTG takes into consideration not only gene ontology data but also the topological structure of the tested PPIN.

We have mentioned the limitations of precision, recall and F-measure earlier in this paper. We therefore also employ the Jaccard measure to reflect the matching ratio between the detected protein complex set and the standard complex set. Table 6 presents the comparative results of the different algorithms evaluated with the Jaccard metrics by using the CYC2008 and SGD standard complexes, respectively. As can be seen from Table 6, in the three yeast PPINs SE-DMTG consistently outperforms the other compared algorithms on the Jaccard metric; that is, SE-DMTG has the best Jaccard value on all tested datasets. This suggests that SE-DMTG performs better than other classic algorithms in terms of matching ratio on all three datasets. According to the above analysis, the new fitness function we designed is suitable for the problem of protein complex detection, and it seems reasonable to use GO annotations for the detection of protein complexes.

Moreover, we use the Krogan core dataset to compare the performance of all methods with CYC2008 and SGD as the standard complexes. As shown in Table 6, the Jaccard values of SE-DMTG reach 0.4688 and 0.4008, respectively, which significantly outperform those of the other algorithms. Similarly, on the DIP dataset, SE-DMTG achieves the highest Jaccard values (0.386 and 0.3485). For the combined6 dataset, SE-DMTG also achieves the highest Jaccard values, 0.5208 and 0.493, respectively. Thus, the Jaccard values of SE-DMTG on the combined6 dataset are higher than its results on the other datasets. This is mainly because combined6 is more reliable than the other two datasets: a PPIN built from multiple source datasets may contain more real protein-protein interactions.

To further demonstrate the effectiveness of the SE-DMTG algorithm in PPINs of other species, we also carry out experiments on the human and mouse PPINs. All comparison results are listed in Tables 7, 8 and 9. SE-DMTG again achieves the highest F-measure and Jaccard on these species in most cases. It is noteworthy that a higher F-measure means that protein complexes are identified more accurately, and a higher Jaccard means that a detection algorithm has a better matching ratio between detected protein complexes and real protein complexes. In summary, across PPINs of different species, SE-DMTG has the best performance among the compared algorithms in terms of F-measure and Jaccard.

Biological significance of the detected protein complexes

Because the known protein complexes are incomplete, we also calculate the p-values of the detected protein complexes on Cellular Component (CC) ontologies by using the tool LAGO (http://go.princeton.edu/cgi-bin/LAGO), which performs functional enrichment analysis [78]. All parameters of LAGO are set to their defaults. Because CC includes information on protein complexes, it allows a better comparison of the performance of different algorithms. Generally speaking, each protein complex detected by a detection algorithm is associated with a p-value that reflects its GO annotations. If the p-value of a protein complex is less than 0.01, we consider it biologically significant. In fact, the p-values of detected protein complexes are closely related to their size [33].

To evaluate the functional enrichment of the protein complexes detected by different algorithms more comprehensively, we focus on three aspects: (1) the number of significant detected protein complexes; (2) the percentage of significant detected protein complexes; and (3) the average p-value of detected protein complexes. The p-values of DPClus, IPCA, CMC, COACH, SPICi, ClusterONE, WPNCA and SE-DMTG are presented in Table 10. We select these algorithms for comparison with SE-DMTG because they perform robustly on most datasets; their detailed results are given in Tables 5, 6, 7, 8 and 9.

Table 10 Function enrichment analysis of the protein complexes identified by SE-DMTG and other algorithms on different datasets


In Table 10, we summarize the results of DPClus, IPCA, CMC, COACH, SPICi, ClusterONE, WPNCA and SE-DMTG obtained from functional enrichment tests with different p-value thresholds. As shown in Table 10, in most cases SE-DMTG detects more candidate protein complexes than methods such as DPClus, CMC, SPICi and ClusterONE in all PPINs. Furthermore, in terms of the number, percentage and average p-value of significant complexes, the protein complexes detected by SE-DMTG compare favorably with those of the algorithms mentioned above. Although IPCA detects the largest number of significant protein complexes, its percentage and average p-value of significant complexes are slightly lower than those of SE-DMTG, COACH and WPNCA. The percentage and average p-value of significant protein complexes detected by SE-DMTG from the six PPINs are slightly lower than those of COACH and WPNCA, ranking third among all methods. The main reason is that the protein complexes detected by SE-DMTG are smaller than those detected by COACH and WPNCA, and smaller detected complexes tend to have larger p-values. We discuss this in more detail in the section on the relationship between the size of detected protein complexes and the p-value of detected protein complexes.

Examples of detected complexes

In Tables 11 and 12, we present 18 protein complexes with very low p-values (≤E-20) detected by SE-DMTG in the six datasets. These very low p-values demonstrate that the protein complexes detected by SE-DMTG have high statistical significance.

Table 11 Eighteen detected protein complexes which have low p-value by SE-DMTG on different datasets


Table 12 Eighteen detected protein complexes detected by SE-DMTG


To further illustrate the results obtained by SE-DMTG, we take the 391st known protein complex of the CGPK complexes, the 'RNase complex', as an example. As shown in Fig. 1a, the known protein complex has 11 proteins. The protein complex detected by SE-DMTG also consists of 11 proteins and matches all of them, so its OS is 100%, the highest among all algorithms (Fig. 1b). IPCA, DPClus, COACH, WPNCA, MCL and SPICi cover 11, 11, 11, 11, 6 and 10 proteins of the real RNase complex, respectively, but their OS values are only 73%, 73%, 68%, 68%, 54% and 47%, respectively. For the remaining algorithms, the OS (see Eq. (1)) is lower than 0.47 or no matching complex is detected, so we do not show them in Fig. 1. This result shows that SE-DMTG can detect protein complexes accurately, indicating that the new definition of a protein complex is a good model for characterizing the topological structure of protein complexes. This example also helps explain why SE-DMTG achieves the highest F-measure and Jaccard while its percentage of significant detected protein complexes and average p-value are slightly lower than those of COACH and WPNCA. In summary, the protein complexes detected by SE-DMTG are more biologically significant.

In short, based on the results of the p-value test, we conclude that SE-DMTG detects protein complexes accurately and achieves good functional enrichment compared with the other thirteen algorithms.

Discussion

The relationship between the size of detected protein complexes and the p-value of detected protein complexes

To illustrate the relationship between the size and the p-value of detected protein complexes, we perform a statistical analysis. Because the sizes of standard complexes and detected protein complexes resemble a power-law distribution, we display only part of the distribution in Fig. 2. According to Fig. 2a, most standard complexes are very small. As shown in Fig. 2b, standard complexes whose size is less than or equal to 7 account for 76.96% of all standard complexes. Our statistics show that the average size of the combined standard complexes is 6.38 and the average size of the protein complexes detected by SE-DMTG is 6.86, whereas the average sizes of the protein complexes detected by IPCA, COACH and WPNCA are 10.96, 10.20 and 27.12, respectively. Thus, the average size of the complexes detected by SE-DMTG is close to that of the standard complexes. In Fig. 2c, we find that IPCA, COACH and WPNCA detect a larger number of large protein complexes. Additionally, the size distribution of the complexes detected by SE-DMTG is similar to that of the standard complexes (Fig. 2a and c).

Next, Fig. 3 illustrates how the size of protein complexes relates to the percentage of significant detected protein complexes and to the average p-value of detected protein complexes. Fig. 3 shows that the p-value (E) decreases gradually as the size of the detected protein complexes increases. For example, the p-value of standard complexes decreases gradually with increasing complex size in Fig. 3a. Similarly, for the protein complexes detected by IPCA in Fig. 3c, the p-value decreases gradually as the size increases. Hence, large detected protein complexes have small p-values. However, Fig. 2a and b show that most standard complexes and most protein complexes detected by SE-DMTG are small. This analysis explains why SE-DMTG achieves higher accuracy and matches the standard complexes better according to Tables 5, 6, 7, 8 and 9, whereas its percentage of significant detected protein complexes and average p-value are slightly lower than those of COACH and WPNCA, ranking third among all methods according to Table 10.

All in all, although the p-value has limitations in evaluating the functional significance of detected protein complexes, it reflects their functional enrichment to a certain degree. Overall, considering its superior accuracy and matching ratio and its strong performance in the functional enrichment test, we believe the protein complexes detected by SE-DMTG are more likely to be real protein complexes.

Computational complexity of SE-DMTG

Experimental setup

We implement SE-DMTG in Python and run all experiments on a 64-bit Windows PC with 12 GB of memory and a 3.60 GHz Intel i7 CPU. All compared methods are executed on the same machine, except SPICi, which is used through its web site.

Time complexity analysis

In this part, we analyze the time complexity of the SE-DMTG algorithm. It is difficult to give the exact computational complexity of SE-DMTG because it depends not only on the number of detected protein complexes but also on their size. Moreover, for each seed, we execute an iterative procedure until the current cluster no longer changes, so the number of iterations has a significant influence on the computational complexity of SE-DMTG. Thus, we only roughly analyze the time complexity. Let n and m denote the number of nodes and edges in graph G, respectively, and let $\overline {k}$ be the average number of neighbors over all nodes. Then we have $\overline {k}=\frac {\sum \nolimits _{v\in V}N(v)}{n}$, where N(v) is the number of neighbors of v. In the step of constructing a weighted PPIN, the time complexity of calculating the weights of all edges is $O\left (n*\overline {k}\right)=O\left (n*\frac {\sum \nolimits _{v\in V}N(v)}{n}\right)=O\left (\sum \nolimits _{v\in V}N(v)\right)=O(2*m)$. In the step of constructing the seed queue SQ and selecting the initial cluster, according to Eq. (12), the time complexity of calculating the score of each protein is $O\left (n*\left (\overline {k}+1\right)^{2}\right)=O\left (n*\left (\frac {\sum \nolimits _{v\in V}N(v)}{n}+1\right)^{2}\right)=O\left (\frac {4*m^{2}}{n}+4*m+n\right)$, and the time complexity of sorting all proteins by their Score(v) is O(n∗log(n)). In the step of generating detected protein complexes, the worst case is that we need to calculate the fitness of each protein, so its worst-case time complexity is also $O\left (\frac {4*m^{2}}{n}+4*m+n\right)$.
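As a worked check of the two closed forms above, the following derivation uses only the handshake identity (each edge contributes to the neighbor count of exactly two nodes) and the notation already introduced; it is a supplementary sketch, not part of the original analysis:

$$ \sum\limits_{v\in V}N(v)=2*m \;\Rightarrow\; \overline{k}=\frac{2*m}{n}, \qquad n*\left(\overline{k}+1\right)^{2}=n*\left(\frac{4*m^{2}}{n^{2}}+\frac{4*m}{n}+1\right)=\frac{4*m^{2}}{n}+4*m+n. $$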

In the step of generating detected protein complexes, we first analyze the time complexity of iteratively adding proteins to the cluster SG from its neighbors. This procedure has three basic phases: (1) obtain all candidate nodes that may be added to the cluster SG, whose time complexity is $O(n_{{SG}}*\overline {k})=O\left (n_{{SG}}*\frac {\sum \nolimits _{v\in V}N(v)}{n}\right)=O\left (\frac {2*n_{{SG}}*m}{n}\right)$, where n_SG is the number of nodes in the cluster SG; (2) find the highest-priority vertex according to Eq. (18) and add it to the cluster SG. In the worst case every candidate node is checked, so the time complexity is $O\left ((N_{{SG}}+N_{{SG}}-1+...+1)*\overline {k}\right)=O\left (\frac {m*N_{{SG}}*(N_{{SG}}-1)}{n}\right)$, where N_SG is the number of neighbors of SG; (3) calculate the fitness of graph SG, whose time complexity is $O(n_{{SG}}^{2})$. Thus, the time complexity of iteratively adding candidate nodes to the cluster SG is $O\left (\frac {2*n_{{SG}}*m}{n}+\frac {m*N_{{SG}}*(N_{{SG}}-1)}{n}+n_{{SG}}^{2}\right)$. We next analyze the time complexity of iteratively removing inner nodes from SG. Similarly, it has three basic phases: (1) determine the inner nodes that may be removed from the cluster SG, whose time complexity is also $O\left (\frac {2*n_{{SG}}*m}{n}\right)$; (2) find the highest-priority vertex according to Eq. (18) and remove it from the cluster SG, whose time complexity is also $O\left ((N_{{SG}}+N_{{SG}}-1+...+1)*\overline {k}\right)=O\left (\frac {m*N_{{SG}}*(N_{{SG}}-1)}{n}\right)$; (3) calculate the fitness of graph SG, whose time complexity is $O(n_{{SG}}^{2})$. Hence the time complexity of this step is also $O\left (\frac {2*n_{{SG}}*m}{n}+\frac {m*N_{{SG}}*(N_{{SG}}-1)}{n}+n_{{SG}}^{2}\right)$.

Suppose t is the number of iterations needed to generate a detected protein complex and N is the number of detected protein complexes. Then the time complexity of Algorithm 2 is $O\left (N*t*\frac {m}{n}*\left (N_{{SG}}*(N_{{SG}}-1)+3*n_{{SG}}*(1+n_{{SG}})\right)\right)$. Finally, we need to discard redundant protein complexes, whose time complexity is $O(len(PCs)^{2})$, where len(PCs) is the number of candidate identified protein complexes. All in all, the time complexity of the SE-DMTG algorithm is $O\left (2*m+\frac {4*m^{2}}{n}+4*m+n+n*log(n)+N*t*\frac {m}{n}*\left (N_{{SG}}*(N_{{SG}}-1)+3*n_{{SG}}*(1+n_{{SG}})\right)+len(PCs)^{2}\right)$, where N, t and len(PCs) are treated as constants and N_SG and n_SG as variables. These variables are summarized in Table 13.

Table 13 Some variables used in SE-DMTG algorithm


Conclusion

Many high-throughput experimental techniques and computational algorithms have been developed to identify protein complexes from PPINs. However, most of these methods are based on the original network or use topological properties alone; they are thus limited in the quality of protein complex identification and ignore other useful biological information, such as functional properties. In our opinion, both topological and functional properties are meaningful and important for identifying protein complexes. We therefore combine common neighbors and functional properties to calculate edge weights and construct weighted PPINs. Moreover, we propose a new local search heuristic graph clustering algorithm, SE-DMTG, to extract protein complexes with various densities and modularities based on a new model. Although models that consider density or modularity have been applied to study PPINs, our model is novel in considering both density and modularity simultaneously.

We evaluate the performance of the proposed SE-DMTG on PPINs of three species against several standard complex datasets and compare the results with those of thirteen competing algorithms. The experimental results show that SE-DMTG is competitive in identifying protein complexes and that combining topological information with GO information increases the detection accuracy. The results also show that SE-DMTG outperforms current state-of-the-art algorithms overall on several measures. Furthermore, we analyze the biological significance of the protein complexes detected by different methods, and the results show that the complexes detected by SE-DMTG are biologically significant. Additionally, SE-DMTG is robust to false positives in experimental data because it integrates functional properties. Given the wide application of supervised learning, in future work we will try to design a new algorithm that combines a classification model with unsupervised clustering algorithms to improve performance. Furthermore, SE-DMTG may be extended naturally to other types of biological data fusion to study more comprehensive characteristics of biological networks and to analyze other forms of complex networks, such as Internet networks, citation networks, ecological networks and social networks.

Methods

Preliminaries

Since the interactions among proteins in PPINs are symmetric, a PPIN can be formulated as an undirected weighted graph G=(V,E,W), where V is a set of nodes representing the proteins of the PPIN, E is a set of undirected edges corresponding to their interactions, and W represents the interaction likelihoods (edge weights) between nodes. In this paper, we obtain the weights by using both topological and biological information. The symbols, abbreviations and their interpretation are shown in Table 1.

Algorithm framework

The SE-DMTG algorithm is developed to detect protein complexes based on GO annotations and the topological structure of PPINs. Furthermore, we propose a composite model for the identification of protein complexes. Algorithm 1 presents the main function of the proposed SE-DMTG, which operates in three phases. In the first phase, given a PPIN, we construct a weighted PPIN by using common neighbors and GO annotations, as defined by Eqs. (7) and (8). In the second phase, SE-DMTG constructs a seed node queue based on a seed score function, defined by Eq. (12), to form the initial cluster. In the third phase, starting from the initial cluster, we provide a quantitative definition of protein complexes that formulates protein complex identification as an optimization problem, defined by Eq. (17). Finally, we apply an iterative greedy search process to generate protein complexes (see Algorithm 2). False and redundant candidate protein complexes are filtered out to obtain the final identified protein complexes. Figure 4 shows a flowchart of SE-DMTG, which is composed of the following main steps:

1. Construct a weighted PPIN based on common neighbors and GO annotations.
2. Generate a seed queue and form an initial cluster.
3. Define the protein complex model.
4. Extend and correct the cluster to generate a locally optimal subgraph.
5. Obtain a list of identified protein complexes.

In step 1, the edge clustering coefficient probability is computed from common neighbors via Eq. (7), and the functional similarity between two proteins is calculated from GO annotations according to Eq. (8). In step 2, we give each protein a score on the basis of both its weight degree (see Eq. (10)) and its neighborhood graph clustering coefficient (see Eq. (11)), and we sort the proteins by score according to Eq. (12). In step 3, we introduce a new model to estimate the quantitative value of a cluster (see Eq. (17)). In step 4, we iteratively extend and correct the cluster to generate a protein complex from the weighted PPIN. This process involves several sub-steps: selecting the highest-scoring protein as the seed node to form the initial cluster (see the section on generating a seed queue); assessing the priority of boundary nodes (see the section on determining the priority of boundary nodes); and iteratively adding neighbor nodes to the cluster, removing inner nodes from the cluster, and filtering out false candidate protein complexes with size less than or equal to two (see the section on extending and correcting the cluster to generate a locally optimal subgraph). In step 5, we discard redundant candidate protein complexes and output a list of identified protein complexes. For more details of these processes, see the related sections.

Construction of a weighted PPIN based on common neighbors and GO annotations

Recent studies [30,35,36] have shown that the accuracy of protein complex detection can be significantly improved by taking network weights into account. In the following subsections, we describe how the edge weights of the PPIN are calculated.

Common neighbors

The edge clustering coefficient [47] was first developed to describe how strongly neighbors are connected. However, Radicchi et al. [47] note that the edge clustering coefficient may not be suitable for use in PPINs because PPINs are disassortative networks. To overcome this limitation, Zhao et al. [48,49] propose a new method to calculate the probability of protein-protein interactions. Following their work, we use the same method, namely common neighbors (CN), to calculate the weight of each edge. The existence probability of an edge (v,u) in a PPIN is defined as follows:

$$\begin{array}{@{}rcl@{}} CN(v,u)\,=\,\left\{ \begin{array}{ll} \sqrt{\frac{|N(v) \cap N(u)|^{2}}{|N(v)|\ast |N(u)|}}, & |N(v)|\!\geqslant\! 1 \ and\ |N(u)|\geqslant 1 \cr 0, &otherwise \end{array}\right. \end{array} $$

(7)

where N(v) and N(u) are the neighborhood sets of v and u, respectively. In Eq. (7), |N(v)∩N(u)| denotes the number of common neighbors of the two proteins. CN is a measure that describes how closely proteins v and u are related. In this paper, we assume that the similarities of different interactions are independent of each other. The higher the value of CN(v,u), the larger the probability that proteins v and u belong to the same protein complex.
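Eq. (7) can be sketched in a few lines of Python. This is a minimal illustration, assuming the network is stored as a dictionary of neighbor sets; the `adj` layout, the function name and the toy network are assumptions made for this example and are not taken from the authors' implementation.

```python
import math

def common_neighbor_weight(adj, v, u):
    """CN(v, u) from Eq. (7); adj maps each protein to its set of neighbors."""
    nv, nu = adj.get(v, set()), adj.get(u, set())
    # Eq. (7) is defined as 0 when either protein has no neighbors.
    if len(nv) < 1 or len(nu) < 1:
        return 0.0
    shared = len(nv & nu)  # |N(v) ∩ N(u)|
    return math.sqrt(shared ** 2 / (len(nv) * len(nu)))

# Hypothetical toy network: triangle a-b-c plus a pendant edge c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

cn_ab = common_neighbor_weight(adj, "a", "b")  # one shared neighbor (c)
cn_cd = common_neighbor_weight(adj, "c", "d")  # no shared neighbor
```

In this toy network the edge (c,d) has no common neighbor, so its CN value is zero and its final weight depends entirely on the GO term of Eq. (8).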

Protein functional similarity computation

From a biological perspective, gene ontology (GO) [50] is currently one of the most comprehensive ontology databases in the bioinformatics community [51]. The database provides a series of GO terms to describe gene product features. Proteins constituting a complex are likely to have similar functions, and a larger functional similarity means higher confidence that two proteins share similar functions. In other words, if two interacting proteins v and u have more common GO annotations and their functions are more similar, they are more likely to belong to the same protein complex. Additionally, proteins with similar functions tend to be co-expressed [52]. Note that when the two end nodes v and u of an edge (v,u) have no common GO annotations, the GO-based score of edge (v,u) is regarded as noise and set to 0.0. Here, we define a new measure to describe the similarity of two interacting proteins v and u, based on a biological similarity function defined as follows:

$$ GO(v,u)= \left\{\begin{array}{ll} \frac{|GO(v) \cap GO(u)|}{\max \left(\min(|GO(v)|,|GO(u)|),Average(GO)\right)}, & |GO(v) \cap GO(u)| > 0 \cr 0, & otherwise \end{array}\right. $$

(8)

where |GO(v)| and |GO(u)| represent the number of GO annotations of protein v and protein u, respectively, and |GO(v)∩GO(u)| represents the number of GO annotations common to both proteins. The more GO annotations proteins v and u share, the larger the functional score. Here, we use min(|GO(v)|,|GO(u)|) because some proteins are overlapping nodes. $Average(GO)=\frac {\sum \nolimits _{i\in V,|GO(i)|\geqslant 1}|GO(i)|}{|N|}$ is the average number of GO annotations per annotated protein in the whole PPIN, where |N| is the number of proteins that have at least one GO annotation. Based on this definition, if the smaller of the two annotation counts is below the average, the denominator is raised to the average. In this way, max(min(|GO(v)|,|GO(u)|),Average(GO)) penalizes the reliability of an edge (v,u) between proteins with very few GO annotations.
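The following Python sketch illustrates Eq. (8) under stated assumptions: GO annotations are held in a dictionary of term sets, and the term labels `t1` to `t4`, the function names and the toy data are placeholders invented for this example rather than real GO identifiers or the authors' code.

```python
def average_go_count(go):
    """Average(GO): mean annotation count over proteins with >= 1 annotation."""
    counts = [len(terms) for terms in go.values() if len(terms) >= 1]
    return sum(counts) / len(counts) if counts else 0.0

def go_similarity(go, v, u, avg):
    """GO(v, u) from Eq. (8); go maps each protein to its set of GO terms."""
    gv, gu = go.get(v, set()), go.get(u, set())
    shared = len(gv & gu)  # |GO(v) ∩ GO(u)|
    if shared == 0:
        return 0.0
    # The denominator is never smaller than the network-wide average,
    # which penalizes pairs in which a protein has very few annotations.
    return shared / max(min(len(gv), len(gu)), avg)

# Hypothetical annotations (placeholder term labels).
go = {
    "a": {"t1", "t2", "t3", "t4"},
    "b": {"t1", "t2"},
    "c": {"t1"},
    "d": set(),
}

avg = average_go_count(go)               # (4 + 2 + 1) / 3
go_ab = go_similarity(go, "a", "b", avg)
go_bc = go_similarity(go, "b", "c", avg)
go_ad = go_similarity(go, "a", "d", avg)
```

Because the shared count can never exceed the smaller annotation set, the score stays within [0.0, 1.0].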

In this paper, SE-DMTG integrates both the topological and biological information of the PPIN by using CN and GO. CN captures the static topological information, and GO assesses the functional similarity of proteins. To incorporate both measures into our method, we use their arithmetic mean as the edge weight in the PPIN. The weight of each edge between two proteins is calculated as follows:

$$\begin{array}{@{}rcl@{}} w(v,u)= \left\{\begin{array}{ll} \frac{GO(v,u)+CN(v,u)}{2}, & GO(v,u)+CN(v,u) > 0 \cr 0, &otherwise \end{array}\right. \end{array} $$

(9)

Here,

1. Neighbors shared by two proteins in the network are called the common neighbors (CN) of Eq. (7).
2. The functional similarity of two proteins is quantified in terms of the GO annotation (GO) in Eq. (8).

The above two properties express the interaction based on CN and GO annotations. Note that the value of w(v,u) has a range between 0.0 and 1.0 and is used for evaluating the reliability of protein pairs to construct a weighted PPIN. The weights of each edge in the PPIN are obtained by integrating both topological information and biological information. Edges whose weights are 0.0 are considered to be noise and are deleted from the PPIN.
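The weighting and noise-filtering step of Eq. (9) can be sketched as follows. This is an illustrative sketch only: it assumes the CN and GO scores have already been computed and stored per edge, and the dictionaries, function names and numeric values are hypothetical.

```python
def edge_weight(cn, go):
    """w(v, u) from Eq. (9): arithmetic mean of the CN and GO scores."""
    total = cn + go
    return total / 2 if total > 0 else 0.0

def build_weighted_edges(edges, cn_scores, go_scores):
    """Return {edge: weight}, dropping edges whose weight is 0.0 (noise)."""
    weighted = {}
    for edge in edges:
        w = edge_weight(cn_scores.get(edge, 0.0), go_scores.get(edge, 0.0))
        if w > 0.0:
            weighted[edge] = w
    return weighted

# Hypothetical precomputed scores for three interactions.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
cn_scores = {("a", "b"): 0.5, ("b", "c"): 0.4}
go_scores = {("a", "b"): 0.9}

weighted = build_weighted_edges(edges, cn_scores, go_scores)
```

In this example the edge (c,d) has neither common neighbors nor shared annotations, so it is removed from the weighted network.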

Generation of a seed queue and formation of the initial cluster

Choosing high-quality protein seeds for expansion is critical. Each cluster starts at an initial cluster that consists of a single node that is generally called the seed node. An inappropriate choice of a seed node will likely affect the process of detecting protein complexes. For example, a low-quality seed node may result in a false positive protein complex being detected. Furthermore, if a protein that belongs to multiple complexes is chosen as a seed node, the resulting identified complex may subsume the multiple complexes under an unrealistically large false protein complex that cannot match any real protein complex [36]. From a topological perspective, the central part of a protein complex often corresponds to a dense subgraph with high clustering coefficient and more reliable weight in the PPINs [29–31,46,53].

According to the preliminaries section, we have given a confidence score 0≤w_v,u≤1.0 to every edge (v,u)∈E. We utilize several measures to select seed nodes. For each node v in the PPIN, we define its weight degree, d_w(v), as the sum of all its edge weight values:

$$ d_{w}(v)=\sum\limits_{(v,u)\in E}w(v,u). $$

(10)

For each node v, the neighborhood graph, which consists of v, all its neighbors and the edges among them, is defined as G_v=(V_v,E_v), where V_v={v}∪{u|u∈V,(v,u)∈E} and E_v={(u_i,u_j)|(u_i,u_j)∈E,u_i,u_j∈V_v}. Furthermore, the neighborhood graph clustering coefficient (NGCC) is the sum of the weights of the edges in G_v, divided by the total number of possible edges. Thus, for a node v, the NGCC is defined in Eq. (11) [54]:

$$ NGCC(v)=\frac{\sum\nolimits_{v,u\in V_{v}}w(v,u)}{(|V_{v}|\ast(|V_{v}|-1))/2}. $$

(11)

Here, |V_v| is the number of nodes in the neighborhood graph G_v (node v together with its neighbors), $\sum \nolimits _{v,u\in V_{v}}w(v,u)$ is the sum of the weights of the edges in G_v, and $\frac {(|V_{v}|\ast (|V_{v}|-1))}{2}$ is the maximum possible number of edges among these nodes. The NGCC reflects the weighted degree of aggregation of proteins in the PPINs. Note that the NGCC measures the closeness of node v and its neighbors and varies from 0.0 to 1.0.

We devise the following score function to sort all proteins in a PPIN. If a protein has a higher score according to Eq. (12), it is more likely to be used as a seed node, to lie inside a protein complex, and to have high centrality in the complex. The score of each protein v is the product of its neighborhood graph clustering coefficient and its weight degree, as defined in Eq. (12):

$$ Score(v)=d_{w}(v)*NGCC(v). $$

(12)

The seed score function takes both weight degree centrality and neighborhood graph density into consideration when prioritizing proteins as seeds. We sort all proteins in the PPIN by score and use a queue SQ to record the order. We select the protein with the highest score according to Eq. (12) as the seed node to grow a detected protein complex. Once a new detected protein complex is generated, all of its nodes are recorded in a list, and we choose the highest-scoring unvisited node in the queue SQ as the next seed node. Note that we calculate the score of each protein only once based on the PPIN, which is more biologically meaningful [30].
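A compact Python sketch of Eqs. (10) to (12) is given below. It assumes that edge weights are stored in a dictionary keyed by `frozenset` pairs and that `adj` holds neighbor sets; these data structures, the function names and the toy weights are assumptions for illustration, not the authors' implementation.

```python
def weight_degree(weights, adj, v):
    """d_w(v) from Eq. (10): sum of the weights of all edges incident to v."""
    return sum(weights[frozenset((v, u))] for u in adj[v])

def ngcc(weights, adj, v):
    """NGCC(v) from Eq. (11) on the neighborhood graph G_v."""
    nodes = {v} | adj[v]  # V_v: v together with its neighbors
    if len(nodes) < 2:
        return 0.0
    # Sum the weights of the edges whose two endpoints both lie in V_v.
    total = sum(w for edge, w in weights.items() if edge <= nodes)
    return total / (len(nodes) * (len(nodes) - 1) / 2)

def seed_queue(weights, adj):
    """Sort proteins by Score(v) = d_w(v) * NGCC(v), highest first (Eq. (12))."""
    score = {v: weight_degree(weights, adj, v) * ngcc(weights, adj, v) for v in adj}
    return sorted(adj, key=lambda v: score[v], reverse=True), score

# Hypothetical weighted toy network: triangle a-b-c plus pendant edge c-d.
weights = {
    frozenset(("a", "b")): 0.8,
    frozenset(("a", "c")): 0.6,
    frozenset(("b", "c")): 0.4,
    frozenset(("c", "d")): 0.2,
}
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

queue, score = seed_queue(weights, adj)
```

Here protein a is ranked first because it combines the largest weight degree with a dense neighborhood, while the pendant protein d is ranked last.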

Definition of a protein complex model

As mentioned in the Background section, several protein complex identification algorithms have been presented. Most existing algorithms make many assumptions to define the subgraphs that represent possible protein complexes in PPINs. In terms of the actual performance of these algorithms, subgraphs with high density or high modularity in PPINs generally correspond to protein complexes [29,35]. However, a dense graph could have low modularity, and a graph with high modularity may have low density. Therefore, density-based algorithms ignore protein complexes with low density, and modularity-based algorithms miss protein complexes with low modularity. Overall, these methods have limitations when identifying protein complexes with various densities and modularities [46]. To overcome these limitations, we define a new protein complex model that detects protein complexes by considering both density and modularity in PPINs. We begin by presenting some related definitions.

According to the preliminaries section, for an undirected weighted subgraph SG, its density is denoted as D_SG:

$$ D_{{SG}}=\frac{\sum\nolimits_{(u,v)\in SG} w_{u,v}}{|SG|*(|SG|-1) /2} $$

(13)

where $\sum \nolimits _{u,v\in SG} w_{u,v}$ is the total weight of the edges contained in subgraph SG, and |SG| represents the number of nodes in subgraph SG. The density of a graph measures how close the graph is to a clique and takes values between 0.0 and 1.0.

For the subgraph SG⊆G, its weighted in-degree, denoted as $d_{w}^{in}(SG)$, is the sum of the weights of all edges belonging to SG, and its weighted out-degree, denoted as $d_{w}^{out}(SG)$, is the sum of the weights of the edges connecting the nodes in SG to the nodes in the rest of graph G. $d_{w}^{in}(SG)$ and $d_{w}^{out}(SG)$ can be obtained as follows [46]:

$$ d_{w}^{in}(SG)=\sum\limits_{u,v\in SG;(u,v)\in E}w(u,v). $$

(14)

$$ d_{w}^{out}(SG)=\sum\limits_{v\in SG;u\notin SG;(u,v)\in E}w(u,v). $$

(15)

Clearly, the weighted degree of SG, d_w(SG), is equal to the sum of $d_{w}^{in}(SG)$ and $d_{w}^{out}(SG)$.

The modularity M_SG of a subgraph SG⊆G is defined as follows:

$$ M_{{SG}}=\frac{d_{w}^{in}(SG)}{d_{w}^{in}(SG)+d_{w}^{out}(SG)}. $$

(16)

Obviously, M_SG takes values from 0.0 to 1.0. If a subgraph has higher modularity, it has more connections within itself and fewer connections to the rest of the PPIN. A subgraph with a modularity of 1.0 has no connections with the rest of the PPIN.

In this model, we measure the quality of SG during protein complex identification by considering its density (D_SG) and modularity (M_SG). D_SG describes the density of subgraph SG, M_SG describes its modularity, and $\sqrt {D_{{SG}}*M_{{SG}}}$ is large only when the subgraph has both high density and high modularity. To put this combined term on the same scale as the density and modularity, i.e., within [0.0,1.0], the product D_SG∗M_SG is normalized by taking the geometric mean of D_SG and M_SG, i.e., its square root. The fitness of a subgraph SG in an undirected weighted graph G, denoted as F(SG), is defined as:

$$ F(SG)=\frac{D_{{SG}}+M_{{SG}}+\sqrt{D_{{SG}}*M_{{SG}}}}{3}. $$

(17)

Generally, as the subgraph SG expands, its modularity increases and its density decreases. Thus, by expanding from a node, we can obtain a subgraph with the local maximum fitness score and output the result as a protein complex. Thus, this new model can be used for identifying protein complexes with different topology, including high density but low modularity, high modularity but low density, and high density and high modularity. Therefore, our model can identify the protein complexes with various densities and modularities.
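Eqs. (13), (16) and (17) can be sketched together in Python. The sketch assumes edge weights keyed by `frozenset` pairs and a subgraph given as a set of node names; the data layout, function names and toy weights are illustrative assumptions, not the authors' code.

```python
import math

def density(weights, nodes):
    """D_SG from Eq. (13): inner edge weight over the number of possible edges."""
    n = len(nodes)
    if n < 2:
        return 0.0
    inside = sum(w for edge, w in weights.items() if edge <= nodes)
    return inside / (n * (n - 1) / 2)

def modularity(weights, nodes):
    """M_SG from Eq. (16): inner weight over inner plus boundary weight."""
    w_in = sum(w for edge, w in weights.items() if edge <= nodes)
    # Boundary edges have exactly one endpoint inside the subgraph.
    w_out = sum(w for edge, w in weights.items() if len(edge & nodes) == 1)
    total = w_in + w_out
    return w_in / total if total > 0 else 0.0

def fitness(weights, nodes):
    """F(SG) from Eq. (17): mean of density, modularity and their geometric mean."""
    d, m = density(weights, nodes), modularity(weights, nodes)
    return (d + m + math.sqrt(d * m)) / 3

# Hypothetical weighted toy network: triangle a-b-c plus pendant edge c-d.
weights = {
    frozenset(("a", "b")): 0.8,
    frozenset(("a", "c")): 0.6,
    frozenset(("b", "c")): 0.4,
    frozenset(("c", "d")): 0.2,
}

sg = {"a", "b", "c"}
d_sg = density(weights, sg)
m_sg = modularity(weights, sg)
f_sg = fitness(weights, sg)
```

For the triangle {a, b, c}, the inner weight is 1.8 and the only boundary edge is (c,d), which gives a density of 0.6 and a modularity of 0.9 in this toy setting.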

Extending and correcting the cluster to generate a locally optimal subgraph

Determining the priority of boundary nodes

An initial cluster (SG) starts as a single protein and then grows and shrinks gradually as proteins are added and removed one by one. Proteins are added from the neighbors of SG, denoted as Neighbor(SG), and removed from the inner nodes of SG, denoted as inner_nodes(SG). We first define these two concepts: if p∈Neighbor(SG), then p is connected by at least one edge to a protein of cluster SG but does not belong to SG; if p∈inner_nodes(SG), then p belongs to SG and is connected to at least one node that is a neighbor of SG. A key problem is to decide the priority with which proteins are added to or removed from SG. In general, if a protein v belongs to SG, it should be strongly connected with its cluster SG=(V_SG,E_SG). Therefore, adding such a protein v to SG could increase the average of the weighted interactions within SG; by contrast, removing a weakly connected protein v from SG could also increase this average. Here, we introduce a measure to assess the priority, denoted as weight_avg(SG), which is defined as:

$$ weight_{{avg}}(SG)=\frac{2*\sum\nolimits_{(v,u)\in E_{{SG}}} weight(v,u)}{|V_{{SG}}|}, $$

(18)

where weight_avg(SG) is the average of the weighted interactions of all proteins within SG, |V_SG| is the number of proteins in SG and $\sum \nolimits _{(v,u)\in E_{{SG}}} weight(v,u)$ represents the total weight of the interactions in SG. The priority of adding a node p∈Neighbor(SG) to the cluster SG, or of deleting a node p∈inner_nodes(SG) from the cluster SG, is determined by the value of weight_avg(SG). We choose the boundary node with the highest weight_avg(SG) and add it to SG or remove it from SG so as to maximize the value of F(SG) (see Eq. (17)).
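The priority measure of Eq. (18) can be sketched as below. One interpretive assumption is made and should be read as such: the priority of a candidate p is taken to be weight_avg evaluated on the cluster after p is tentatively added. The text does not spell out this detail, and the data layout, function names and toy network are likewise hypothetical.

```python
def weight_avg(weights, nodes):
    """weight_avg(SG) from Eq. (18): twice the inner edge weight over |V_SG|."""
    if not nodes:
        return 0.0
    inside = sum(w for edge, w in weights.items() if edge <= nodes)
    return 2 * inside / len(nodes)

def neighbors_of(adj, nodes):
    """Neighbor(SG): proteins outside SG with at least one edge into SG."""
    return {u for v in nodes for u in adj[v]} - nodes

def rank_additions(weights, adj, nodes):
    """Order candidates by weight_avg of the tentatively extended cluster
    (assumed reading of the priority rule)."""
    candidates = neighbors_of(adj, nodes)
    return sorted(candidates,
                  key=lambda p: weight_avg(weights, nodes | {p}),
                  reverse=True)

# Hypothetical weighted toy network.
weights = {
    frozenset(("a", "b")): 0.8,
    frozenset(("a", "c")): 0.6,
    frozenset(("b", "c")): 0.4,
    frozenset(("c", "d")): 0.2,
    frozenset(("a", "e")): 0.1,
}
adj = {"a": {"b", "c", "e"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c"}, "e": {"a"}}

sg = {"a", "b"}
current = weight_avg(weights, sg)
ranking = rank_additions(weights, adj, sg)
```

In this toy network protein c is ranked ahead of protein e because it is attached to both members of the cluster through heavier edges.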

Extending and correcting estimation

For a cluster SG, in the extending step, we first obtain all its neighbors, namely Neighbors(SG). The priority of all neighbors is determined by the value of weight_avg(SG) (see Eq. (18)). The highest-priority protein v is added to SG only if two tests are passed: (1) the fitness F(SG) of SG increases after v is added, and (2) the actual number of edges between v and SG, denoted as |SG∩N(v)|, i.e., the number of proteins in SG connected with v, is greater than the expected number of edges, denoted as F(SG)∗|SG|, where F(SG) is the fitness of SG and |SG| is the number of proteins in SG. Once the highest-priority protein v is added to SG, SG is updated and v is removed from Neighbors(SG). Then the priorities of Neighbors(SG) and the fitness F(SG) of SG are recalculated, the next highest-priority protein is tested, and so on. If the highest-priority protein v fails either of the two tests, SG cannot be further extended.
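The extending step can be sketched as a greedy loop. This is a simplified sketch under stated assumptions, not the authors' Algorithm 2: the candidate priority is taken as weight_avg of the tentatively extended cluster, the expected-edge threshold uses the fitness of the current cluster before the addition, and the data layout and toy network are hypothetical.

```python
import math

def inner_weight(weights, nodes):
    """Total weight of the edges with both endpoints in the cluster."""
    return sum(w for edge, w in weights.items() if edge <= nodes)

def fitness(weights, nodes):
    """F(SG) from Eq. (17), built from density (13) and modularity (16)."""
    n = len(nodes)
    w_in = inner_weight(weights, nodes)
    w_out = sum(w for edge, w in weights.items() if len(edge & nodes) == 1)
    d = w_in / (n * (n - 1) / 2) if n > 1 else 0.0
    m = w_in / (w_in + w_out) if w_in + w_out > 0 else 0.0
    return (d + m + math.sqrt(d * m)) / 3

def extend(weights, adj, seed):
    """Greedily add the highest-priority neighbor while both tests pass."""
    sg = {seed}
    while True:
        candidates = {u for v in sg for u in adj[v]} - sg
        if not candidates:
            break
        # Priority: weight_avg (Eq. (18)) of the tentatively extended cluster.
        best = max(candidates,
                   key=lambda p: 2 * inner_weight(weights, sg | {p}) / (len(sg) + 1))
        current = fitness(weights, sg)
        gains = fitness(weights, sg | {best}) > current       # test 1
        connected = len(adj[best] & sg) > current * len(sg)   # test 2
        if not (gains and connected):
            break
        sg.add(best)
    return sg

# Hypothetical weighted toy network.
weights = {
    frozenset(("a", "b")): 0.8,
    frozenset(("a", "c")): 0.6,
    frozenset(("b", "c")): 0.4,
    frozenset(("c", "d")): 0.2,
    frozenset(("a", "e")): 0.1,
}
adj = {"a": {"b", "c", "e"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c"}, "e": {"a"}}

cluster = extend(weights, adj, "a")
```

Starting from seed a, the loop adds b and then c; adding d would lower the fitness, so the extension stops at the triangle.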

For a cluster SG, in the correcting step, we first obtain all inner nodes, namely Inner_nodes(SG). The priority of all proteins in Inner_nodes(SG) is determined by the value of weight_avg(SG) (see Eq. (18)). The highest priority protein v is deleted from SG only if it passes two tests: (i) the fitness of the cluster SG−{v} is higher than the fitness F(SG) of SG, and (ii) the number of actual edges between v and SG−{v}, denoted as |SG−{v}∩N(v)|, i.e., the number of proteins in SG−{v} connected with v, is greater than the number of expected edges, denoted as F(SG)∗|SG|, where F(SG) is the fitness of SG and |SG| is the number of proteins in SG. Once the highest priority protein v is removed from SG, the cluster SG is updated and v is removed from Inner_nodes(SG). Then, the priorities of Inner_nodes(SG) and the fitness of the cluster SG−{v} are recalculated, the next highest priority protein is tested, and so on. If the highest priority protein v fails either of the two tests, then the cluster SG cannot be further corrected.

Obtaining a list of identified protein complexes.

On the basis of the quantitative description of protein complexes, we develop a novel clustering algorithm based on density and modularity with network topology and GO annotations, named SE-DMTG, to identify protein complexes in a weighted PPIN whose edge weights reflect the reliability of the edge in a protein complex according to topological and biological information.

The input of the SE-DMTG algorithm is a PPIN, which is described as a simple undirected graph G(V,E) with GO annotations. The SE-DMTG algorithm broadly consists of four phases. First, SE-DMTG constructs a weighted PPIN based on topological and biological information in lines 2-11. Second, SE-DMTG calculates the scores of all nodes and selects the node with the maximum score as the seed in lines 12-18. Third, starting from the seed node, a greedy procedure adds nodes to or removes nodes from the cluster SG to obtain a subgraph with high graph fitness. The growth process is repeated from different seeds to form multiple, possibly overlapping subgraphs in lines 19-49. Once a new cluster is completed, all nodes in this cluster SG are recorded to prevent them from being used as seed nodes. Then, we select the next seed node from those remaining in the queue SQ to generate the next cluster SG in lines 41-45. Moreover, we discard candidate complexes whose size is less than 3 [35] and remove unreliable candidate complexes in lines 38-46. Finally, we discard redundant protein complexes in lines 50-55. A detailed description of the SE-DMTG algorithm is shown in Algorithm 1.

In the first step, we assign a weight to each edge based on common neighbors and gene ontology data (lines 2-11).

In the second step, SE-DMTG calculates the score of each node (lines 12-17). Furthermore, all the nodes in network G are queued into SQ in non-increasing order of Score(v) (line 18).
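The construction of the seed queue can be illustrated with a short Python sketch. The scores below are hypothetical values, the definition of Score(v) is taken as given, and the alphabetical tie-breaking rule is an assumption added only to make the ordering deterministic.

```python
def build_seed_queue(score):
    """Return the nodes in non-increasing order of Score(v).

    Ties are broken by node name (an assumption for reproducibility).
    """
    return sorted(score, key=lambda v: (-score[v], v))


# Hypothetical node scores.
score = {"p1": 0.4, "p2": 0.9, "p3": 0.9, "p4": 0.1}
SQ = build_seed_queue(score)
print(SQ)
```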

In the third step, we choose the node with the highest Score(v) that has not yet been visited and grow a cluster from it (lines 19-29). The key idea of this step is that any neighbor of the current subgraph SG that makes a positive contribution to F(SG) is added to SG, and any inner node whose removal makes a positive contribution to F(SG) is removed from SG (line 37). The iterative generation of a complex is described in Algorithm 2. Algorithm 2 has two subphases, in which we gradually add neighbors to cluster SG or remove inner nodes from cluster SG. The priority of candidate nodes is based on Eq. (18), and whether a candidate is accepted depends on two conditions. More details are given in the section on extending and correcting the cluster to generate a locally optimal subgraph.

Next, the step-by-step procedure of step 3 is given in Algorithm 2.

In the first phase (lines 3-25), after obtaining a seed protein, we first obtain the external boundary protein set, which consists of the neighbors of SG and is called Neighbor(SG), in lines 4-5. Then, we calculate the graph fitness of SG in line 8. Next, we find the neighbor protein in Neighbor(SG) with the highest priority, i.e., the protein p that maximizes the value of weight_avg(SG+{p}) when added to SG, in lines 7-14. We then calculate the fitness of graph SG+{p} in line 15, and Expectation_edges is calculated as the graph fitness of SG multiplied by the size of SG in line 16. Meanwhile, we also calculate the value of Actually_edges, which is the size of the intersection of Neighbor(node_max) and SG, denoted as |Neighbor(node_max)∩SG|, in line 18. If adding the highest priority node node_max increases the value of F(SG) and Actually_edges is larger than Expectation_edges, then we add node_max to SG and remove it from Neighbor(SG) in lines 19-24. We continue to check the next highest priority node in Neighbor(SG) and judge whether it can be added to SG in lines 6-25. The iterative addition of the neighbors of SG terminates when either of the two conditions in line 19 is not satisfied or when no neighbor nodes remain to be added to SG in line 6.

In the second phase, SE-DMTG allows the removal of any inner node in cluster SG to maximize the value of F(SG) in lines 26-57. We first find the inner nodes that have edges with nodes that are not in SG, denoted as Inner_node(SG), in lines 27-34, and then we test whether each node in Inner_node(SG) can be removed from SG in lines 35-57. We first find the highest priority node according to Eq. (18) in lines 36-43. Meanwhile, we calculate the graph fitness F(SG−{p}) of SG−{p} in line 44. Similarly, we calculate the values of Expectation_edges and Actually_edges in lines 45-47. If the two conditions in line 48 are satisfied, we remove the node from SG and Inner_node(SG) in lines 49-50; otherwise, the second phase is terminated in lines 51-57.
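The boundary sets used in the two phases can be sketched in Python as follows. The sketch computes Neighbor(SG) and Inner_node(SG) and ranks removal candidates; ranking by weight_avg(SG−{p}) is an assumption made by analogy with the ranking by weight_avg(SG+{p}) in the first phase, and the acceptance tests of line 48 are deliberately left out. The toy network and all helper names are hypothetical.

```python
def build_adjacency(weight):
    """Build a symmetric adjacency map from an edge-weight dictionary."""
    adj = {}
    for edge in weight:
        u, v = tuple(edge)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def neighbor_set(cluster, adj):
    """Neighbor(SG): nodes outside SG with at least one edge into SG."""
    return {u for v in cluster for u in adj[v]} - cluster


def inner_nodes(cluster, adj):
    """Inner_node(SG): nodes inside SG with at least one edge leaving SG."""
    return {v for v in cluster if adj[v] - cluster}


def weight_avg(cluster, weight):
    """Eq. (18) for a cluster given as a set of proteins."""
    total = sum(w for edge, w in weight.items() if edge <= cluster)
    return 2.0 * total / len(cluster) if cluster else 0.0


def removal_candidate(cluster, adj, weight):
    """Inner node whose removal gives the highest weight_avg (assumed ranking)."""
    candidates = inner_nodes(cluster, adj)
    if not candidates:
        return None
    return max(sorted(candidates), key=lambda p: weight_avg(cluster - {p}, weight))


# Hypothetical toy network: triangle a-b-c, a chain c-d-e, and a pendant a-f.
weight = {
    frozenset(("a", "b")): 1.0,
    frozenset(("a", "c")): 0.5,
    frozenset(("b", "c")): 0.5,
    frozenset(("c", "d")): 0.25,
    frozenset(("d", "e")): 0.25,
    frozenset(("a", "f")): 0.25,
}
adj = build_adjacency(weight)
sg = {"a", "b", "c", "d"}

print(sorted(neighbor_set(sg, adj)))
print(sorted(inner_nodes(sg, adj)))
print(removal_candidate(sg, adj, weight))
```

In this toy cluster, both a and d touch nodes outside SG, but removing the weakly attached d leaves the dense triangle, so d is ranked first for removal.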

In Algorithm 2, the key idea is to iteratively add the highest priority node in Neighbor(SG) to the cluster SG or remove the highest priority node in Inner_node(SG) from the cluster SG to maximize the value of graph fitness F(SG) (lines 2-59). This process is repeated until the current cluster SG no longer changes and is therefore a locally optimal subgraph (line 59); the detected protein complex is then returned to Algorithm 1 in line 37.

After we obtain a detected complex SG by using Algorithm 2 in line 37, we discard fake protein complexes and complexes whose size is less than 3 [35] in line 39. We then save the detected complex SG in line 40. Meanwhile, SE-DMTG records the nodes in SG in lines 41-45 and selects the next seed node from the remaining nodes in seed queue SQ that have not been included in any of the complexes detected thus far. The next node with the highest score is selected as the seed (lines 31-35). We repeat the above key operations on the PPIN to identify the remaining candidate protein complexes until no seed nodes remain in seed queue SQ (lines 31-49). Note that when this process is repeated, the nodes in previously generated protein complexes remain in the PPIN; therefore, SE-DMTG is able to generate overlapping complexes.
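The outer loop described above can be sketched in Python as follows. This is a minimal sketch, not Algorithm 1 itself: the cluster growth of Algorithm 2 is abstracted as a hypothetical grow callable, the filter for fake complexes is omitted because its criterion is not restated here, and recording nodes only for retained clusters is an assumption.

```python
def detect_complexes(seed_queue, grow, min_size=3):
    """Grow one cluster per eligible seed and keep clusters of at least min_size."""
    complexes = []
    recorded = set()  # nodes of retained clusters; these are never used as seeds
    for seed in seed_queue:
        if seed in recorded:
            continue
        cluster = set(grow(seed))
        if len(cluster) < min_size:
            continue  # discard complexes with fewer than 3 proteins
        complexes.append(cluster)
        # Recorded nodes stay in the network, so later clusters may reuse them.
        recorded |= cluster
    return complexes


# Hypothetical pre-computed growth results standing in for Algorithm 2.
toy_clusters = {
    "a": {"a", "b", "c"},
    "d": {"c", "d", "e"},
    "f": {"f", "g"},
}
seed_queue = ["a", "b", "d", "c", "e", "f"]
result = detect_complexes(seed_queue, lambda seed: toy_clusters[seed])
print([sorted(c) for c in result])
```

In this toy run, b, c and e are skipped as seeds because they already belong to a detected complex, the two-protein cluster grown from f is discarded, and protein c appears in both retained complexes, which illustrates how overlapping complexes arise.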

Finally, SE-DMTG outputs all identified protein complexes in line 56.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding literature and datasets.

Abbreviations

BP: Biological process
CC: Cellular component
ClusterONE: Clustering with overlapping neighborhood expansion
CMC: Clustering-based on maximal cliques
CN: Common neighbors
Co-IP: Co-immunoprecipitation
GO: Gene ontology
MCL: Markov clustering
MCODE: Molecular complex identification
MF: Molecular function
PPINs: Protein-protein interaction networks
SE-DMTG: A seed-extended algorithm for detecting protein complexes based on density and modularity with topological structure and GO annotations
SQ: Seed queue
TAP-MS: Tandem affinity purification with mass spectrometry

References

Victor S, Mirny LA. Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci. 2003; 100:12123–8.
Yu H, Paccanaro A, Trifonov V, Gerstein M. Predicting interactions in protein networks by completing defective cliques. Bioinformatics. 2006; 22:823–9.
Kasper L, E Olof K, Størling ZM, Olason PI, Pedersen AG, Olga R, Hinsby AM, Zeynep T, Flemming P, Niels T. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol. 2007; 25:309.
Safari-Alighiarloo N, Taghizadeh M, Rezaei-Tavirani M, Goliaei B, Peyvandi AA. Protein-protein interaction networks (ppi) and complex diseases. Gastroenterol Hepatol Bed Bench. 2014; 7:17–31.
Chen Y, Jacquemin T, Zhang S, Jiang R. Prioritizing protein complexes implicated in human diseases by network optimization. BMC Syst Biol. 2014; 8:2.
Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R. Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol. 2010; 6:1000641.
Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P. A comprehensive analysis of protein–protein interactions in saccharomyces cerevisiae. Nature. 2000; 403:623.
Yuen H, Albrecht G, Adrian H, Bader GD, Lynda M, Sally-Lin A, Anna M, Paul T, Keiryn B, Kelly B. Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry. Nature. 2002; 415:180.
Zhu H, Bilgin M, Bangham R, Hall D, Casamayor A, Bertone P, Lan N, Jansen R, Bidlingmaier S, Houfek T. Global analysis of protein activities using proteome chips. Science. 2001; 293:2101–5.
Zhao J, Hu X, He T, Li P, Zhang M, Shen X. An edge-based protein complex identification algorithm with gene co-expression data (pcia-geco). IEEE Trans Nanobiosci. 2014; 13:80–8.
Hart GT, Ramani AK, Marcotte EM. How complete are current yeast and human protein-interaction networks? Genome Biol. 2006; 7:1–9.
Nesvizhskii AI. Computational and informatics strategies for identification of specific protein interaction partners in affinity purification mass spectrometry experiments. Proteomics. 2012; 12:1639–55.
Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA. 2001; 98:4569–74.
Anne-Claude G, Patrick A, Paola G, Roland K, Markus B, Martina M, Christina R, Lars Juhl J, Sonja B, Birgit D. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006; 440:631.
Krogan NJ, Gerard C, Haiyuan Y, Gouqing Z, Xinghua G, Alexandr I, Joyce L, Shuye P, Nira D, Tikuisis AP. Global landscape of protein complexes in the yeast saccharomyces cerevisiae. Nature. 2006; 440:637.
Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M, Séraphin B. A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol. 1999; 17:1030–2.
Gentz R, Rauscher FJ, Abate C, Curran T. Parallel association of fos and jun leucine zippers juxtaposes dna binding domains. Science. 1989; 243:1695–9.
Nobumasa T, Taisuke T, Ikuo H, Makiko T, Manabu N, Yasuko T, Gopal T, Takeshi I. The role of presenilin cofactors in the y-secretase complex. Nature. 2003; 422:438–41.
Trevor C, Eivind H. From proteomes to complexomes in the era of systems biology. Proteomics. 2014; 14:24–41.
Chien CT, Bartel PL, Sternglanz R, Fields S. The two-hybrid system: a method to identify and clone genes for proteins that interact with a protein of interest. Proc Natl Acad Sci. 1991; 88:9578–82.
Hartwell LH, Hopfield JJ, Leibler S, Murray AW. From molecular to modular cell biology. Nature. 1999; 402:47–52.
Barabasi A-L, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004; 5:101.
Jianxin W, Xiaoqing P, Min L, Yi P. Construction and application of dynamic protein interaction network based on time course gene expression data. Proteomics. 2013; 13:301–12.
Jianxin W, Xiaoqing P, Min L, Yi P. Cpredictor3.0: detecting protein complexes from ppi networks with expression data and functional annotations. BMC Syst Biol. 2017; 11:135.
Jain AK, Dubes RC. Algorithms for clustering data. Technometrics. 1988; 32:227–9.
Adamcsek B, Palla G, Farkas IJ, Derényi I, Vicsek T. Cfinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006; 22:1021–3.
Liu G, Wong L, Chua HN. Complex discovery from weighted ppi networks. Bioinformatics. 2009; 25:1891–7.
Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003; 4:2.
Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S. Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics. 2006; 7:1–13.
Li M, Chen J-E, Wang J-X, Hu B, Chen G. Modifying the dpclus algorithm for identifying protein complexes based on new topological structures. BMC Bioinformatics. 2008; 9(1):398.
Jiang P, Singh M. Spici: a fast clustering algorithm for large biological networks. Bioinformatics. 2010; 26(8):1105–11.
Cho YR, Hwang W, Ramanathan M, Zhang A. A core-attachment based method to detect protein complexes in ppi networks. BMC Bioinformatics. 2009; 10:169.
Peng W, Wang J, Zhao B, Wang L. Identification of protein complexes using weighted pagerank-nibble algorithm and core-attachment structure. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2015; 12(1):179–92.
Van Dongen S. Graph Clustering by Flow Simulation. University of Utrecht: Amsterdam, PhD Thesis. 2000.
Nepusz T, Yu H, Paccanaro A. Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods. 2012; 9:471.
Wang R, Liu G, Wang C, Su L, Sun L. Predicting overlapping protein complexes based on core-attachment and a local modularity structure. BMC Bioinformatics. 2018; 19:305.
Bhowmick SS, Seah BS. Clustering and summarizing protein-protein interaction networks: A survey. IEEE Trans Knowl Data Eng. 2016; 28:638–58.
Newman ME. Modularity and community structure in networks. Proc Natl Acad Sci. 2006; 103:8577–82.
Li M, Wang J, Chen J. A fast agglomerate algorithm for mining functional modules in protein interaction networks. In: 2008 International Conference on Biomedical Engineering and Informatics. IEEE: 2008. p. 3–7.
Li M, Wang J, Chen J, Pan Y. Hierarchical organization of functional modules in weighted protein interaction networks using clustering coefficient. Berlin, Heidelberg: Springer; 2009, pp. 75–86.
Wang J, Li M, Chen J, Pan Y. A fast hierarchical clustering algorithm for functional modules discovery in protein interaction networks. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2011; 8:607–20.
Cho YR, Hwang W, Ramanathan M, Zhang A. Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics. 2007; 8:265.
Liu Q, Song J, Li J. Using contrast patterns between true complexes and random subgraphs in ppi networks to predict unknown protein complexes. Sci Rep. 2016; 6:21223.
Liu Q, Song J, Li J, Liu Q, Song J, Li J. Classification and feature selection techniques in data mining. Int J Eng Res Technol (ijert). 2012; 1:1–6.
Liu X, Yang Z, Zhou Z, Sun Y, Lin H, Wang J, Xu B. The impact of protein interaction networks’ characteristics on computational complex detection methods. J Theoret Biol. 2018; 439:141–51.
Ren J, Wang J, Li M, Wang L. Identifying protein complexes based on density and modularity in protein-protein interaction network. BMC Syst Biol. 2013; 7:12.
Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D. Defining and identifying communities in networks. Proc Natl Acad Sci. 2004; 101:2658–63.
Zhao B, Wang J, Li M, Wu F-X, Pan Y. Detecting protein complexes based on uncertain graph model. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2014; 11(3):486–97.
Zhang Y, Lin H, Yang Z, Wang J, Liu Y. An uncertain model-based approach for identifying dynamic protein complexes in uncertain protein-protein interaction networks. BMC Genomics. 2017; 18(7):743.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000; 25:25.
Consortium GO. The gene ontology (go) project in 2006. Nucleic Acids Res. 2006; 34:322–6.
Lei X, Jie Z, Fujita H, Zhang A. Predicting essential proteins based on rna-seq, subcellular localization and go annotation datasets. Knowl-Based Syst. 2018; 151:095070511830159.
Liu X, Yang Z, Zhou Z, Sun Y, Lin H, Wang J, Xu B. Dynamic protein interaction network construction and applications. Proteomics. 2014; 14:338–52.
Watts DJ, Strogatz SH. Collective dynamics of ’small-world’ networks. Nature. 1998; 393:440.
Xenarios I, Salwinski L, Duan XJ, Higney P, Kim S-M, Eisenberg D. Dip, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002; 30:303–5.
Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams S-L, Millar A, Taylor P, Bennett K, Boutilier K, et al. Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry. Nature. 2002; 415:180.
Gavin A-C, Bösche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon A-M, Cruciat C-M, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002; 415:141.
Xenarios I, Salwinski L, Duan XJ, Higney P, Kim S-M, Eisenberg D. Dip, the database of interacting proteins: A research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002; 30:303–5.
Keshava Prasad T, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al. Human protein reference database–2009 update. Nucleic Acids Res. 2008; 37:767–72.
Chatr-Aryamontri A, Breitkreutz B-J, Heinicke S, Boucher L, Winter A, Stark C, Nixon J, Ramage L, Kolas N, O’Donnell L, et al. The biogrid interaction database: 2013 update. Nucleic Acids Res. 2012; 41(D1):816–23.
Ma C-Y, Chen Y-PP, Berger B, Liao C-S. Identification of protein complexes by integrating multiple alignment of protein interaction networks. Bioinformatics. 2017; 33(11):1681–8.
Stark C, Breitkreutz B-J, Reguly T, Boucher L, Breitkreutz A, Tyers M. Biogrid: a general repository for interaction datasets. Nucleic Acids Res. 2006; 34(suppl_1):535–9.
Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2008; 37:825–31.
Hong EL, Balakrishnan R, Dong Q, Christie KR, Park J, Binkley G, Costanzo MC, Dwight SS, Engel SR, Fisk DG, et al. Gene ontology annotations at sgd: new data sources and annotation methods. Nucleic Acids Res. 2007; 36:577–81.
Mewes H-W, Amid C, Arnold R, Frishman D, Güldener U, Mannhaupt G, Münsterkötter M, Pagel P, Strack N, Stümpflen V, et al. Mips: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 2004; 32:41–4.
Aloy P, Bottcher B, Ceulemans H, Leutwein C, Mellwig C, Fischer S, Gavin A-C, Bork P, Superti-Furga G, Serrano L, et al. Structure-based assembly of protein complexes in yeast. Science. 2004; 303:2026–9.
Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, Issel-Tarver L, Schroeder M, Sherlock G, et al. Saccharomyces genome database (sgd) provides secondary gene annotation using the gene ontology (go). Nucleic Acids Res. 2000; 30:69–72.
Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes H-W. Corum: the comprehensive resource of mammalian protein complexes—2009. Nucleic Acids Res. 2009; 38(suppl_1):497–501.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. Nature Genet. 2000; 25(1):25.
Luc P-V, Tempst P. Pindb: a database of nuclear protein complexes from human and yeast. Bioinformatics. 2004; 20(9):1413–5.
Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. Kegg for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2011; 40(D1):109–14.
Consortium U. Uniprot: a worldwide hub of protein knowledge. Nucleic Acids Res. 2018; 47(D1):506–15.
Luo J, Li G, Song D, Liang C. Integrating functional and topological properties to identify biological network motif in protein interaction networks. J Comput Theoret Nanosci. 2014; 11:744–50.
Xu B, Guan J. From function to interaction: A new paradigm for accurately predicting protein complexes based on protein-to-protein interaction networks. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2014; 11:616–27.
Cai B, Wang H, Zheng H, Wang H. Integrating domain similarity to improve protein complexes identification in tap-ms data. Proteome Sci. 2013; 11(1):2.
Song J, Singh M. How and when should interactome-derived clusters be used to predict functional modules and protein function? Bioinformatics. 2009; 25(23):3143–50.
Zhang X-F, Dai D-Q, Li X-X. Protein complexes discovery based on protein-protein interaction data via a regularized sparse generative network model. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2012; 9(3):857–70.
Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G. Go: Termfinder–open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics. 2004; 20(18):3710–5.
Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G. Go: Termfinder–open source software for accessing gene ontology information and finding significantly enriched gene ontology terms. Bioinformatics. 2004; 20:3710–5.


Acknowledgements

The authors would like to thank Wu Min, Tamás Nepusz, Guimei Liu and Eileen Marie Hanna for providing codes and datasets.

Funding

This work was supported by the National Natural Science Foundation of China (61772226, 61373051 and 61502343), the Interdisciplinary Research Funding Program for Doctoral Candidates of Jilin University (Grant No. 10183201835) and the Key Laboratory for Symbol Computation and Knowledge Engineering of the National Education Ministry of China. The funding agencies played no role in the design of the study, the collection, analysis and interpretation of data, or the writing of the manuscript.

Author information

Authors and Affiliations

College of Computer Science and Technology, Jilin University, No. 2699 Qianjin Street, Changchun, 130012, China
Rongquan Wang, Liyan Sun & Guixia Liu
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, No. 2699 Qianjin Street, Changchun, 130012, China
Rongquan Wang, Liyan Sun & Guixia Liu
School of International Economics, China Foreign Affairs University, 24 Zhanlanguan Road, Xicheng District, Beijing, 100037, China
Caixia Wang

Authors

Rongquan Wang
Caixia Wang
Liyan Sun
Guixia Liu

Contributions

RW conceived and designed the study and drafted the manuscript. CW participated in the design and discussion of the research and helped to carefully revise the English of the manuscript. LS provided technical implementation assistance. GL participated in the design and coordination of the study and exercised general supervision. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Guixia Liu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

† Equal contributors

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


About this article

Cite this article

Wang, R., Wang, C., Sun, L. et al. A seed-extended algorithm for detecting protein complexes based on density and modularity with topological structure and GO annotations. BMC Genomics 20, 637 (2019). https://doi.org/10.1186/s12864-019-5956-y


Received: 27 March 2019
Accepted: 04 July 2019
Published: 07 August 2019
DOI: https://doi.org/10.1186/s12864-019-5956-y
