 Methodology article
 Open Access
A seed-extended algorithm for detecting protein complexes based on density and modularity with topological structure and GO annotations
BMC Genomics volume 20, Article number: 637 (2019)
Abstract
Background
The detection of protein complexes is of great significance for researching the mechanisms underlying complex diseases and developing new drugs. Thus, various computational algorithms have been proposed for protein complex detection. However, most of these methods are based only on topological information and are sensitive to the reliability of interactions. As a result, their performance is affected by false-positive interactions in protein-protein interaction networks (PPINs). Moreover, these methods consider only density and modularity and ignore protein complexes with various densities and modularities.
Results
To address these challenges, we propose SEDMTG, a Seed-Extended algorithm based on Density and Modularity with Topological structure and GO annotations, to improve the accuracy of protein complex detection in PPINs. First, we use common neighbors and GO annotations to construct a weighted PPIN. Second, we define a new seed selection strategy to select seed nodes. Third, we design a new fitness function to detect protein complexes with various densities and modularities. We compare the performance of SEDMTG with that of thirteen state-of-the-art algorithms on several real datasets.
Conclusion
The experimental results show that SEDMTG not only outperforms some classical algorithms on yeast PPINs in terms of the F-measure and Jaccard but also achieves an ideal performance in terms of functional enrichment. Furthermore, we apply SEDMTG to PPINs of several other species and demonstrate its outstanding accuracy and matching ratio in detecting protein complexes compared with other algorithms.
Background
A protein complex is a group of proteins that interact with each other to perform different cellular functions [1]. The detection of protein complexes from protein-protein interaction networks (PPINs) plays an important role in understanding cell function in the proteomics era. Specifically, protein complexes contribute to the study of protein interaction networks [2], protein function, and diseases [3]. They help researchers study the causes of various diseases, analyze the different stages of diseases [4], and develop new drugs. Current studies have shown that disease genes tend to be highly connected among themselves in disease networks. These highly connected subgraphs could be disease protein complexes, and investigating the cause and effect of these complexes in disease networks could narrow the search space for bioinformaticians, enhance the analysis process [5, 6] and help medical researchers design new drugs. As a result, the detection of protein complexes plays an indispensable role in the study of complex diseases.
During the past decade, because of the development of high-throughput techniques such as yeast two-hybrid [7], mass spectrometry [8], and protein chip technologies [9], the number of available PPINs has rapidly increased, and these networks have been collected in different public databases. In general, protein-protein interactions (PPIs) can be naturally represented in the form of a network, which not only provides a panoramic view of PPIs on a proteomic scale but also helps us understand the basic organization of cell machinery based on the whole network. How to use PPINs to analyze biological systems remains a meaningful task [10]. Although most PPINs are incomplete and inaccurate [11, 12], they reveal biological processes and inherent organizational structures within cells [13–15]. How to accurately discover protein complexes is a main subject in biology and bioinformatics. In biology, some experimental methods have been designed to detect protein complexes, including TAP-MS [16], Co-IP [17–19] and the two-hybrid system [13, 20]. However, biological experimental methods have their own shortcomings; for example, they are time-consuming, relatively expensive and inefficient. Thus, the use of computational algorithms to improve the effectiveness of protein complex detection in PPINs is appealing.
To overcome these experimental constraints, various computational methods have been developed to improve the effectiveness of protein complex detection in PPINs. Some researchers have shown that a protein complex in a PPIN is a molecular structure defined by both function and structure [21]. Furthermore, some related empirical studies on PPINs support this point and indicate that modular components do exist in these networks [22]. These results have two implications: one is that these modules are composed of closely related proteins, which could have many common neighbors from the perspective of network topology; the other is that proteins in the same module perform similar functions together in terms of biology. Thus, many researchers believe that proteins in the same complex generally implement the same or similar functions and tend to interact with each other [23]. A PPIN is usually modeled as an undirected graph, where the nodes represent proteins and the edges correspond to protein-protein interactions. Therefore, protein complexes can be detected by mining the modular structures (i.e., dense subgraphs or subnetworks) from PPINs [24]. Based on this idea, the problem of detecting protein complexes in PPINs can be computationally addressed via graph clustering methods, where the resulting subgraphs or clusters are considered to be protein complexes. Herein, clustering consists of grouping nodes into groups (also called clusters or communities) such that the nodes in the same cluster are more similar to each other than to the nodes in other clusters [25]. Therefore, to overcome the disadvantages of the experimental methods, a series of graph clustering algorithms based on machine learning and data mining have been developed as a complementary choice to detect protein complexes.
Related work
Up to now, a variety of computational algorithms for detecting protein complexes have been proposed. We first give a brief classification of related work. The main categories are approaches based on cliques or dense subgraphs, approaches based on core-attachment structure, approaches based on hierarchical clustering, approaches based on models, and approaches based on supervised learning. We discuss these methods in the following sections.
Approaches based on cliques or dense subgraphs
A large number of existing algorithms assume that protein complexes correspond to k-cliques or highly dense subgraphs. Thus, in the past decade a series of algorithms based on cliques or dense subgraphs have been proposed for detecting protein complexes from PPINs, and many current detection algorithms still belong to this category. For example, Adamcsek et al. [26] provide an application called CFinder that finds k-clique percolation clusters as protein complexes in PPINs. Another example is CMC [27], which first mines the maximal cliques from a weighted PPIN and then removes or merges highly overlapping maximal cliques. However, methods of this kind require a protein complex to be a k-clique or clique. Consequently, some researchers try to discover dense subgraphs by using a heuristic search strategy in a PPIN. For instance, MCODE [28], one of the earliest methods of this kind, detects protein complexes by seed extension, looking for subgraphs with high density in a PPIN. Several years later, Altaf-Ul-Amin et al. [29] proposed DPClus; unlike MCODE, DPClus detects dense subgraphs as protein complexes based on the concepts of density and periphery. Following DPClus, Li et al. [30] presented an improved clustering algorithm called IPCA, based on diameter and density. Later, a fast, memory-efficient clustering algorithm, SPICi [31], was presented. It uses density and a support function to cluster larger networks.
Approaches based on cliques or dense subgraphs are effective at detecting k-cliques or highly dense protein complexes, but they fail to detect sparse subgraphs and relatively peripheral proteins. How to tackle these challenges is a focus for further study.
Approaches based on core-attachment structure
Most approaches based on cliques or dense subgraphs rely on the assumption that highly connected subgraphs may be protein complexes, but these methods ignore the inherent organization of protein complexes. Gavin et al. [14] demonstrated that protein complexes consist of a core and some attachments: proteins in the core are highly interconnected, while attachments or protein modules often interact sparsely with their core and assist it in performing subordinate functions. Several detection algorithms employ this core-attachment structure. They mainly have two stages: the first identifies all dense subgraphs and treats them as protein complex cores, and the second extends each complex core by adding peripheral proteins to it. For example, Wu et al. [32] developed the algorithm named COACH, which first mines dense subgraphs as protein complex cores and then identifies peripheral proteins, which are combined with their protein complex core to form a protein complex. More recently, Peng et al. [33] proposed another algorithm called WPNCA, which uses the PageRank-Nibble algorithm and the core-attachment structure. Experimental results show that WPNCA is superior to other state-of-the-art algorithms in detecting complexes.
Generally speaking, complexes identified with core-attachment structures are relatively large, whereas real protein complexes tend to be smaller. Addressing this mismatch is a direction for further research.
Approaches based on model
Model-based approaches are very popular in protein complex detection because they show excellent performance. Unlike most of the algorithms mentioned above, these approaches focus predominantly on seeking a relational model or graph pattern to predict protein complexes, which is a relatively new way to discover protein complexes. Markov clustering (MCL) [34] is one of the most popular models; it uses a random walk strategy in a PPIN and has two basic operators called expansion and inflation. MCL can tolerate more noise than other types of algorithms. However, its result depends on the inflation parameter, and it does not detect overlapping protein complexes. In fact, overlapping protein complexes make up a large proportion of protein complexes. Based on this fact, Nepusz et al. [35] introduced a method called ClusterONE to predict overlapping protein complexes. ClusterONE was the first to introduce cohesiveness (also called graph modularity) to assess the quality of protein complexes. On the basis of ClusterONE, we introduced CALM [36], an improved method to detect protein complexes. CALM first identifies overlapping nodes and seed nodes by calculating node degree and betweenness, and then uses a greedy local search approach based on core-attachment and local modularity structure to produce detected protein complexes.
Although model-based algorithms perform well in detecting protein complexes, their accuracy needs to be improved by employing network topological features. For example, they could take multiple network topological properties or biological information into account.
Approaches based on hierarchical clustering
Because PPINs exhibit a tree-like form [37] and biological networks are modular in nature [38], some traditional hierarchical clustering algorithms have been applied to detect protein complexes in PPINs. The major difference among them is how they construct the hierarchical structure; more specifically, the key is how to measure the similarity of nodes. Next we introduce some representative algorithms.
Generally, traditional hierarchical clustering algorithms cannot be used directly in PPINs with false positives. To overcome this challenge, Li et al. [39, 40] proposed a fast hierarchical algorithm for identifying protein complexes, named FAG-EC, based on edge clustering coefficients and the λ-module. Wang et al. modified FAG-EC and proposed HCPIN [41] to identify overlapping and hierarchical functional modules in a PPIN.
In summary, approaches based on hierarchical clustering provide a global perspective on the hierarchical modular organization of a PPIN. What's more, they are easy to implement and understand. However, most of them cannot identify overlapping clusters and are sensitive to noise in the PPINs [42]. Thus, their accuracy is limited, and in practice their performance is deficient in some cases.
Approaches based on supervised learning
The computational clustering algorithms discussed above are unsupervised. Each of them considers only one of the multiple topological structures of protein complexes and does not use the known complexes; thus, they may miss complexes with other types of topological structure.
To address this defect, and with the development of supervised learning algorithms, some researchers utilize the information of known complexes to detect protein complexes from PPINs. Supervised learning algorithms generally contain three main steps: (1) extract useful features from the known complexes; (2) train a supervised model to distinguish real complexes from random subgraphs based on the extracted features; (3) detect protein complexes from the PPINs by using the trained model as the fitness evaluation function. So far, ClusterEPs [43] is the best among them. It uses emerging patterns to measure the likelihood that a subgraph is a complex.
Unfortunately, there is no appropriate feature selection method, and PPINs always contain a considerable amount of noise. Moreover, the number of known protein complexes available for training is too small. These disadvantages make the trained model imprecise [44]. Meanwhile, some features are often specific to the PPINs from which they are derived, so the extracted features may not be universal. As a result, performance could decrease [45]. Therefore, overcoming these issues is critical for further improving the accuracy of protein complex detection.
Our work
The above algorithms have been shown to detect protein complexes effectively. Our work builds on the strengths and weaknesses of the related work and on the fact that high-throughput PPINs are noisy and incomplete. Furthermore, proteins in the same protein complex generally possess high functional similarity and share more common neighbors. In this paper, we therefore first integrate both common neighbors and GO annotations to construct a weighted PPIN. According to some evidence and research [30, 35, 46], density-based and modularity-based algorithms have outstanding performance in PPINs. Thus, we define a new model to quantitatively assess protein complex detection by considering both the density and modularity of a subgraph, and we propose a new graph clustering method based on a seed-extension algorithm, named SEDMTG, to detect protein complexes of various densities and modularities. In this process, we grow each seed node into a subgraph until this subgraph is a locally optimal cluster. We then remove redundant detected complexes and treat the remaining complexes as the final identified protein complexes. Finally, to validate the performance of SEDMTG, we apply it to PPINs of three different species and compare the results, in terms of the F-measure and Jaccard, with those of some representative state-of-the-art algorithms by using several known protein complex datasets that are widely used in biological experiments. The experimental results demonstrate that SEDMTG outperforms the other competing algorithms in terms of accuracy and matching with known complexes. In addition, the identified protein complexes are subjected to functional enrichment analysis to ascertain their biological significance.
Results
Protein-protein interaction dataset selection
For performance testing, we carry out all the experiments on PPINs of three species: Saccharomyces cerevisiae (Yeast), Homo sapiens (Human) and Mus musculus (Mouse). For yeast, we mainly test three real yeast PPINs: Krogan core [15], DIP [55] and combined6. The combined6 dataset [27] is generated from six individual experiments, including interactions characterized by mass spectrometry (2002) [56], Gavin et al. (2002, 2006) [14, 57] and Krogan et al. (2006) [15], and interactions produced using two-hybrid techniques [7, 13]. For human, we use two PPINs: DIP (version Hsapi20170205 on 9/5/2019) [58] and a combined dataset from HPRD (Human Protein Reference Database, 7/2010) [59] and BioGRID (version 3.2.109) [60], namely HPRD+BioGRID, which is downloaded from Ref [61]. For mouse, the PPIN of Mus musculus is obtained from BioGRID (version 3.5.172) [62]: we download the BioGRID Mus musculus file (BIOGRID-ORGANISM-Mus_musculus-3.5.172.tab.txt), and then we extract the related mouse file (Biogrid UNIPROT.tab.txt, 14/5/2019). Note that we use the unweighted PPINs to test all algorithms, and we remove all self-interactions and repeated interactions. Detailed information on these datasets is listed in Table 2.
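The cleaning step described above (dropping self-interactions and repeated interactions from an undirected, unweighted PPIN) can be sketched in Python as follows. This is only an illustrative sketch: the function name `clean_ppin` and the protein labels are hypothetical, and the input is assumed to be a list of protein-name pairs.

```python
def clean_ppin(edges):
    """Return sorted unique undirected interactions without self-loops.

    `edges` is assumed to be an iterable of (protein_a, protein_b) pairs.
    """
    unique = set()
    for a, b in edges:
        if a == b:
            # Self-interaction: removed.
            continue
        # An undirected edge (a, b) is the same as (b, a), so store a
        # canonical ordering to collapse repeated interactions.
        unique.add(tuple(sorted((a, b))))
    return sorted(unique)


# Hypothetical toy input with one duplicate and one self-loop.
raw_edges = [("P1", "P2"), ("P2", "P1"), ("P1", "P1"), ("P3", "P2")]
cleaned_edges = clean_ppin(raw_edges)
```

On the toy input, only the two distinct interactions P1-P2 and P2-P3 remain.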
Protein complexes selection
To evaluate the performance of different protein complex detection algorithms, we use known protein complex sets as standards. For yeast, we employ two sets, CYC2008 [63] and SGD [64], to evaluate the quality of the protein complexes identified by various algorithms in yeast PPINs. In particular, CYC2008 is constructed from three sources: (1) MIPS [65], (2) Aloy et al. [66], and (3) the SGD database [67]. For human, we use two sets of standard complexes: CORUM complexes [68] and CGPK complexes [61]. The CGPK complexes are constructed from four sources: (1) the Comprehensive Resource of Mammalian protein complexes (CORUM) [68]; (2) protein complexes annotated by GO [69]; (3) the Proteins Interacting in the Nucleus database (PINdb) [70]; and (4) KEGG modules [71]. For mouse, we use the CORUM complexes [68]. Following the work of Nepusz et al. [35], we eliminate protein complexes that are made up of fewer than three proteins and discard some redundant protein complexes. The remaining known protein complexes in these databases are used for performance evaluation. A summary of these standard protein complexes is presented in Table 3.
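A minimal Python sketch of this filtering step is given below. The size threshold of three comes from the text; the redundancy criterion is not specified here, so the sketch assumes that only complexes with identical membership count as redundant. The function name and protein labels are hypothetical.

```python
def filter_reference_complexes(complexes, min_size=3):
    """Keep complexes with at least `min_size` proteins, dropping duplicates.

    Assumption: "redundant" means identical membership. The actual
    redundancy criterion may be different.
    """
    seen = set()
    kept = []
    for members in complexes:
        members = frozenset(members)
        if len(members) < min_size or members in seen:
            continue
        seen.add(members)
        kept.append(members)
    return kept


# Hypothetical toy reference set.
reference = [
    {"A", "B"},            # too small, removed
    {"A", "B", "C"},       # kept
    {"C", "B", "A"},       # same membership as the previous one, removed
    {"D", "E", "F", "G"},  # kept
]
filtered = filter_reference_complexes(reference)
```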
Preprocessing
For yeast, we directly use protein names to represent the proteins in the PPINs and protein complexes. For human and mouse, the PPINs and standard protein complexes come from different sources and are heterogeneous in many aspects. Therefore, we use the UniProt ID [72] to represent each protein in this study, which gives a uniform way to represent proteins across the different PPINs and standard protein complexes. In the process, we remove all duplicate interactions and all proteins that have no associated UniProt accession ID.
Gene Ontology (GO) selection
For the Gene Ontology (GO) file for yeast, we use GO slims, a cut-down version of GO that contains a subset of the terms in the whole yeast GO. Since the GO slims of CC include some protein complex information, we use only the GO slims of BP and MF as GO annotations. The GO slim information is downloaded from the website (https://www.yeastgenome.org/). Similarly, for human and mouse, we annotate each protein with its associated Biological Process (BP) and Molecular Function (MF) GO annotations from UniProt [72] (available at https://www.uniprot.org/), and we download these mapping files.
Evaluation metrics
This section introduces the evaluation metrics used in this paper. These metrics calculate the degree of matching between the complexes identified by different algorithms and the standard complexes. Generally, the values of these metrics fall in the interval between 0.0 and 1.0. The higher the value, the better the quality of the clustering results and the better the performance of the detection algorithm.
1) Precision, Recall, and F-measure: To evaluate the performance of all algorithms, we match generated complexes with known complexes. First, we introduce the overlap score (OS) between an identified protein complex p and a known complex g, which is presented as follows [73]:
\(OS(p,g)=\frac {|N_{p}\cap N_{g}|^{2}}{|N_{p}|\times |N_{g}|}\) (1)
Here, |N_{p}| is the size of the detected complex, |N_{g}| is the size of the known complex, and |N_{p}∩N_{g}| is the number of proteins common to the detected and known complexes. If OS(p,g) ≥ ω, we consider p and g to match each other. In our experiment, we set ω = 0.2, which is consistent with previous studies [28,29].
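The matching rule can be illustrated with a short Python sketch. It assumes that OS is the squared size of the intersection divided by the product of the two complex sizes (the form commonly used with the threshold ω = 0.2); the function names and protein labels are hypothetical.

```python
def overlap_score(detected, known):
    """Overlap score OS(p, g), assumed to be |p ∩ g|^2 / (|p| * |g|)."""
    p, g = set(detected), set(known)
    if not p or not g:
        return 0.0
    common = len(p & g)
    return common * common / (len(p) * len(g))


def is_match(detected, known, omega=0.2):
    """A detected and a known complex match when OS(p, g) >= omega."""
    return overlap_score(detected, known) >= omega


# Hypothetical example: 3 shared proteins, sizes 4 and 5.
detected_complex = {"A", "B", "C", "D"}
known_complex = {"B", "C", "D", "E", "F"}
score = overlap_score(detected_complex, known_complex)  # 3^2 / (4 * 5) = 0.45
```

Because the numerator is squared, a complex that shares only a few proteins with a much larger complex scores low even when all of its own proteins are covered.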
After the overlap score (OS) has been defined, we can give the definitions of Precision, Recall, and F-measure as follows [74]:
\(F\text{-}measure=\frac {2\times Precision\times Recall}{Precision+Recall}\)
where Precision =\(\frac {N_{{cp}}}{P}\) and Recall =\(\frac {N_{{cg}}}{G}\). The F-measure is the harmonic mean of Precision and Recall, which can assess the overall performance of the detection algorithms.
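The following Python sketch shows one way to compute these three quantities. It relies on assumptions that are not spelled out in the text: N_cp is taken to be the number of detected complexes that match at least one known complex, N_cg the number of known complexes that match at least one detected complex, P and G the total numbers of detected and known complexes, and OS the squared-intersection form. All names and data are hypothetical.

```python
def overlap_score(p, g):
    """Assumed form of OS: |p ∩ g|^2 / (|p| * |g|)."""
    p, g = set(p), set(g)
    if not p or not g:
        return 0.0
    return len(p & g) ** 2 / (len(p) * len(g))


def precision_recall_f(detected, known, omega=0.2):
    """Return (precision, recall, f_measure) under the assumptions above."""
    # N_cp: detected complexes matching at least one known complex.
    n_cp = sum(
        1 for p in detected if any(overlap_score(p, g) >= omega for g in known)
    )
    # N_cg: known complexes matching at least one detected complex.
    n_cg = sum(
        1 for g in known if any(overlap_score(p, g) >= omega for p in detected)
    )
    precision = n_cp / len(detected) if detected else 0.0
    recall = n_cg / len(known) if known else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical toy data: 2 of 3 detected and 2 of 4 known complexes match.
detected = [{"A", "B", "C"}, {"D", "E", "F"}, {"X", "Y", "Z"}]
known = [{"A", "B", "C", "D"}, {"E", "F", "G"}, {"M", "N", "O"}, {"P", "Q", "R"}]
precision, recall, f_measure = precision_recall_f(detected, known)
```

In this toy case precision is 2/3, recall is 1/2, and the F-measure is 4/7.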
2) JaccardI, JaccardS and Jaccard: As is well known, Precision, Recall and F-measure rely on a threshold to judge whether a standard complex and an identified complex match. This has limitations because it does not consider the extent of the overlap between identified complexes and the corresponding standard complexes [75]. Therefore, we use the Jaccard measure to evaluate clustering results [76,77]. It considers the proportion of the overlap in the union of an identified complex and a standard complex [75]. For more details, please refer to Song et al. [76].
Before we give these metrics, we first introduce some notation. Let I be the set of identified complexes obtained by a specific detection algorithm, and let S be the set of standard complexes. Moreover, let S_{i}∈S be a standard complex and I_{j}∈I be an identified complex; the Jaccard coefficient between them is defined as \(Jac(S_{i},I_{j})=\frac {|S_{i}\cap I_{j}|}{|S_{i}\cup I_{j}|}\) [77]. For each identified complex I_{j}, its Jaccard measure is the maximum Jaccard coefficient over all standard complexes, i.e., \(Jac(I_{j}) = max_{S_{i}\in S} Jac(I_{j},S_{i})\). Taking an average over the identified complexes, weighted by complex size, we compute the weighted average Jaccard measure for all identified complexes in I:
\(JaccardI=\frac {\sum _{I_{j}\in I}|I_{j}|\, Jac(I_{j})}{\sum _{I_{j}\in I}|I_{j}|}\)
Similarly, for a standard complex S_{i}, its Jaccard measure is \(Jac(S_{i}) = max_{I_{j}\in I} Jac(S_{i},I_{j})\) and
\(JaccardS=\frac {\sum _{S_{i}\in S}|S_{i}|\, Jac(S_{i})}{\sum _{S_{i}\in S}|S_{i}|}\)
Finally, the Jaccard measure between identified complexes and standard complexes is defined as the harmonic mean of JaccardI and JaccardS:
\(Jaccard=\frac {2\times JaccardI\times JaccardS}{JaccardI+JaccardS}\)
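The Jaccard-based metrics can be sketched in Python as below. The sketch assumes that both JaccardI and JaccardS are averages of the best Jaccard coefficient weighted by complex size, as the description suggests; function names and protein labels are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two protein sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_best_jaccard(source, target):
    """Size-weighted mean, over `source`, of the best Jaccard against `target`."""
    total_size = sum(len(set(c)) for c in source)
    if total_size == 0 or not target:
        return 0.0
    weighted = sum(
        len(set(c)) * max(jaccard(c, t) for t in target) for c in source
    )
    return weighted / total_size


def jaccard_measure(identified, standard):
    """Return (JaccardI, JaccardS, Jaccard), the last being the harmonic mean."""
    jac_i = weighted_best_jaccard(identified, standard)
    jac_s = weighted_best_jaccard(standard, identified)
    total = jac_i + jac_s
    jac = 2 * jac_i * jac_s / total if total else 0.0
    return jac_i, jac_s, jac


# Hypothetical toy data.
identified = [{"A", "B", "C"}, {"D", "E"}]
standard = [{"A", "B", "C", "D"}]
jac_i, jac_s, jac = jaccard_measure(identified, standard)
```

Here JaccardI is (3 × 0.75 + 2 × 0.2) / 5 = 0.53 and JaccardS is 0.75. Unlike the threshold-based F-measure, partial overlaps contribute in proportion to their size.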
According to its definition, the Jaccard measure could evaluate the performance of detection algorithms better than the F-measure, especially when comparing the matching rates of different algorithms.
3) p-value: To evaluate the statistical significance of the detected protein complexes, many researchers annotate their main biological functions by using the p-value [23,78]. We perform a functional enrichment test to demonstrate the biological significance of the protein complexes detected by different algorithms. In this paper, we use LAGO [78] to carry out the functional enrichment test with different thresholds. LAGO is a fast tool that finds significant GO terms among a list of gene names; it computes the significance (p-value) via the hypergeometric distribution and applies (by default) Bonferroni correction. For the details of calculating the p-value, please refer to [78]. The p-value measures the biological relevance of a detected protein complex and can be written as follows:
\(p\text{-}value=1-\sum _{i=0}^{k-1}\frac {\binom {F}{i}\binom {N-F}{C-i}}{\binom {N}{C}}\)
where N is the number of proteins in the PPIN, F is the size of a functional group in the PPIN, C is the number of proteins in the detected protein complex, and k is the number of proteins of the functional group in the protein complex. Generally, the lower the p-value is, the stronger the biological significance of the protein complex. A detected protein complex with a p-value of less than 0.01 is deemed to be meaningful. In addition, larger protein complexes tend to have smaller p-values.
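A small Python sketch of the hypergeometric tail follows, assuming the p-value is the probability of observing at least k proteins of the functional group in a complex of size C. It does not reproduce LAGO itself and omits the Bonferroni correction; the function name and the numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(N, C, F, k):
    """Upper-tail hypergeometric probability P(X >= k).

    N: proteins in the PPIN, C: proteins in the detected complex,
    F: proteins in the functional group, k: group proteins in the complex.
    Summing the upper tail directly equals 1 - sum_{i=0}^{k-1}(...) but
    avoids cancellation error when the p-value is very small.
    """
    total = comb(N, C)
    tail = sum(
        comb(F, i) * comb(N - F, C - i) for i in range(k, min(C, F) + 1)
    )
    return tail / total


# Hypothetical example: 10 proteins, complex of 4, functional group of 5,
# 3 of which fall in the complex.
p = enrichment_p_value(N=10, C=4, F=5, k=3)  # (10*5 + 5*1) / 210
significant = p < 0.01
```

In this toy case p is 55/210 (about 0.26), so the complex would not pass the 0.01 threshold used in the text.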
Comparison with existing algorithms based on known protein complexes
We run experiments on six PPINs to compare our SEDMTG algorithm with the following state-of-the-art protein complex detection algorithms: MCODE [28], MCL [34], CFinder [26], DPClus [29], IPCA [30], CMC [27], COACH [32], HCPIN [41], SPICi [31], ClusterONE [35], WPNCA [33], CALM [36], and ClusterEPs [43]. All parameters are set as their authors advised, as listed in Table 4. To evaluate the performance of all algorithms more comprehensively, all the detection algorithms are tested on three different species: yeast, human and mouse. The three yeast PPINs are the Krogan core, DIP and combined6 datasets. The human PPINs are DIP and a combined dataset (HPRD+BioGRID). We use the BioGRID dataset as the mouse PPIN. All results are presented in Tables 5, 6, 7, 8 and 9. Because the results are similar, we analyze only the results on yeast in detail and briefly introduce the rest.
The F-measure results of the different algorithms on yeast PPINs are summarized in Table 5. As Table 5 shows, although SEDMTG does not always obtain the best precision or recall, it always stays in the top three. Furthermore, SEDMTG obtains the best F-measure on all three yeast datasets, which means that it strikes a better balance between precision and recall than the other algorithms and that its overall accuracy in detecting protein complexes is better. The principal reason is that SEDMTG takes into consideration not only gene ontology data but also the topological structure of the tested PPIN.
We mentioned the limitations of precision, recall and F-measure earlier in this paper. We therefore also employ the Jaccard measure to reflect the match ratio between the detected protein complex set and the standard complex set. Table 6 presents the comparative results of the different algorithms evaluated with Jaccard metrics by using the CYC2008 and SGD standard complexes, respectively. As can be seen from Table 6, SEDMTG consistently outperforms the compared algorithms on the Jaccard metric in all three yeast PPINs, which suggests that SEDMTG performs better than the other classic algorithms in terms of matching ratio. According to the above analysis, the new fitness function we designed is suitable for the problem of protein complex detection, and it seems reasonable to use GO annotations for the detection of protein complexes.
Moreover, we use the Krogan core dataset to compare all methods with CYC2008 and SGD as the standard complexes. As shown in Table 6, the Jaccard of SEDMTG reaches 0.4688 and 0.4008, respectively, which significantly outperforms the other algorithms. Similarly, on the DIP dataset, SEDMTG achieves the highest Jaccard (0.386 and 0.3485). For the combined6 dataset, SEDMTG also achieves the highest Jaccard values, 0.5208 and 0.493, respectively. The Jaccard values of SEDMTG on the combined6 dataset are thus higher than those on the other datasets. This is mainly because combined6 is more reliable than the other two datasets: a PPIN that combines multiple source datasets may contain more real protein-protein interactions.
To further demonstrate the effectiveness of SEDMTG on PPINs of other species, we also carry out experiments on the human and mouse PPINs. All comparison results are listed in Tables 7, 8 and 9. SEDMTG again achieves the highest F-measure and Jaccard in most cases. It is noteworthy that a higher F-measure means that protein complexes are identified more accurately, and a higher Jaccard means that a detection algorithm has a better matching ratio between detected protein complexes and real protein complexes. In summary, across PPINs of different species, SEDMTG has the best performance among the compared algorithms in terms of F-measure and Jaccard.
Biological significance of the detected protein complexes
Due to the incompleteness of the known protein complexes, we also calculate the p-value of the detected protein complexes on the Cellular Component (CC) ontology by using the tool LAGO (http://go.princeton.edu/cgi-bin/LAGO), which performs functional enrichment analysis [78]. All parameters of LAGO are set to their defaults. Because CC includes information on protein complexes, it allows a better comparison of the performance of different algorithms. Each protein complex detected by a detection algorithm is associated with a p-value that indicates the significance of its GO annotations. If the p-value of a protein complex is less than 0.01, we consider it biologically significant. In fact, the p-values of detected protein complexes are closely related to their size [33].
Here, to evaluate the functional enrichment of protein complexes detected by different algorithms more comprehensively, we mainly focus on the following three aspects: (1) the number of significant detected protein complexes; (2) the percentage of significant detected protein complexes; (3) the average p-value of detected protein complexes. We select DPClus, IPCA, CMC, COACH, SPICi, ClusterONE and WPNCA for comparison with SEDMTG because these algorithms perform robustly on most of the datasets (see their results in Tables 5, 6, 7, 8 and 9). The p-values of these algorithms and SEDMTG are presented in Table 10.
In Table 10, we summarize the results of DPClus, IPCA, CMC, COACH, SPICi, ClusterONE, WPNCA and SEDMTG from functional enrichment tests with different p-value thresholds. As shown in Table 10, in most cases SEDMTG detects more candidate protein complexes than methods such as DPClus, CMC, SPICi and ClusterONE in all PPINs. Furthermore, in terms of the number, percentage and average p-value of significant detected protein complexes, SEDMTG compares favorably with these algorithms. As Table 10 shows, although IPCA detects the largest number of significant protein complexes, its percentage and average p-value of significant detected protein complexes are slightly lower than those of SEDMTG, COACH and WPNCA. The percentage and average p-value of significant protein complexes detected by SEDMTG from the six PPINs are a bit lower than those of COACH and WPNCA, ranking third among all methods. The major reason is that the protein complexes detected by SEDMTG are smaller than those detected by COACH and WPNCA, and smaller detected protein complexes have larger p-values. We discuss this further in the section on the relationship between the size of identified protein complexes and the p-value of significant detected protein complexes.
Examples of detected complexes
Tables 11 and 12 further present 18 protein complexes with very low p-values (≤ E-20) detected by our SEDMTG algorithm in six datasets. The p-values of these detected protein complexes are very low, which demonstrates that the protein complexes detected by SEDMTG have high statistical significance.
To further illustrate the comparison results obtained by SEDMTG, we take the 391st known protein complex of the CGPK complexes, the 'RNase complex', as an example. As shown in Fig. 1a, the known protein complex has 11 proteins. The protein complex detected by SEDMTG also consists of 11 proteins and successfully matches all of them, so its OS is 100%, which is the highest among all algorithms. This result is shown in Fig. 1b. IPCA, DPClus, COACH, WPNCA, MCL and SPICi cover 11, 11, 11, 11, 6 and 10 proteins of the real RNase complex, respectively, but their OS values are only 73%, 73%, 68%, 68%, 54% and 47%, respectively. For the remaining compared algorithms, the OS (see Eq. (1)) is lower than 0.47 or no detected result is obtained, so we do not show them in Fig. 1. This result means that SEDMTG can detect protein complexes accurately, indicating that the new definition of a protein complex is also a good model for characterizing the topological structure of protein complexes. Additionally, this example explains why SEDMTG achieves the highest F-measure and Jaccard while its percentage of significant detected protein complexes and its average p-value are slightly lower than those of COACH and WPNCA. In summary, the protein complexes detected by SEDMTG are more biologically significant.
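The gap between "covers all 11 proteins" and "OS of 73%" can be illustrated with a small sketch. Eq. (1) is not reproduced in this section, so the code below assumes the commonly used neighborhood-affinity form, OS = |A∩B|^2 / (|A|·|B|); the function name and the protein labels are hypothetical.

```python
def overlap_score(detected, known):
    """Assumed overlap score: |A ∩ B|^2 / (|A| * |B|)."""
    detected, known = set(detected), set(known)
    if not detected or not known:
        return 0.0
    shared = len(detected & known)
    return shared * shared / (len(detected) * len(known))


# Hypothetical 11-protein reference complex.
known = {f"P{i}" for i in range(11)}
# A detected complex that matches the reference exactly.
exact = set(known)
# A detected complex that covers all 11 proteins but also contains 4 extra ones.
larger = known | {f"X{i}" for i in range(4)}

print(overlap_score(exact, known))             # 1.0
print(round(overlap_score(larger, known), 2))  # 0.73
```

Under this assumed form, a detected complex can contain every protein of the reference complex and still score well below 100% when it also includes additional proteins.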
In summary, based on the results of the p-value test, we conclude that SEDMTG detects protein complexes quite accurately and achieves good functional enrichment compared with the other thirteen algorithms.
Discussion
The relationship between the size of detected protein complexes and the p-value of detected protein complexes
To illustrate the relationship between the size of detected protein complexes and their p-values, we perform some statistical analysis. Because the sizes of both standard complexes and detected protein complexes resemble a 'power law' distribution, we display only part of the distribution in Fig. 2. According to Fig. 2a, most standard complexes are very small. As shown in Fig. 2b, standard complexes whose size is less than or equal to 7 account for 76.96%. Meanwhile, our statistics show that the average size of the combined standard complexes is 6.38 and the average size of the protein complexes detected by SEDMTG is 6.86, whereas the average sizes of the protein complexes detected by IPCA, COACH and WPNCA are 10.96, 10.20 and 27.12, respectively. The average size of the protein complexes detected by SEDMTG is therefore similar to that of the standard complexes. In Fig. 2c, however, we find that IPCA, COACH and WPNCA detect a larger number of large protein complexes. Additionally, the size distribution of the protein complexes detected by SEDMTG is similar to that of the standard complexes (Fig. 2a and c).
Next, Fig. 3 illustrates the relationship of the size of protein complexes with the percentage of significant detected protein complexes and the average p-value of detected protein complexes. From Fig. 3, it is clear that the p-value (E) decreases gradually as the size of the detected protein complexes increases. For example, the p-value of standard complexes decreases gradually with increasing complex size in Fig. 3a. Similarly, for the protein complexes detected by IPCA in Fig. 3c, the p-value decreases gradually as the size of the detected protein complexes increases. This illustrates that large detected protein complexes have small p-values. However, from Fig. 2a and b, we know that most standard complexes and most protein complexes detected by SEDMTG are small. The above analysis explains why SEDMTG achieves higher accuracy and matches the standard complexes better according to Tables 5, 6, 7, 8 and 9. As for the percentage of significant detected protein complexes and the average p-value of detected protein complexes, SEDMTG is slightly lower than COACH and WPNCA and ranks third among all methods according to Table 10.
All in all, although the p-value has limitations in evaluating the functional significance of detected protein complexes, it still reflects their functional enrichment to a certain degree. Overall, considering their superior accuracy and matching ratio and their strong performance in the functional enrichment test, we believe the protein complexes detected by SEDMTG are more likely to be real protein complexes.
Computational complexity of SEDMTG
Experimental setup
We implement SEDMTG in Python and run all experiments on a PC with a 64-bit Windows system, 12 GB of memory and a 3.60 GHz Intel i7 CPU. All state-of-the-art methods are executed on the same machine, except SPICi, which is used through its web site.
Time complexity analysis
In this part, we analyze the time complexity of the SEDMTG algorithm. It is difficult to give the exact computational complexity of SEDMTG because it depends not only on the number of detected protein complexes but also on their size. Moreover, for each seed, we need to execute an iterative procedure until the current cluster no longer changes, so the number of iterations obviously has a significant influence on the computational complexity of SEDMTG. Thus, we only roughly analyze the time complexity. Let n and m denote the number of nodes and edges in graph G, respectively, and let \(\overline {k}\) be the average number of neighbors of all the nodes. Then we have \(\overline {k}=\frac {\sum \nolimits _{v\in V}N(v)}{n}\), where N(v) is the number of all neighbors of v. In the step of constructing a weighted PPIN, the time complexity of calculating the weights of all edges is \(O\left (n*\overline {k}\right)=O\left (n*\frac {\sum \nolimits _{v\in V}N(v)}{n}\right)=O\left (\sum \nolimits _{v\in V}N(v)\right)=O(2*m)\). In the step of constructing the seed queue SQ and selecting the initial cluster, according to Eq. (12), the time complexity of calculating the score of each protein is \(O\left (n*\left (\overline {k}+1\right)^{2}\right)=O\left (n*\left (\frac {\sum \nolimits _{v\in V}N(v)}{n}+1\right)^{2}\right)=O\left (\frac {4*m^{2}}{n}+4*m+n\right)\), and the time complexity of sorting all proteins by their Score(v) is O(n∗log(n)). In the step of generating detected protein complexes, the worst case is that we need to calculate the fitness of each protein, and its worst-case time complexity is also \(O\left (\frac {4*m^{2}}{n}+4*m+n\right)\).
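The closed form quoted for the seed-scoring step follows from the handshake identity: every edge contributes to the neighbor count of both of its endpoints, so the neighbor counts sum to 2m and the average number of neighbors is 2m/n. As a short worked expansion that relies only on that identity:

```latex
\begin{aligned}
n\left(\overline{k}+1\right)^{2}
  &= n\left(\frac{2m}{n}+1\right)^{2} \\
  &= n\left(\frac{4m^{2}}{n^{2}}+\frac{4m}{n}+1\right) \\
  &= \frac{4m^{2}}{n}+4m+n .
\end{aligned}
```

For sparse networks, where m grows roughly linearly with n, the three terms are of the same order, so this step stays close to linear in the size of the network.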
In the step of generating detected protein complexes, we first analyze the time complexity when SEDMTG iteratively adds proteins to the cluster SG from its neighbors. This has three basic phases: (1) Obtain all candidate nodes that may be added to the cluster SG, whose time complexity is \(O(n_{{SG}}*\overline {k})=O\left (n_{{SG}}*\frac {\sum \nolimits _{v\in V}N(v)}{n}\right)=O\left (\frac {2*n_{{SG}}*m}{n}\right)\), where n_{SG} is the number of nodes in the cluster SG. (2) Find the highest priority vertex according to Eq. (18) and add it to the cluster SG. In the worst case every candidate node is checked, so the time complexity is \(O\left ((N_{{SG}}+(N_{{SG}}-1)+...+1)*\overline {k}\right)=O\left (\frac {m*N_{{SG}}*(N_{{SG}}-1)}{n}\right)\), where N_{SG} is the number of neighbors of SG. (3) Calculate the fitness of graph SG, whose time complexity is \(O(n_{{SG}}^{2})\). Thus, the total time complexity of iteratively adding candidate nodes to the cluster SG is \(O\left (\frac {2*n_{{SG}}*m}{n}+\frac {m*N_{{SG}}*(N_{{SG}}-1)}{n}+n_{{SG}}^{2}\right)\). We further analyze the time complexity of iteratively removing inner nodes from SG. Similarly, it has three basic calculations: (1) Determine the inner nodes that may be removed from the cluster SG. Its time complexity is also \(O\left (\frac {2*n_{{SG}}*m}{n}\right)\). (2) Find the highest priority vertex according to Eq. (18) in order to remove it from the cluster SG. Its time complexity is also \(O\left ((N_{{SG}}+(N_{{SG}}-1)+...+1)*\overline {k}\right)=O\left (\frac {m*N_{{SG}}*(N_{{SG}}-1)}{n}\right)\). (3) Calculate the fitness of graph SG. Its time complexity is \(O(n_{{SG}}^{2})\). Hence the time complexity of this step is \(O\left (\frac {2*n_{{SG}}*m}{n}+\frac {m*N_{{SG}}*(N_{{SG}}-1)}{n}+n_{{SG}}^{2}\right)\).
Suppose t is the number of iterations when we generate a detected protein complex and N is the number of detected protein complexes. Then the time complexity of Algorithm 2 is \(O\left (N*t*\frac {m}{n}*\left (N_{{SG}}*(N_{{SG}}-1)+3*n_{{SG}}*(1+n_{{SG}})\right)\right)\). Finally, we need to discard some redundant protein complexes, whose time complexity is O(PCs^{2}), where PCs is the number of candidate identified protein complexes. All in all, the time complexity of the SEDMTG algorithm is \(O\left (2*m+\frac {4*m^{2}}{n}+4*m+n+n*log(n)+N*t*\frac {m}{n}*\left (N_{{SG}}*(N_{{SG}}-1)+3*n_{{SG}}*(1+n_{{SG}})\right)+len(PCs)^{2}\right)\), where N, t and PCs are constant. In addition, we treat N_{SG} and n_{SG} as variables. To facilitate an intuitive understanding of these variables, more details are provided in Table 13.
Conclusion
Many high-throughput experimental techniques and computational algorithms have been developed to identify protein complexes from the PPINs. However, most of these methods are based on the original network or use the topological property alone and are thus limited not only in the quality of protein complex identification but also in that they ignore other useful biological information, such as functional properties. In our opinion, both topological and functional properties are meaningful and important for identifying protein complexes. We therefore combine common neighbor and functional properties to calculate edge weights and construct weighted PPINs. Moreover, we also propose a new local search heuristic graph clustering algorithm, SEDMTG, to extract detected protein complexes with various densities and modularities based on a new model. Although models that consider density or modularity have been applied to study PPINs, our model is novel in considering both density and modularity simultaneously.
We evaluate the performance of the proposed SEDMTG on three PPINs of species under some standard complex datasets and compare the results with those of thirteen competing algorithms. The experimental results show that SEDMTG is competitive in identifying protein complexes and that adding the topological information and GO information increases the detection accuracy. Meanwhile, the experimental results reveal that SEDMTG overall outperforms the current state-of-the-art algorithms in terms of some measures. Furthermore, we analyze the biological significance of the protein complexes detected by different methods. The results show that the protein complexes detected by SEDMTG are biologically significant. With the wide application of supervised learning, we will try in the future to design a new algorithm that combines a classification model with unsupervised clustering algorithms to improve performance. Additionally, SEDMTG is also robust to false positives in experimental data because of the integration of functional properties. Furthermore, SEDMTG may be extended naturally to other types of biological data fusion to study more comprehensive characteristics of the biological networks and to analyze other forms of complex networks, such as Internet networks, citation networks, ecological networks and social networks.
Methods
Preliminaries
Since the interactions among proteins in the PPINs are symmetric, these PPINs can be formulated as an undirected weighted graph G=(V,E,W), where V is a set of nodes representing the proteins of the PPINs, E is a set of undirected edges corresponding to those interactions, and W represents the likelihoods between nodes. In this paper, we obtain the weights by using the topological information and the biological information. The symbols, abbreviations and their interpretation are shown in Table 1.
Algorithm framework
The SEDMTG algorithm is developed to detect protein complexes based on GO annotations and the topological structure of PPINs. Furthermore, we propose a composite model for the identification of protein complexes. Algorithm 1 represents the main function of the proposed SEDMTG. SEDMTG operates in three phases. In the first step, given a PPIN, we construct a weighted PPIN by using common neighbors and GO annotations defined by Eqs. (7) and (8). In the second step, SEDMTG constructs a seed node queue based on a seed score function to form the initial cluster defined by Eq. (12). In the third step, based on the initial cluster from the previous step, we provide a quantitative definition of protein complexes to formulate the problem of protein complex identification as an optimization problem defined by Eq. (17). Finally, we apply an iterative greedy search process to generate protein complexes (see Algorithm 2). False and redundant candidate protein complexes are filtered out to ultimately obtain the identified protein complexes. Figure 4 shows a flowchart of SEDMTG, which is composed of the following main steps:

1. Construct a weighted PPIN based on common neighbors and GO annotations.
2. Generate a seed queue and form an initial cluster.
3. Define the protein complex model.
4. Extend and correct the cluster to generate a locally optimal subgraph.
5. Obtain a list of identified protein complexes.
In step 1, the edge clustering coefficient probability is computed based on common neighbors via Eq. (7). The functional similarity between two proteins is calculated based on GO annotations according to Eq. (8). In step 2, we give each protein a score on the basis of both the weight degree (see Eq. (10)) and the neighborhood graph clustering coefficient (see Eq. (11)), and we sort the proteins based on their score according to Eq. (12). In step 3, we introduce a new model to estimate the quantitative value of a cluster (see Eq. (17)). In step 4, we iteratively extend and correct the cluster to generate a protein complex from the weighted PPIN. This process involves four substeps: selecting the highest-scoring protein as the seed node (see the section on generating a seed queue and forming the initial cluster); assessing the priority of boundary nodes (see the section on determining the priority of boundary nodes); iteratively adding neighbor nodes to the cluster and removing inner nodes from the cluster; and filtering out false candidate protein complexes whose size is less than or equal to two (see the section on extending and correcting the cluster to generate a locally optimal subgraph). In step 5, we discard some redundant candidate protein complexes and output a list of identified protein complexes. For more details of these processes, see the related sections.
Construction of a weighted PPIN based on common neighbors and GO annotations
Recent studies [30,35,36] have shown that the accuracy of protein complex detection can be significantly improved by taking network weights into account. In the following subsections, we introduce how to calculate the weight of the PPIN.
Common neighbors
The edge clustering coefficient [47] was first developed to describe how strongly neighbors are connected. However, Radicchi et al. [47] note that the edge clustering coefficient may not be suitable for use in PPINs because PPINs are disassortative networks. To overcome this limitation, Zhao et al. [48,49] propose a new method to calculate the probability of protein-protein interactions. Following their work, we use the same method, namely common neighbors (CN), to calculate the weight of each edge. The existence probability of an edge (v,u) in a PPIN is defined as follows:
where N(v) and N(u) are the neighborhood sets of v and u, respectively. In Eq. (7), N(v)∩N(u) denotes the set of common neighbors of the two proteins. CN is a measure that describes how closely proteins v and u are related. In this paper, we assume that the similarities of different interactions are independent of each other. The higher the value is, the larger the probability that proteins v and u belong to the same protein complex.
Protein functional similarity computation
On the other hand, from a biological perspective, gene ontology (GO) [50] is currently one of the most comprehensive ontology databases in the bioinformatics community [51]. The database provides a series of GO terms to describe gene product features. Proteins constituting a complex are likely to have similar functions. A large functional similarity means higher confidence that two proteins share similar functions. In other words, if two interacting proteins v and u have more common GO annotations and their functions are more similar, then they are more likely to belong to the same protein complex. Additionally, proteins with similar functions tend to be co-expressed [52]. Note that when the two terminal nodes v and u of an edge (v, u) do not have common GO annotations, the weight of edge (v, u) may be regarded as noise and set to 0.0. Here, we define a new measure to describe the similarity of two interacting proteins v and u based on a biological similarity function defined as follows:
where GO(v) and GO(u) represent the number of GO annotations of protein v and protein u, respectively. GO(v)∩GO(u) represents the GO annotations common to both proteins v and u. If proteins v and u share more common GO annotations, the functional score is larger. Here, we use min(GO(v),GO(u)) because some proteins are overlapping nodes. \(Average(GO)=\frac {\sum \nolimits _{i\in V,GO(i)\geqslant 1}GO(i)}{N}\) is the average number of GO annotations per protein in the whole PPIN, where N is the number of proteins for which the number of GO annotations is greater than or equal to 1. Based on this definition, if the number of GO annotations of a protein is below the average, then the number is adjusted to the average. In this way, max(min(GO(v),GO(u)),Average(GO)) penalizes the reliability of an edge (v,u) between proteins v and u that have very few GO annotations.
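The effect of the max(min(GO(v),GO(u)),Average(GO)) term can be illustrated with a small sketch. Eq. (8) itself is not reproduced in this section, so the ratio used below (shared annotations divided by that term) is an assumption assembled from the quantities the text names, not the authors' exact formula; the function names and the toy annotation sets are hypothetical.

```python
def average_go(go_terms):
    """Mean number of GO terms over proteins that have at least one annotation."""
    counts = [len(terms) for terms in go_terms.values() if len(terms) >= 1]
    return sum(counts) / len(counts) if counts else 0.0


def go_similarity(v, u, go_terms, avg=None):
    """Assumed form: |GO(v) ∩ GO(u)| / max(min(|GO(v)|, |GO(u)|), Average(GO))."""
    if avg is None:
        avg = average_go(go_terms)
    gv, gu = go_terms.get(v, set()), go_terms.get(u, set())
    shared = len(gv & gu)
    if shared == 0:
        return 0.0  # no common annotation: the text treats such an edge as noise
    return shared / max(min(len(gv), len(gu)), avg)


# Toy annotations (hypothetical): A and B are richly annotated, C and D sparsely.
go_terms = {
    "A": {"t1", "t2", "t3", "t4"},
    "B": {"t1", "t2", "t3", "t4"},
    "C": {"t1"},
    "D": {"t1"},
}
print(average_go(go_terms))               # 2.5
print(go_similarity("A", "B", go_terms))  # 1.0
print(go_similarity("C", "D", go_terms))  # 0.4
```

Both pairs share all of their annotations, but the sparsely annotated pair is divided by the network average (2.5) rather than by 1, which is the penalty the paragraph above describes.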
In this paper, SEDMTG integrates both the topological and biological information of the PPIN by using the CN and GO. CN captures the static topological information and GO assesses the functional similarity of proteins. To incorporate both measures into our method, we use the arithmetic mean as the edge weights in the PPINs. The weight of each edge between two proteins is calculated as follows:
Here,

1. Neighbors shared by two proteins in the network are called the common neighbors (CN) of Eq. (7).
2. The functional similarity of two proteins is quantified in terms of the GO annotation (GO) in Eq. (8).
The above two properties express the interaction based on CN and GO annotations. Note that the value of w(v,u) has a range between 0.0 and 1.0 and is used for evaluating the reliability of protein pairs to construct a weighted PPIN. The weights of each edge in the PPIN are obtained by integrating both topological information and biological information. Edges whose weights are 0.0 are considered to be noise and are deleted from the PPIN.
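The combination step can be sketched as follows, assuming the CN and GO scores of every edge have already been computed and lie in [0.0, 1.0]. The arithmetic mean and the removal of zero-weight edges come from the text; the function name, the dictionary layout and the example scores are illustrative assumptions.

```python
def weight_edges(edges, cn_score, go_score):
    """Return {(v, u): w} with w = (CN + GO) / 2, dropping edges whose weight is 0.0."""
    weighted = {}
    for v, u in edges:
        w = (cn_score[(v, u)] + go_score[(v, u)]) / 2.0
        if w > 0.0:  # zero-weight edges are treated as noise and deleted
            weighted[(v, u)] = w
    return weighted


# Hypothetical scores for three interactions.
edges = [("A", "B"), ("B", "C"), ("C", "D")]
cn_score = {("A", "B"): 0.5, ("B", "C"): 0.0, ("C", "D"): 0.2}
go_score = {("A", "B"): 1.0, ("B", "C"): 0.0, ("C", "D"): 0.4}

weighted = weight_edges(edges, cn_score, go_score)
print(sorted(weighted))  # [('A', 'B'), ('C', 'D')]
```

Because both inputs are bounded by 1.0, the mean stays in the stated range, and the edge ("B", "C") disappears because neither source of evidence supports it.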
Generation of a seed queue and formation of the initial cluster
Choosing high-quality protein seeds for expansion is critical. Each cluster starts from an initial cluster that consists of a single node, generally called the seed node. An inappropriate choice of a seed node will likely affect the process of detecting protein complexes. For example, a low-quality seed node may result in a false positive protein complex being detected. Furthermore, if a protein that belongs to multiple complexes is chosen as a seed node, the resulting identified complex may subsume the multiple complexes under an unrealistically large false protein complex that cannot match any real protein complex [36]. From a topological perspective, the central part of a protein complex often corresponds to a dense subgraph with a high clustering coefficient and more reliable weights in the PPINs [29–31,46,53].
According to the preliminaries section, a confidence score 0≤w_{v,u}≤1.0 has been assigned to every edge (v,u)∈E. We utilize several measures to select seed nodes. For each node v in the PPIN, we define its weight degree, d_{w}(v), as the sum of all its edge weight values:
For each node v, the neighborhood graph, which consists of v, all its neighbors and the edges among them, is defined as G_{v}=(V_{v},E_{v}), where V_{v}={v}∪{u|u∈V,(v,u)∈E} and E_{v}={(u_{i},u_{j})|(u_{i},u_{j})∈E,u_{i},u_{j}∈V_{v}}. Furthermore, the neighborhood graph clustering coefficient (NGCC) is the sum of the weights of the edges divided by the total number of possible edges. Thus, for a node v, the NGCC is defined in Eq. (11) [54]:
Here, |V_{v}| is the degree of node v, \(\sum \nolimits _{v,u\in V_{v}}w(v,u)\) is the sum of the weights of the edges, and \(\frac {|V_{v}|\ast (|V_{v}|-1)}{2}\) is the total number of triangles that could pass through node v. The NGCC reflects the weighted degree of aggregation of proteins in the PPINs. Note that the NGCC is a measure of the closeness of the node v and its neighbors, which varies from 0.0 to 1.0.
We devise the following score function to sort all proteins in a PPIN. If a protein has a higher score according to Eq. (12), it is more likely to be used as the seed node, to be inside a protein complex, and to have high centrality in the complexes. Thus, the score of each protein v is defined as the product of its neighborhood graph clustering coefficient and its weight degree, as given in Eq. (12):
The seed score function takes both weight degree centrality and neighborhood graph density into consideration when prioritizing proteins as seeds. We sort all proteins in the PPIN and use a queue SQ to record the order. We select the protein with the highest score according to Eq. (12) as the seed node to grow a detected protein complex. Once a new detected protein complex is generated, all nodes in it are recorded in a list, and we choose the highest-scoring node in the queue SQ that has not been visited as the next seed node. Note that we calculate the score of each protein only once based on the PPIN, which is more biologically meaningful [30].
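The seed ranking can be sketched as follows. The weight degree (sum of incident edge weights) and the score (NGCC times weight degree) follow the verbal definitions above; for the NGCC the sketch assumes the "sum of edge weights in the neighborhood graph divided by the number of possible edges" reading, with the neighborhood graph including v itself. The adjacency-dictionary layout, the function names and the toy network are hypothetical.

```python
def weighted_degree(adj, v):
    """d_w(v): sum of the weights of all edges incident to v."""
    return sum(adj[v].values())


def ngcc(adj, v):
    """Weight inside the neighborhood graph of v over the number of possible edges."""
    nodes = set(adj[v]) | {v}
    n = len(nodes)
    if n < 2:
        return 0.0
    # Every undirected edge is stored in both directions, so halve the double count.
    total = sum(w for a in nodes for b, w in adj[a].items() if b in nodes) / 2.0
    return total / (n * (n - 1) / 2.0)


def seed_queue(adj):
    """Score every protein once and return the proteins in non-increasing score order."""
    score = {v: ngcc(adj, v) * weighted_degree(adj, v) for v in adj}
    queue = sorted(adj, key=lambda v: score[v], reverse=True)
    return queue, score


# Toy weighted PPIN: a triangle A-B-C plus a weakly attached protein D.
adj = {
    "A": {"B": 1.0, "C": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"A": 1.0, "B": 1.0, "D": 0.5},
    "D": {"C": 0.5},
}
queue, score = seed_queue(adj)
print(queue)  # ['A', 'B', 'C', 'D']
```

In this toy network C has the largest weight degree, but its neighborhood is diluted by D, so the two proteins that sit entirely inside the triangle are ranked first.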
Definition of a protein complex model
As mentioned in the Background section, several protein complex identification algorithms have been presented. Most existing algorithms make many assumptions to define a subgraph of possible protein complexes in the PPINs. However, in terms of the actual performance of these algorithms, the graphs with high density or high modularity in PPINs generally correspond to protein complexes [29,35]. In fact, a dense graph could have low modularity, and a graph with high modularity may have low density. Therefore, the density-based algorithms ignore protein complexes with low density and the modularity-based algorithms miss protein complexes with low modularity. Overall, these methods have limitations when identifying protein complexes with various densities and modularities [46]. To overcome these limitations, we define a new protein complex model to detect protein complexes by considering both density and modularity in the PPINs. We begin by presenting some related definitions.
According to the preliminaries section, for an undirected weighted subgraph SG, its density is denoted as D_{SG}:
where \(\sum \nolimits _{u,v\in SG} w_{u,v}\) is the sum of the weights of the edges contained in subgraph SG, and |SG| represents the size of the subgraph SG. The density of a graph measures how close the graph is to a clique and takes values between 0.0 and 1.0.
For the subgraph SG⊆G, its weighted indegree, denoted as \(d_{w}^{in}(SG)\), is the sum of the weights of all edges belonging to SG, and its weighted outdegree, denoted as \(d_{w}^{out}(SG)\), is the sum of the weights of the edges connecting the nodes in SG to the nodes in the rest of graph G. \(d_{w}^{in}(SG)\) and \(d_{w}^{out}(SG)\) can be obtained as follows [46]:
Clearly, the weighted degree of d_{w}(SG) is equal to the sum of \(d_{w}^{in}(SG)\) and \(d_{w}^{out}(SG)\).
The modularity M_{SG} of a subgraph SG⊆G is defined as follows:
Obviously, M_{SG} takes values from 0.0 to 1.0. If a subgraph has higher modularity, it has more connections within itself and fewer connections to the rest of the PPIN. A subgraph with a modularity of 1.0 has no connections with the rest of the PPIN.
In this model, in the process of identifying protein complexes, we measure the quality of SG by considering its density (D_{SG}) and modularity (M_{SG}). D_{SG} describes the density of subgraph SG, M_{SG} describes the modularity of subgraph SG and \(\sqrt {D_{{SG}}*M_{{SG}}}\) describes a subgraph with both high density and high modularity. Here, to give this combined measure the same value range as the density and modularity, i.e., [0.0,1.0], the product D_{SG}∗M_{SG} is normalized by taking the geometric mean of D_{SG} and M_{SG}. The fitness of a subgraph SG in an undirected weighted graph G, denoted as F(SG), is defined as:
Generally, as the subgraph SG expands, its modularity increases and its density decreases. Thus, by expanding from a node, we can obtain a subgraph with the local maximum fitness score and output the result as a protein complex. Thus, this new model can be used for identifying protein complexes with different topology, including high density but low modularity, high modularity but low density, and high density and high modularity. Therefore, our model can identify the protein complexes with various densities and modularities.
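The trade-off between density and modularity can be sketched on a toy network. The fitness as the geometric mean of density and modularity follows the text; the concrete forms used below for density (internal weight over the number of possible edges) and modularity (internal weight over internal plus outgoing weight, each internal edge counted once) are assumptions consistent with the stated ranges and with the remark that a modularity of 1.0 means no outside connections. Function names and the network are hypothetical.

```python
import math


def inner_weight(adj, sg):
    """Sum of the weights of edges with both endpoints in sg, each edge counted once."""
    return sum(w for v in sg for u, w in adj[v].items() if u in sg) / 2.0


def density(adj, sg):
    n = len(sg)
    if n < 2:
        return 0.0
    return inner_weight(adj, sg) / (n * (n - 1) / 2.0)


def modularity(adj, sg):
    d_in = inner_weight(adj, sg)
    d_out = sum(w for v in sg for u, w in adj[v].items() if u not in sg)
    return d_in / (d_in + d_out) if d_in + d_out > 0 else 0.0


def fitness(adj, sg):
    """F(SG): geometric mean of density and modularity."""
    return math.sqrt(density(adj, sg) * modularity(adj, sg))


# Toy weighted PPIN: a triangle A-B-C plus a weakly attached protein D.
adj = {
    "A": {"B": 1.0, "C": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"A": 1.0, "B": 1.0, "D": 0.5},
    "D": {"C": 0.5},
}
triangle = {"A", "B", "C"}
everything = {"A", "B", "C", "D"}
print(round(fitness(adj, triangle), 3))
print(round(fitness(adj, everything), 3))
```

Adding D raises the modularity to 1.0 but lowers the density more than it helps, so under these assumed definitions the triangle keeps the higher fitness.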
Extending and correcting the cluster to generate a locally optimal subgraph
Determining the priority of boundary nodes
An initial cluster (SG) starts as a single protein and then grows and shrinks gradually as proteins are added and removed one by one. Proteins are added from the neighbors of SG, denoted as Neighbor(SG), and proteins are removed from the inner nodes of SG, denoted as inner_nodes(SG). We first define these two concepts: if p∈Neighbor(SG), the neighbor node p is connected by at least one edge to a protein of cluster SG but does not belong to SG; if p∈inner_nodes(SG), the inner node p belongs to SG but is connected to at least one node that is a neighbor of SG. A key problem is to decide the priority with which proteins are added to and removed from SG. In general, if a protein v belongs to SG, it may have a strong connection with its cluster SG=(V_{SG},E_{SG}). Therefore, if the protein v is added to SG, it could increase the average of the weighted interactions within SG. Conversely, removing a weakly connected protein v from SG could also increase the average of the weighted interactions within SG. Here, we introduce a measure to assess the priority, denoted as weight_{avg}(SG), which is defined as:
where weight_{avg}(SG) is the average of the weighted interactions of all proteins within SG, |V_{SG}| is the number of proteins in SG and \(\sum \nolimits _{(v,u)\in E_{{SG}}} weight(v,u)\) represents the total weight of the interactions in SG. The priority of adding a node p∈Neighbor(SG) to the cluster SG, or of deleting a node p∈inner_nodes(SG) from the cluster SG, is determined by the value of weight_{avg}(SG). We choose the boundary node with the highest weight_{avg}(SG) to add to SG or remove from SG so as to maximize the value of F(SG) (see Eq. (17)).
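The priority rule for additions can be sketched as follows, assuming weight_{avg} is the total internal edge weight divided by the number of proteins, as described above, and that a candidate is ranked by the value the cluster would have after adding it. The function names and the toy network are hypothetical.

```python
def weight_avg(adj, sg):
    """Total weight of edges inside sg divided by the number of proteins in sg."""
    if not sg:
        return 0.0
    total = sum(w for v in sg for u, w in adj[v].items() if u in sg) / 2.0
    return total / len(sg)


def neighbors_of(adj, sg):
    """Proteins outside sg that share at least one edge with a member of sg."""
    return {u for v in sg for u in adj[v]} - set(sg)


def best_candidate(adj, sg):
    """Neighbor whose addition gives the largest weight_avg, or None if there is none."""
    candidates = neighbors_of(adj, sg)
    if not candidates:
        return None
    return max(candidates, key=lambda p: weight_avg(adj, set(sg) | {p}))


# Toy weighted PPIN: C is tied strongly to both A and B, E only weakly to A.
adj = {
    "A": {"B": 1.0, "C": 1.0, "E": 0.2},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"A": 1.0, "B": 1.0},
    "E": {"A": 0.2},
}
print(best_candidate(adj, {"A", "B"}))  # C
```

Here adding C raises the average from 0.5 to 1.0, whereas adding E would lower it to 0.4, so C is examined first.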
Extending and correcting estimation
For a cluster SG, in the extending step, we first obtain all its neighbors, namely Neighbors(SG). The priority of each neighbor is determined by the value of weight_{avg}(SG) (see Eq. (18)). Whether the highest priority protein v is added to SG is determined by two tests: whether the fitness F(SG) of SG increases after v is added, and whether the actual number of edges between v and SG, denoted as |SG∩N(v)|, which is the number of proteins in SG connected with v, is greater than the expected number of edges, denoted as F(SG)∗|SG|, where F(SG) is the fitness of SG and |SG| is the number of proteins in SG. Once the highest priority protein v is added to SG, SG is updated and v is removed from Neighbors(SG). Then the priorities of Neighbors(SG) and the fitness F(SG) of SG are recalculated, the next highest priority protein is tested, and so on. If the highest priority protein v fails either of the two tests, SG cannot be further extended.
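The two acceptance tests of the extending step can be written as a single predicate. This is a minimal sketch: the function and parameter names are hypothetical, and it assumes the expected number of edges is computed from the fitness of the cluster before the candidate is added.

```python
def passes_extension_tests(fitness_before, fitness_after, sg_size, links_to_sg):
    """Return True when a candidate may be added to the cluster.

    fitness_before: F(SG) for the current cluster
    fitness_after:  F(SG + {v}) with the candidate included
    sg_size:        number of proteins currently in the cluster
    links_to_sg:    number of cluster members adjacent to the candidate
    """
    expected_edges = fitness_before * sg_size
    return fitness_after > fitness_before and links_to_sg > expected_edges


# Hypothetical numbers: a 4-protein cluster with fitness 0.60 expects 2.4 links,
# so a candidate needs at least 3 neighbors inside the cluster.
print(passes_extension_tests(0.60, 0.65, 4, 3))  # True
print(passes_extension_tests(0.60, 0.65, 4, 2))  # False: too few links
print(passes_extension_tests(0.60, 0.55, 4, 3))  # False: fitness would decrease
```

The second test becomes stricter as the cluster grows or becomes fitter, which limits the attachment of proteins that touch only a small part of an already cohesive cluster.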
For a cluster SG, in the correcting step, we first obtain all inner nodes, namely Inner_nodes(SG). The priority of each protein in Inner_nodes(SG) is determined by the value of weight_{avg}(SG) (see Eq. (18)). Whether the highest priority protein v is deleted from SG is determined by two tests: whether the fitness of the cluster SG−{v} increases after v is removed from SG, and whether the actual number of edges between v and SG−{v}, denoted as |(SG−{v})∩N(v)|, which represents the number of proteins in SG−{v} connected with v, is greater than the expected number of edges, denoted as F(SG)∗|SG|, where F(SG) is the fitness of SG and |SG| is the number of proteins in SG. Once the highest priority protein v is removed from SG, the cluster SG is updated and v is removed from Inner_nodes(SG). Then the priorities of Inner_nodes(SG) and the fitness of the cluster SG−{v} are recalculated, the next highest priority protein is tested, and so on. If the highest priority protein v fails either of the two tests, the cluster SG cannot be further corrected.
Obtaining a list of identified protein complexes.
On the basis of the quantitative description of protein complexes, we develop a novel clustering algorithm based on density and modularity with network topology and GO annotations, named SEDMTG, to identify protein complexes in a weighted PPIN whose edge weights reflect the reliability of the edge in a protein complex according to topological and biological information.
The input of the SEDMTG algorithm is a PPIN, which is described as a simple undirected graph G(V,E) with GO annotations. The SEDMTG algorithm broadly consists of four phases. First, SEDMTG constructs a weighted PPIN based on topological and biological information in lines 2-11. Second, SEDMTG calculates the scores of all nodes and selects the node with the maximum score as the seed in lines 12-18. Third, starting from the seed node, a greedy procedure adds nodes to or removes nodes from the cluster SG to obtain a subgraph with high graph fitness. The growth process is repeated from different seeds to form multiple, possibly overlapping subgraphs in lines 19-49. Once a new cluster is completed, all nodes in this cluster SG are recorded to prevent them from being used as seed nodes. Then, we select the next seed node from those remaining in the queue SQ to generate the next cluster SG in lines 41-45. Moreover, we discard candidate complexes whose size is less than 3 [35] and remove unreliable candidate complexes in lines 38-46. Finally, we discard redundant protein complexes in lines 50-55. A detailed description of the SEDMTG algorithm is shown in Algorithm 1.
In the first step, we assign a weight to each edge based on common neighbors and gene ontology data (lines 2∼11).
In the second step, SEDMTG calculates the score of each node (lines 12∼17). Furthermore, all the nodes in network G are queued into SQ in non-increasing order of Score(v) (line 18).
In the third step, we choose the node with the highest Score(v) that has not yet been visited before to bring it up (lines 19 âˆ¼29). The key idea of this step is that any neighbors of the current subgraph SG that make a positive contribution to F(SG) will be added to SG or removed from SG (line 37). The description of iterative generation of a complex is shown in Algorithm 2. Algorithm 2 has two subphases, and we can gradually add neighbors to cluster SG or remove inner nodes from cluster SG. As for the priority of candidate nodes is based on (see Eq. (18)) and two conditions. More details are introduced in the section on extending and correcting the cluster to generate a locally optimal subgraph.
Next the stepbystep procedure of step 3 is given in Algorithm 2.
In the first phase (lines 3–25), after obtaining a seed protein, we first get the external boundary protein set that consists of the neighbors of SG, called Neighbor(SG), in lines 4–5. Then, we calculate the graph fitness of SG in line 8. Next, we find the neighbor protein in Neighbor(SG) with the highest priority, i.e., the protein p that maximizes the value of weight_{avg}(SG+{p}), in lines 7–14; we call this protein node_max. We then calculate the fitness of graph SG+{p} in line 15, and Expectation_edges is calculated as the graph fitness of SG × the size of SG in line 16. Meanwhile, we also calculate the value of Actually_edges, which is the size of the intersection set between Neighbor(node_max) and SG, denoted as |Neighbor(node_max)∩SG|, in line 18. If adding node_max increases the value of F(SG) and Actually_edges is larger than Expectation_edges, then we add node_max to SG and remove it from Neighbor(SG) in lines 19–24. We continually check the next highest-priority node in Neighbor(SG) and judge whether the node can be added to SG in lines 6–25. The iterative addition of neighbors of SG terminates when either of the two conditions in line 19 is not satisfied or when no remaining neighbor node can be added to SG (line 6).
In the second phase (lines 26–57), SEDMTG allows the removal of any inner nodes in cluster SG to maximize the value of F(SG). We first find the inner nodes that have edges with nodes that are not in SG, denoted as Inner_node(SG), in lines 27–34, and then we test whether each node in Inner_node(SG) can be removed from SG in lines 35–57. We first find the highest-priority node according to Eq. (18) in lines 36–43. Meanwhile, we calculate the graph fitness F(SG−{p}) of SG−{p} in line 44. Similarly, we calculate the values of Expectation_edges and Actually_edges in lines 45–47. If the two conditions in line 48 are satisfied, we remove the node from SG and Inner_node(SG) in lines 49–50; otherwise, the second phase is terminated in lines 51–57.
In Algorithm 2, the key idea is to iteratively add the highest-priority node in Neighbor(SG) to the cluster SG or remove the highest-priority node in Inner_node(SG) from the cluster SG to maximize the value of graph fitness F(SG) in lines 2–59. This growth process is repeated until the current cluster SG no longer changes and is a locally optimal subgraph in line 59; then, the detected protein complex is returned to Algorithm 1 in line 37.
After we obtain a detected complex SG by using Algorithm 2 in line 37, we discard fake protein complexes and complexes whose size is less than 3 [35] in line 39. Each remaining detected complex SG is saved in line 40. Meanwhile, SEDMTG records the nodes in SG in lines 41–45 and selects the next seed node from the nodes in seed queue SQ that have not been included in any of the detected complexes found thus far. The next node with the highest score is selected as the seed (lines 31–35). We repeat the above key operations on the PPIN to identify the remaining candidate protein complexes until no seed nodes remain in seed queue SQ (lines 31–49). Note that when this process is repeated, the nodes in the previously generated protein complexes remain in the PPIN; therefore, SEDMTG is able to generate overlapping complexes.
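The seed-driven outer loop described above can be sketched as follows. This is a simplified illustration, not Algorithm 1 itself: `grow` is a hypothetical callable standing in for Algorithm 2, `scores` stands in for Score(v), and the filtering of unreliable ("fake") complexes and the final redundancy removal are omitted because their criteria are not given in this passage. The sketch also assumes that only nodes of saved complexes are recorded as used seeds.

```python
from typing import Callable, Dict, List, Set

Graph = Dict[str, Set[str]]  # adjacency sets of an undirected graph


def detect_complexes(
    graph: Graph,
    scores: Dict[str, float],
    grow: Callable[[Graph, str], Set[str]],
    min_size: int = 3,
) -> List[Set[str]]:
    """Grow one cluster per unused seed, taking seeds in nonincreasing score order."""
    # Seed queue SQ; ties are broken by node name so the order is deterministic.
    seed_queue = sorted(graph, key=lambda v: (-scores[v], v))
    used_as_member: Set[str] = set()   # nodes already in a saved complex
    complexes: List[Set[str]] = []
    for seed in seed_queue:
        if seed in used_as_member:
            continue                   # covered nodes are not reused as seeds
        sg = grow(graph, seed)         # the whole graph stays available to grow()
        if len(sg) < min_size:
            continue                   # discard complexes with fewer than 3 proteins
        complexes.append(sg)
        used_as_member |= sg
    return complexes


def closed_neighborhood(graph: Graph, seed: str) -> Set[str]:
    """Hypothetical stand-in for Algorithm 2: the seed plus its direct neighbors."""
    return {seed} | graph[seed]


# Toy network: two triangles sharing node c, plus an isolated pair f-g.
g: Graph = {
    "a": {"b", "c"}, "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"c", "e"}, "e": {"c", "d"},
    "f": {"g"}, "g": {"f"},
}
toy_scores = {"a": 5.0, "d": 4.0, "c": 3.0, "b": 2.0, "e": 1.0, "f": 0.5, "g": 0.4}
found = detect_complexes(g, toy_scores, closed_neighborhood)
print([sorted(c) for c in found])  # [['a', 'b', 'c'], ['c', 'd', 'e']]
```

Because nodes are never deleted from the graph, node c can appear in both detected clusters, which is how overlapping complexes arise. The pair f-g is grown but discarded by the size filter.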
Finally, SEDMTG outputs all identified protein complexes in line 56.
Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding literature and datasets.
Abbreviations
BP: Biological process
CC: Cellular component
ClusterONE: Clustering with overlapping neighborhood expansion
CMC: Clustering based on maximal cliques
CN: Common neighbors
Co-IP: Co-immunoprecipitation
GO: Gene ontology
MCL: Markov clustering
MCODE: Molecular complex identification
MF: Molecular function
PPINs: Protein-protein interaction networks
SEDMTG: A seed-extended algorithm for detecting protein complexes based on density and modularity with topological structure and GO annotations
SQ: Seed queue
TAP-MS: Tandem affinity purification with mass spectrometry
References
Victor S, Mirny LA. Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci. 2003; 100:12123–8.
Yu H, Paccanaro A, Trifonov V, Gerstein M. Predicting interactions in protein networks by completing defective cliques. Bioinformatics. 2006; 22:823–9.
Kasper L, E Olof K, Størling ZM, Olason PI, Pedersen AG, Olga R, Hinsby AM, Zeynep T, Flemming P, Niels T. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol. 2007; 25:309.
Safari-Alighiarloo N, Taghizadeh M, Rezaei-Tavirani M, Goliaei B, Peyvandi AA. Protein-protein interaction networks (PPI) and complex diseases. Gastroenterol Hepatol Bed Bench. 2014; 7:17–31.
Chen Y, Jacquemin T, Zhang S, Jiang R. Prioritizing protein complexes implicated in human diseases by network optimization. BMC Syst Biol. 2014; 8:2.
Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R. Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol. 2010; 6:1000641.
Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P. A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature. 2000; 403:623.
Yuen H, Albrecht G, Adrian H, Bader GD, Lynda M, SallyLin A, Anna M, Paul T, Keiryn B, Kelly B. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002; 415:180.
Zhu H, Bilgin M, Bangham R, Hall D, Casamayor A, Bertone P, Lan N, Jansen R, Bidlingmaier S, Houfek T. Global analysis of protein activities using proteome chips. Science. 2001; 293:2101–5.
Zhao J, Hu X, He T, Li P, Zhang M, Shen X. An edge-based protein complex identification algorithm with gene co-expression data (PCIA-GeCo). IEEE Trans Nanobiosci. 2014; 13:80–8.
Hart GT, Ramani AK, Marcotte EM. How complete are current yeast and human protein-interaction networks? Genome Biol. 2006; 7:1–9.
Nesvizhskii AI. Computational and informatics strategies for identification of specific protein interaction partners in affinity purification mass spectrometry experiments. Proteomics. 2012; 12:1639–55.
Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA. 2001; 98:4569–74.
Anne-Claude G, Patrick A, Paola G, Roland K, Markus B, Martina M, Christina R, Lars Juhl J, Sonja B, Birgit D. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006; 440:631.
Krogan NJ, Gerard C, Haiyuan Y, Gouqing Z, Xinghua G, Alexandr I, Joyce L, Shuye P, Nira D, Tikuisis AP. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006; 440:637.
Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M, Séraphin B. A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol. 1999; 17:1030–2.
Gentz R, Rauscher FJ, Abate C, Curran T. Parallel association of Fos and Jun leucine zippers juxtaposes DNA binding domains. Science. 1989; 243:1695–9.
Nobumasa T, Taisuke T, Ikuo H, Makiko T, Manabu N, Yasuko T, Gopal T, Takeshi I. The role of presenilin cofactors in the γ-secretase complex. Nature. 2003; 422:438–41.
Trevor C, Eivind H. From proteomes to complexomes in the era of systems biology. Proteomics. 2014; 14:24–41.
Chien CT, Bartel PL, Sternglanz R, Fields S. The two-hybrid system: a method to identify and clone genes for proteins that interact with a protein of interest. Proc Natl Acad Sci. 1991; 88:9578–82.
Hartwell LH, Hopfield JJ, Leibler S, Murray AW. From molecular to modular cell biology. Nature. 1999; 402:47–52.
Barabasi AL, Oltvai ZN. Network biology: understanding the cell's functional organization. Nat Rev Genet. 2004; 5:101.
Jianxin W, Xiaoqing P, Min L, Yi P. Construction and application of dynamic protein interaction network based on time course gene expression data. Proteomics. 2013; 13:301–12.
Jianxin W, Xiaoqing P, Min L, Yi P. CPredictor3.0: detecting protein complexes from PPI networks with expression data and functional annotations. BMC Syst Biol. 2017; 11:135.
Jain AK, Dubes RC. Algorithms for clustering data. Technometrics. 1988; 32:227–9.
Adamcsek B, Palla G, Farkas IJ, Derényi I, Vicsek T. CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006; 22:1021–3.
Liu G, Wong L, Chua HN. Complex discovery from weighted PPI networks. Bioinformatics. 2009; 25:1891–7.
Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003; 4:2.
Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S. Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics. 2006; 7:1–13.
Li M, Chen JE, Wang JX, Hu B, Chen G. Modifying the DPClus algorithm for identifying protein complexes based on new topological structures. BMC Bioinformatics. 2008; 9(1):398.
Jiang P, Singh M. SPICi: a fast clustering algorithm for large biological networks. Bioinformatics. 2010; 26(8):1105–11.
Cho YR, Hwang W, Ramanathan M, Zhang A. A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics. 2009; 10:169.
Peng W, Wang J, Zhao B, Wang L. Identification of protein complexes using weighted PageRank-Nibble algorithm and core-attachment structure. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2015; 12(1):179–92.
Van Dongen S. Graph Clustering by Flow Simulation. University of Utrecht: Amsterdam, PhD Thesis. 2000.
Nepusz T, Yu H, Paccanaro A. Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods. 2012; 9:471.
Wang R, Liu G, Wang C, Su L, Sun L. Predicting overlapping protein complexes based on core-attachment and a local modularity structure. BMC Bioinformatics. 2018; 19:305.
Bhowmick SS, Seah BS. Clustering and summarizing protein-protein interaction networks: A survey. IEEE Trans Knowl Data Eng. 2016; 28:638–58.
Newman ME. Modularity and community structure in networks. Proc Natl Acad Sci. 2006; 103:8577–82.
Li M, Wang J, Chen J. A fast agglomerate algorithm for mining functional modules in protein interaction networks. In: 2008 International Conference on Biomedical Engineering and Informatics. IEEE: 2008. p. 3–7.
Li M, Wang J, Chen J, Pan Y. Hierarchical organization of functional modules in weighted protein interaction networks using clustering coefficient. Berlin, Heidelberg: Springer; 2009. p. 75–86.
Wang J, Li M, Chen J, Pan Y. A fast hierarchical clustering algorithm for functional modules discovery in protein interaction networks. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2011; 8:607–20.
Cho YR, Hwang W, Ramanathan M, Zhang A. Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics. 2007; 8:265.
Liu Q, Song J, Li J. Using contrast patterns between true complexes and random subgraphs in PPI networks to predict unknown protein complexes. Sci Rep. 2016; 6:21223.
Liu Q, Song J, Li J. Classification and feature selection techniques in data mining. Int J Eng Res Technol (IJERT). 2012; 1:1–6.
Liu X, Yang Z, Zhou Z, Sun Y, Lin H, Wang J, Xu B. The impact of protein interaction networks' characteristics on computational complex detection methods. J Theoret Biol. 2018; 439:141–51.
Ren J, Wang J, Li M, Wang L. Identifying protein complexes based on density and modularity in protein-protein interaction network. BMC Syst Biol. 2013; 7:12.
Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D. Defining and identifying communities in networks. Proc Natl Acad Sci. 2004; 101:2658–63.
Zhao B, Wang J, Li M, Wu FX, Pan Y. Detecting protein complexes based on uncertain graph model. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2014; 11(3):486–97.
Zhang Y, Lin H, Yang Z, Wang J, Liu Y. An uncertain model-based approach for identifying dynamic protein complexes in uncertain protein-protein interaction networks. BMC Genomics. 2017; 18(7):743.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000; 25:25.
Consortium GO. The Gene Ontology (GO) project in 2006. Nucleic Acids Res. 2006; 34:322–6.
Lei X, Jie Z, Fujita H, Zhang A. Predicting essential proteins based on RNA-seq, subcellular localization and GO annotation datasets. Knowl-Based Syst. 2018; 151:095070511830159.
Liu X, Yang Z, Zhou Z, Sun Y, Lin H, Wang J, Xu B. Dynamic protein interaction network construction and applications. Proteomics. 2014; 14:338–52.
Watts DJ, Strogatz SH. Collective dynamics of 'small-world' networks. Nature. 1998; 393:440.
Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D. DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002; 30:303–5.
Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002; 415:180.
Gavin AC, Bösche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002; 415:141.
Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D. DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002; 30:303–5.
Keshava Prasad T, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al. Human protein reference database–2009 update. Nucleic Acids Res. 2008; 37:767–72.
Chatr-Aryamontri A, Breitkreutz BJ, Heinicke S, Boucher L, Winter A, Stark C, Nixon J, Ramage L, Kolas N, O'Donnell L, et al. The BioGRID interaction database: 2013 update. Nucleic Acids Res. 2012; 41(D1):816–23.
Ma CY, Chen YPP, Berger B, Liao CS. Identification of protein complexes by integrating multiple alignment of protein interaction networks. Bioinformatics. 2017; 33(11):1681–8.
Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006; 34(suppl_1):535–9.
Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2008; 37:825–31.
Hong EL, Balakrishnan R, Dong Q, Christie KR, Park J, Binkley G, Costanzo MC, Dwight SS, Engel SR, Fisk DG, et al. Gene ontology annotations at SGD: new data sources and annotation methods. Nucleic Acids Res. 2007; 36:577–81.
Mewes HW, Amid C, Arnold R, Frishman D, Güldener U, Mannhaupt G, Münsterkötter M, Pagel P, Strack N, Stümpflen V, et al. MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 2004; 32:41–4.
Aloy P, Bottcher B, Ceulemans H, Leutwein C, Mellwig C, Fischer S, Gavin AC, Bork P, Superti-Furga G, Serrano L, et al. Structure-based assembly of protein complexes in yeast. Science. 2004; 303:2026–9.
Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, Issel-Tarver L, Schroeder M, Sherlock G, et al. Saccharomyces genome database (SGD) provides secondary gene annotation using the gene ontology (GO). Nucleic Acids Res. 2000; 30:69–72.
Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes HW. CORUM: the comprehensive resource of mammalian protein complexes–2009. Nucleic Acids Res. 2009; 38(suppl_1):497–501.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000; 25(1):25.
Luc PV, Tempst P. PINdb: a database of nuclear protein complexes from human and yeast. Bioinformatics. 2004; 20(9):1413–5.
Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2011; 40(D1):109–14.
Consortium U. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2018; 47(D1):506–15.
Luo J, Li G, Song D, Liang C. Integrating functional and topological properties to identify biological network motif in protein interaction networks. J Comput Theoret Nanosci. 2014; 11:744–50.
Xu B, Guan J. From function to interaction: A new paradigm for accurately predicting protein complexes based on protein-to-protein interaction networks. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2014; 11:616–27.
Cai B, Wang H, Zheng H, Wang H. Integrating domain similarity to improve protein complexes identification in TAP-MS data. Proteome Sci. 2013; 11(1):2.
Song J, Singh M. How and when should interactome-derived clusters be used to predict functional modules and protein function? Bioinformatics. 2009; 25(23):3143–50.
Zhang XF, Dai DQ, Li XX. Protein complexes discovery based on protein-protein interaction data via a regularized sparse generative network model. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2012; 9(3):857–70.
Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G. GO::TermFinder–open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics. 2004; 20(18):3710–5.
Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G. GO::TermFinder–open source software for accessing gene ontology information and finding significantly enriched gene ontology terms. Bioinformatics. 2004; 20:3710–5.
Acknowledgements
The authors would like to thank Wu Min, Tamás Nepusz, Guimei Liu and Eileen Marie Hanna for providing codes and datasets.
Funding
This work was supported by the National Natural Science Foundation of China (61772226, 61373051 and 61502343), the Interdisciplinary Research Funding Program for Doctoral Candidates of Jilin University (Grant No. 10183201835) and the Key Laboratory for Symbol Computation and Knowledge Engineering of the National Education Ministry of China. The funding agencies played no role in the design of the study, in the collection, analysis, or interpretation of data, or in writing the manuscript.
Author information
Contributions
RW conceived and designed the study and drafted the manuscript. CW participated in the design and discussion of the research and helped to carefully revise the English of the manuscript. LS provided technical implementation assistance. GL participated in the design and coordination of the study and exercised general supervision. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
†Equal contributors
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Wang, R., Wang, C., Sun, L. et al. A seed-extended algorithm for detecting protein complexes based on density and modularity with topological structure and GO annotations. BMC Genomics 20, 637 (2019). https://doi.org/10.1186/s12864-019-5956-y
DOI: https://doi.org/10.1186/s12864-019-5956-y
Keywords
 Graph clustering algorithms
 Protein complex
 Protein-protein interaction networks
 Density
 Modularity
 Functional properties