A seed-extended algorithm for detecting protein complexes based on density and modularity with topological structure and GO annotations

Background: The detection of protein complexes is of great significance for researching mechanisms underlying complex diseases and developing new drugs. Thus, various computational algorithms have been proposed for protein complex detection. However, most of these methods are based only on topological information and are sensitive to the reliability of interactions. As a result, their performance is affected by false-positive interactions in PPINs. Moreover, these methods consider only density or modularity and ignore protein complexes with various densities and modularities.

Results: To address these challenges, we propose a Seed-Extended algorithm based on Density and Modularity with Topological structure and GO annotations, named SE-DMTG, to improve the accuracy of protein complex detection in PPINs. First, we use common neighbors and GO annotations to construct a weighted PPIN. Second, we define a new seed selection strategy to select seed nodes. Third, we design a new fitness function to detect protein complexes with various densities and modularities. We compare the performance of SE-DMTG with that of thirteen state-of-the-art algorithms on several real datasets.

Conclusion: The experimental results show that SE-DMTG not only outperforms some classical algorithms on yeast PPINs in terms of the F-measure and Jaccard but also achieves good performance in terms of functional enrichment. Furthermore, we apply SE-DMTG to PPINs of several other species and demonstrate its accuracy and matching ratio in detecting protein complexes compared with other algorithms.


Background
A protein complex is a group of proteins that interact with each other to perform different cellular functions [1]. The detection of protein complexes from protein-protein interaction networks (PPINs) plays an important role in understanding cell function in the proteomics era. Specifically, protein complexes contribute to the study of protein interaction networks [2], protein function, and diseases [3]. They help researchers study the causes of various diseases and develop new drugs, and they are helpful for analyzing the different stages of diseases [4]. Current studies have shown that disease genes tend to be highly connected among themselves in disease networks. These highly connected subgraphs could be disease protein complexes, and investigating the causes and effects of these complexes in disease networks could narrow the search space for bioinformaticians, enhance the analysis process [5,6] and help medical researchers design new drugs. As a result, the detection of protein complexes plays an indispensable role in the study of complex diseases.
During the past decade, owing to the development of high-throughput techniques such as yeast two-hybrid [7], mass spectrometry [8], and protein chip technologies [9], the number of available PPINs has increased rapidly, and these networks have been collected in different public databases. In general, protein-protein interactions (PPIs) can be naturally represented in the form of a network, which not only provides a panoramic view of PPIs on a proteomic scale but also helps us understand the basic organization of the cell machinery based on the whole network. How to use PPINs to analyze biological systems remains a meaningful task [10]. Although most PPINs are incomplete and inaccurate [11,12], they reveal biological processes and inherent organizational structures within cells [13][14][15]. How to accurately discover protein complexes is a main subject in biology and bioinformatics. In biology, several experimental methods have been designed to detect protein complexes, including TAP-MS [16], Co-IP [17][18][19] and the two-hybrid system [13,20]. However, experimental methods have their own shortcomings; for example, they are time-consuming, relatively expensive and inefficient. Thus, using computational algorithms to improve the effectiveness of protein complex detection in PPINs is appealing.
To overcome these experimental constraints, various computational methods have been developed to improve the effectiveness of protein complex detection in PPINs. Some researchers have shown that a protein complex in a PPIN is a molecular structure characterized by both function and structure [21]. Furthermore, related empirical studies on PPINs support this point and indicate that modular components do exist in these networks [22]. These results have two implications: first, these modules are composed of closely related proteins, which may have many common neighbors from the perspective of network topology; second, proteins in the same module perform similar functions together in terms of biology. Thus, many researchers believe that proteins in the same complex generally implement the same or similar functions and tend to interact with each other [23]. A PPIN is usually modeled as an undirected graph, where the nodes represent proteins and the edges correspond to protein-protein interactions. Therefore, protein complexes can be detected by mining the modular structures (i.e., dense subgraphs or subnetworks) of PPINs [24]. Based on this idea, the problem of detecting protein complexes in PPINs can be addressed computationally via graph clustering methods, where the resulting subgraphs or clusters are considered to be protein complexes. Here, clustering consists of grouping nodes into groups (also called clusters or communities) such that the nodes in the same cluster are more similar to each other than to the nodes in other clusters [25]. Therefore, to overcome the disadvantages of the experimental methods, a series of graph clustering algorithms based on machine learning and data mining have been developed as a complementary choice for detecting protein complexes.
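The graph model described above can be made concrete with a small sketch. The following Python snippet is illustrative only: the function names (`build_ppin`, `common_neighbors`, `density`) and the toy interaction list are assumptions introduced here, not part of SE-DMTG. It builds an undirected PPIN from an interaction list, finds the common neighbors of two proteins, and computes the standard edge density of a candidate subgraph.

```python
from itertools import combinations


def build_ppin(interactions):
    """Build an undirected PPIN as a dict mapping each protein to its neighbor set."""
    adj = {}
    for a, b in interactions:
        if a == b:
            continue  # this sketch ignores self-interactions
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)  # undirected: store both directions
    return adj


def common_neighbors(adj, u, v):
    """Proteins that interact with both u and v."""
    return adj.get(u, set()) & adj.get(v, set())


def density(adj, nodes):
    """Fraction of possible edges present among `nodes` (1.0 for a clique)."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(1 for u, v in combinations(sorted(nodes), 2)
                if v in adj.get(u, set()))
    return 2.0 * edges / (n * (n - 1))


# Toy network (hypothetical protein names); the duplicate ("B", "A") is absorbed by the sets.
toy_interactions = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("B", "A")]
toy_ppin = build_ppin(toy_interactions)
```

In this toy network, the triangle A, B, C is a clique, while adding the peripheral protein D lowers the density, which is the kind of signal that density-based clustering methods exploit.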

Related work
Up to now, a variety of computational algorithms for detecting protein complexes have been proposed. We first give a brief classification of the related work. The main categories are approaches based on cliques or dense subgraphs, approaches based on core-attachment structure, approaches based on model, approaches based on hierarchical clustering, and approaches based on supervised learning. We discuss these methods in the following sections.

Approaches based on cliques or dense subgraphs
Many existing algorithms assume that protein complexes correspond to k-cliques or highly dense subgraphs. Thus, over the past decade a series of algorithms based on cliques or dense subgraphs have been proposed for detecting protein complexes in PPINs, and many recent detection algorithms still belong to this category. For example, Adamcsek et al. [26] provide an application called CFinder that finds k-clique percolation clusters as protein complexes in PPINs. Another example is CMC [27], which first mines the maximal cliques from a weighted PPIN and then removes or merges highly overlapping maximal cliques. However, methods of this kind require a protein complex to be a k-clique or a clique. Consequently, some researchers try to discover dense subgraphs in a PPIN by using a heuristic search strategy. For instance, MCODE [28], one of the earliest methods of this kind, detects protein complexes as highly dense subgraphs by using a seed-extension method. Several years later, Altaf-Ul-Amin et al. [29] proposed DPClus; unlike MCODE, DPClus detects dense subgraphs as protein complexes based on the concepts of density and periphery. Following DPClus, Li et al. [30] presented an improved clustering algorithm called IPCA, which is based on diameter and density. Later, a fast, memory-efficient clustering algorithm, SPICi [31], was presented; it uses density and a support function to cluster larger networks.
Approaches based on cliques or dense subgraphs are effective at detecting k-cliques or highly dense protein complexes, but they fail to detect sparse subgraphs and relatively peripheral proteins. How to tackle these challenges remains a subject for further study.

Approaches based on core-attachment structure
Most approaches based on cliques or dense subgraphs rest on the assumption that highly connected subgraphs may be protein complexes, but they ignore the inherent organization of protein complexes. Gavin et al. [14] demonstrated that protein complexes consist of a core and some attachments: proteins in the core are highly interconnected, while attachments or protein modules interact sparsely with their core and assist it in performing subordinate functions. Several outstanding detection algorithms employ this core-attachment structure. They mainly have two stages: the first stage identifies all dense subgraphs and treats them as protein complex cores, and the second stage extends each core by adding peripheral proteins to it. For example, Wu et al. [32] developed the COACH algorithm, which first mines dense subgraphs as protein complex cores and then identifies peripheral proteins; each core together with its peripheral proteins forms a protein complex. More recently, Peng et al. [33] proposed another algorithm called WPNCA, which uses the PageRank-Nibble algorithm and the core-attachment structure. Experimental results show that WPNCA is superior to other state-of-the-art algorithms in detecting complexes.
Generally speaking, complexes identified with core-attachment structures tend to be large, whereas real protein complexes tend to be smaller. Addressing this mismatch is a direction for future research.

Approaches based on model
Approaches based on model are currently very popular in protein complex detection because they show excellent performance. Unlike most of the algorithms mentioned above, they focus predominantly on finding a relational model or graph pattern with which to predict protein complexes. This offers a new way to discover protein complexes. Markov clustering (MCL) [34] is one of the most popular models; it uses a random walk strategy in a PPIN and has two basic operators called expansion and inflation. MCL can tolerate more noise than other types of algorithms. However, its result depends on the inflation parameter, and it does not detect overlapping protein complexes. In fact, overlapping protein complexes account for a large proportion of protein complexes. Based on this fact, Nepusz et al. [35] introduced a method called ClusterONE to predict overlapping protein complexes. ClusterONE is the first method to introduce cohesiveness (also called graph modularity) to assess the quality of protein complexes. On the basis of ClusterONE, we introduced CALM [36], an improved method for detecting protein complexes. CALM first identifies overlapping nodes and seed nodes by calculating node degree and betweenness, and then uses a greedy local search approach based on the core-attachment structure and local modularity to produce the detected protein complexes.
Although the algorithms based on model perform well in detecting protein complexes, their accuracy needs to be improved by employing network topological features. For example, they could take multiple network topological properties or biological information into account.

Approaches based on hierarchical clustering
Because PPINs exhibit a tree-like form [37] and biological networks are modular in nature [38], some traditional hierarchical clustering algorithms have been applied to detect protein complexes in PPINs. The major difference among them is how they construct the hierarchical structure; more specifically, the key is how to measure the similarity of nodes. Next, we introduce some representative algorithms.
Generally, traditional hierarchical clustering algorithms cannot be used directly on PPINs that contain false positives. To overcome this challenge, Li et al. [39,40] proposed a fast hierarchical algorithm for identifying protein complexes, named FAG-EC, which is based on edge clustering coefficients and the λ-module. Wang et al. modified FAG-EC and proposed HC-PIN [41] to identify overlapping and hierarchical functional modules in a PPIN.
In summary, approaches based on hierarchical clustering provide a global perspective on the hierarchical modular organization of a PPIN. Moreover, they are easy to implement and understand. However, most of them cannot identify overlapping clusters and are sensitive to noise in the PPINs [42]. Thus, their accuracy is limited, and in practice their performance is deficient in some cases.

Approaches based on supervised learning
The computational clustering algorithms mentioned above are unsupervised. All of these unsupervised clustering algorithms consider only one of the multiple topological structures of protein complexes and do not use the known complexes; thus, they may ignore complexes with other types of topological structure.
To address this defect, and with the development of supervised learning algorithms, some researchers utilize the information of known complexes to detect protein complexes in PPINs. Supervised learning algorithms generally contain three main steps: (1) extract useful features from the known complexes; (2) train a supervised model to distinguish the real complexes from random subgraphs based on the extracted features; (3) detect protein complexes in the PPINs by using the trained model as the fitness evaluation function. So far, ClusterEPs [43] is the best among them. It uses emerging patterns to measure the likelihood that a subgraph is a complex.
Unfortunately, there is no appropriate feature selection method, and PPINs always contain a considerable amount of noise. Moreover, the number of known protein complexes available for training is too small. These disadvantages make the trained model imprecise [44]. Meanwhile, some features are often tied to the specific PPIN from which they were extracted, so these features may be unique rather than universal. As a result, the performance of such methods can decrease [45]. Therefore, overcoming these issues is critical for further improving the accuracy of protein complex detection.

Our work
The above algorithms have been shown to detect protein complexes effectively. Our work builds on the strengths and weaknesses of the related work and takes into account the fact that high-throughput PPINs are noisy and incomplete. Furthermore, proteins in the same protein complex generally possess high functional similarity and share more common neighbors. In this paper, we therefore first integrate both common neighbors and GO annotations to construct a weighted PPIN. According to existing evidence and research [30,35,46], density-based and modularity-based algorithms perform outstandingly in PPINs. Thus, we define a new model to quantitatively assess protein complex detection by considering both the density and the modularity of a subgraph, and we propose a new graph clustering method based on a seed-extension algorithm, named SE-DMTG, to detect protein complexes of various densities and modularities. In this process, we grow each seed node into a subgraph until the subgraph is a locally optimal cluster. We then remove redundant detected complexes and treat the remaining complexes as the final identified protein complexes. Finally, to validate the performance of SE-DMTG, we apply it to PPINs of three different species and compare the results, in terms of the F-measure and Jaccard, with those of some representative state-of-the-art algorithms by using several known protein complex datasets that are widely used in biological experiments. The experimental results demonstrate that SE-DMTG outperforms the competing algorithms in terms of accuracy and matching with known complexes. In addition, the identified protein complexes are subjected to functional enrichment analysis to ascertain their biological significance.
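The idea of growing a seed node into a locally optimal cluster can be illustrated with a generic greedy sketch. This is not the SE-DMTG fitness function or seed selection strategy, which are not given in this section: `placeholder_fitness` is a hypothetical score (edge density multiplied by the fraction of edges that stay inside the cluster), chosen only so that the example rewards both density and modularity, and the toy graph and all names are assumptions.

```python
def placeholder_fitness(adj, cluster):
    """Hypothetical score: density times the share of incident edges kept inside the cluster."""
    n = len(cluster)
    if n < 2:
        return 0.0
    e_in = sum(1 for u in cluster for v in adj[u] if v in cluster) / 2  # each edge counted twice
    e_out = sum(1 for u in cluster for v in adj[u] if v not in cluster)
    dens = 2 * e_in / (n * (n - 1))
    return dens * e_in / (e_in + e_out) if e_in else 0.0


def grow_from_seed(adj, seed, score=placeholder_fitness):
    """Greedily add the neighbor that most improves the score; stop at a local optimum."""
    cluster = {seed}
    best = score(adj, cluster)
    while True:
        frontier = sorted({v for u in cluster for v in adj[u]} - cluster)
        if not frontier:
            break
        top, node = max(((score(adj, cluster | {v}), v) for v in frontier),
                        key=lambda t: t[0])
        if top <= best:
            break  # no neighbor improves the score: locally optimal
        cluster.add(node)
        best = top
    return cluster


# Toy graph (hypothetical): a triangle A-B-C with a tail C-D-E.
toy_adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
toy_cluster = grow_from_seed(toy_adj, "A")
```

Starting from A, the sketch recovers the triangle and stops before absorbing the sparse tail. Any other fitness function with the same signature can be passed through the `score` argument.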

Protein complexes selection
To evaluate the performance of different protein complex detection algorithms, we use known protein complex sets as standard complexes. For yeast, we employ two sets, namely CYC2008 [63] and SGD [64], to evaluate the quality of the protein complexes identified by various algorithms in yeast PPINs. In particular, CYC2008 is constructed from three sources: (1) MIPS [65], (2) Aloy et al. [66], and (3) the SGD database [67]. For human, we use two sets of standard complexes: (1) the CORUM complexes [68] and (2) the CGPK complexes [61], which are constructed from four sources: (a) the Comprehensive Resource of Mammalian protein complexes (CORUM) [68]; (b) protein complexes annotated by GO [69]; (c) the Proteins Interacting in the Nucleus database (PINdb) [70]; and (d) KEGG modules [71]. For mouse, we use the CORUM complexes [68]. Following Nepusz et al. [35], we further eliminate protein complexes that are made up of fewer than three proteins and discard redundant protein complexes. The remaining known protein complexes in these databases are used for performance evaluation. A summary of these standard protein complexes is presented in Table 3.

Preprocessing
For yeast, we directly use the protein name to represent the proteins in the PPIN and in the protein complexes. For human and mouse, the different PPINs and the different standard protein complexes come from different sources and are heterogeneous in many aspects. Therefore, we use the UniProt ID [72] to represent each protein in this study. As a result, we have a uniform way to represent proteins in both the PPINs and the standard protein complexes. In this process, we remove all duplicate interactions and all proteins that have no associated UniProt accession ID.

Neighbor(SG) denotes the set of nodes that do not belong to the cluster SG but are connected by at least one edge to a protein in SG; inner_nodes(SG) denotes the set of nodes that belong to SG and are connected to at least one neighbor of SG.
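As a minimal sketch of the two boundary sets Neighbor(SG) and inner_nodes(SG) mentioned in this section, the following Python functions compute them for a cluster in an undirected graph stored as an adjacency dictionary. The function names, the data layout and the toy graph are assumptions made for illustration.

```python
def neighbor_set(adj, sg):
    """Neighbor(SG): nodes outside SG that share at least one edge with a node in SG."""
    sg = set(sg)
    return {v for u in sg for v in adj.get(u, ()) if v not in sg}


def inner_nodes(adj, sg):
    """inner_nodes(SG): nodes inside SG that are adjacent to at least one neighbor of SG."""
    sg = set(sg)
    # Any node outside SG that is adjacent to u is, by definition, a neighbor of SG.
    return {u for u in sg if any(v not in sg for v in adj.get(u, ()))}


# Toy graph (hypothetical): a triangle A-B-C with a tail C-D-E.
boundary_adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
sg_example = {"A", "B", "C"}
```

For the cluster {A, B, C}, only D lies on the outside boundary and only C touches it, so a seed-extension step would consider adding D or removing C.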

Gene Ontology(GO) selection
As for the Gene Ontology (GO) file, for yeast we use GO slims, a cut-down version of GO that contains a subset of the terms in the whole yeast GO. Because the GO slims of the Cellular Component (CC) ontology include some protein complex information, we use only the GO slims of Biological Process (BP) and Molecular Function (MF) as GO annotations. The GO slim information is downloaded from https://www.yeastgenome.org/. Similarly, for human and mouse, we annotate each protein with its associated BP and MF GO annotations based on UniProt [72] (available at https://www.uniprot.org/), from which we download the mapping files.

Evaluation metrics
For the purpose of performance evaluation, this section introduces the evaluation metrics used in this paper. These metrics measure the degree of matching between the complexes identified by different algorithms and the standard complexes. Generally, the value of each metric lies in the interval from 0.0 to 1.0; the higher the value, the better the quality of the clustering results and the better the performance of the detection algorithm.

1) Precision, Recall, and F-measure: To evaluate the performance of all algorithms, we match the detected complexes with the known complexes. First, we introduce the overlap score (OS) between a detected protein complex p and a known complex g, which is defined as follows [73]:

$$OS(p, g) = \frac{|N_p \cap N_g|^2}{|N_p| \times |N_g|} \quad (1)$$

Here, $|N_p|$ is the size of the detected complex, $|N_g|$ is the size of the known complex, and $|N_p \cap N_g|$ is the number of proteins common to the detected and known complexes. If $OS(p, g) \geq \omega$, we consider p and g to match each other. In our experiments, we set $\omega = 0.2$, which is consistent with previous studies [28,29]. With the overlap score defined, Precision, Recall, and F-measure are given as follows [74]:

$$Precision = \frac{N_{cp}}{|P|}, \qquad Recall = \frac{N_{cg}}{|G|}, \qquad F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

where P is the set of detected complexes, G is the set of known complexes, $N_{cp}$ is the number of detected complexes that match at least one known complex, and $N_{cg}$ is the number of known complexes that match at least one detected complex. The F-measure is the harmonic mean of Precision and Recall and assesses the overall performance of the detection algorithms.
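The matching-based metrics can be sketched in a few lines of Python. The snippet assumes the commonly used form of the overlap score, OS(p, g) = |Np ∩ Ng|² / (|Np| × |Ng|), and the usual reading of N_cp and N_cg as the numbers of detected and known complexes that have at least one match; the function names and the toy complexes are hypothetical.

```python
def overlap_score(p, g):
    """OS(p, g) = |p ∩ g|^2 / (|p| * |g|) for two protein sets."""
    p, g = set(p), set(g)
    if not p or not g:
        return 0.0
    common = len(p & g)
    return common * common / (len(p) * len(g))


def precision_recall_f(detected, known, omega=0.2):
    """Precision, Recall and F-measure under the matching threshold omega."""
    n_cp = sum(1 for p in detected
               if any(overlap_score(p, g) >= omega for g in known))
    n_cg = sum(1 for g in known
               if any(overlap_score(p, g) >= omega for p in detected))
    precision = n_cp / len(detected) if detected else 0.0
    recall = n_cg / len(known) if known else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy data (hypothetical protein names).
toy_detected = [{"A", "B", "C"}, {"X", "Y", "Z"}]
toy_known = [{"A", "B", "C", "D"}, {"E", "F", "G"}]
toy_scores = precision_recall_f(toy_detected, toy_known)
```

Here one of the two detected complexes matches a known complex (OS = 9/12 = 0.75), and one of the two known complexes is matched, so Precision, Recall and F-measure all equal 0.5.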
2) JaccardI, JaccardS and Jaccard: As is well known, Precision, Recall and F-measure rely on a threshold to judge whether a standard complex and an identified complex match. This has limitations because it does not consider the extent of the overlap between an identified complex and the corresponding standard complex [75]. Therefore, we also use the Jaccard measure to evaluate clustering results [76,77]. It considers the proportion of the overlap relative to the union of an identified complex and a standard complex [75]. For more details, please refer to Song et al. [76].
Before giving these metrics, we first introduce some notation. Let I be the set of complexes identified by a specific detection algorithm, and let S be the set of standard complexes. Moreover, let $S_i \in S$ be a standard complex and $I_j \in I$ an identified complex. The Jaccard coefficient between them is defined as $Jac(S_i, I_j) = |S_i \cap I_j| / |S_i \cup I_j|$ [77]. For each identified complex $I_j$, its Jaccard measure is the maximum Jaccard coefficient over all standard complexes, i.e., $Jac(I_j) = \max_{S_i \in S} Jac(I_j, S_i)$. Taking an average over the identified complexes, weighted by complex size, we compute the weighted average Jaccard measure for all identified complexes:

$$JaccardI = \frac{\sum_{I_j \in I} |I_j| \, Jac(I_j)}{\sum_{I_j \in I} |I_j|}$$

Similarly, for a standard complex $S_i$, its Jaccard measure is $Jac(S_i) = \max_{I_j \in I} Jac(S_i, I_j)$, and the weighted average over all standard complexes is

$$JaccardS = \frac{\sum_{S_i \in S} |S_i| \, Jac(S_i)}{\sum_{S_i \in S} |S_i|}$$

Finally, the Jaccard measure between identified complexes and standard complexes is defined as the harmonic mean of JaccardI and JaccardS:

$$Jaccard = \frac{2 \times JaccardI \times JaccardS}{JaccardI + JaccardS}$$
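The Jaccard-based metrics can be sketched as follows. The snippet assumes that JaccardS is size-weighted in the same way as JaccardI, which is how the text describes the identified side; the function names and toy complexes are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two protein sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_best_match(source, target):
    """Size-weighted average, over `source`, of each complex's best Jaccard against `target`."""
    total_size = sum(len(c) for c in source)
    if total_size == 0:
        return 0.0
    weighted = sum(len(c) * max((jaccard(c, t) for t in target), default=0.0)
                   for c in source)
    return weighted / total_size


def jaccard_measure(identified, standard):
    """Return (JaccardI, JaccardS, Jaccard), the last being their harmonic mean."""
    jac_i = weighted_best_match(identified, standard)
    jac_s = weighted_best_match(standard, identified)
    total = jac_i + jac_s
    combined = 2 * jac_i * jac_s / total if total else 0.0
    return jac_i, jac_s, combined


# Toy data (hypothetical protein names).
toy_identified = [{"A", "B", "C"}, {"X", "Y"}]
toy_standard = [{"A", "B", "C", "D"}]
toy_jaccard = jaccard_measure(toy_identified, toy_standard)
```

In this toy case JaccardI = (3 × 0.75 + 2 × 0) / 5 = 0.45, JaccardS = 0.75, and their harmonic mean is 0.5625. Unlike the thresholded F-measure, the unmatched complex {X, Y} lowers the score in proportion to its size.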
From its definition, the Jaccard measure can evaluate the performance of detection algorithms better than the F-measure, especially when comparing the matching ratios of different algorithms.
3) p-value: To evaluate the statistical significance of the detected protein complexes, many researchers annotate their main biological functions by using the p-value [23,78]. We perform a functional enrichment test to demonstrate the biological significance of the protein complexes detected by different algorithms. In this paper, we use LAGO [78] to perform the functional enrichment test with different thresholds. LAGO is a fast tool that finds significant GO terms among a list of gene names; it computes the significance (p-value) via the hypergeometric distribution and applies Bonferroni correction by default. For details on calculating the p-value, please refer to [78]. The p-value measures the biological relevance of a detected protein complex and is defined as follows:

$$p\text{-}value = 1 - \sum_{i=0}^{k-1} \frac{\binom{F}{i} \binom{N-F}{C-i}}{\binom{N}{C}}$$

where N is the number of proteins in the PPIN, C is the number of proteins in the detected protein complex, F is the size of a functional group in the PPIN, and k is the number of proteins of the functional group that appear in the protein complex. Generally, the lower the p-value, the stronger the biological significance of the protein complex. A detected protein complex with a p-value of less than 0.01 is deemed meaningful. In addition, larger protein complexes tend to have smaller p-values.
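A hypergeometric enrichment p-value of the kind described here can be computed with the standard library alone. The sketch below is not LAGO's implementation and applies no Bonferroni correction; it assumes the usual upper-tail form, P(X ≥ k), which equals one minus the probability of observing fewer than k annotated proteins. The function name and the example numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_network, complex_size, group_size, k):
    """P(X >= k) for X ~ Hypergeometric(N = n_network, F = group_size, C = complex_size).

    Summing the upper tail directly is algebraically equal to
    1 - sum_{i<k} C(F, i) * C(N - F, C - i) / C(N, C),
    but avoids subtracting two nearly equal numbers when the p-value is tiny.
    """
    total = comb(n_network, complex_size)
    upper = min(group_size, complex_size)
    tail = sum(comb(group_size, i) * comb(n_network - group_size, complex_size - i)
               for i in range(max(k, 0), upper + 1))
    return tail / total


# Hypothetical numbers: 10 proteins in the network, a complex of 3 proteins,
# a functional group of 4 proteins, 2 of which fall inside the complex.
toy_p = enrichment_p_value(10, 3, 4, 2)
```

With these numbers the tail is (36 + 4) / 120 = 1/3, which is far from significant. Larger complexes that are dominated by one functional group push the tail probability down sharply.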

Comparison with existing algorithms based on known protein complexes
We conduct experiments on six PPINs to compare SE-DMTG with the following state-of-the-art protein complex detection algorithms: MCODE [28], MCL [34], CFinder [26], DPClus [29], IPCA [30], CMC [27], COACH [32], HC-PIN [41], SPICi [31], ClusterONE [35], WPNCA [33], CALM [36], and ClusterEPs [43]. All parameters are set as recommended by their authors (Table 4). To evaluate the performance of all algorithms more comprehensively, all detection algorithms are tested on three different species: yeast, human and mouse. The three yeast PPINs are the Krogan-core, DIP and combined6 datasets. The human PPINs are DIP and a combined dataset (HPRD+BioGRID). We use the BioGRID dataset as the mouse PPIN. All results are presented in Tables 5, 6, 7, 8 and 9. Because the results are similar, we analyze only the yeast results in detail and briefly introduce the remaining results.
The F-measure results of the different algorithms on the yeast PPINs are summarized in Table 5. As Table 5 shows, although SE-DMTG does not always obtain the best precision or recall, it remains in the top three in all cases. Furthermore, SE-DMTG obtains the best F-measure on all three yeast datasets, which means that it achieves a better compromise between precision and recall than the other algorithms. In other words, SE-DMTG is better than the other algorithms in terms of the overall accuracy of the detected protein complexes. The principal reason is that SE-DMTG takes into consideration not only Gene Ontology data but also the topological structure of the tested PPIN.
We have mentioned the limitations of precision, recall and F-measure earlier in this paper. We therefore also employ the Jaccard measure to reflect the matching ratio between the detected protein complex set and the standard complex set. Table 6 presents the comparative performance of the different algorithms based on the Jaccard metrics, using the CYC2008 and SGD standard complexes, respectively. As can be seen from Table 6, SE-DMTG consistently outperforms the other algorithms on the Jaccard metric in all three yeast PPINs, which suggests that SE-DMTG performs better than the other classical algorithms in terms of matching ratio on all three datasets. This analysis suggests that the new fitness function we designed is suitable for protein complex detection and that using GO annotations for this task is reasonable.
Moreover, we use the Krogan-core dataset to compare the performance of all methods with CYC2008 and SGD as the standard complexes. As shown in Table 6, the Jaccard values of SE-DMTG are 0.4688 and 0.4008, respectively, which significantly outperform those of the other algorithms. Similarly, on the DIP dataset, SE-DMTG achieves the highest Jaccard values (0.386 and 0.3485). For the combined6 dataset, SE-DMTG also achieves the highest Jaccard values, namely 0.5208 and 0.493, respectively. The Jaccard values of SE-DMTG on the combined6 dataset are thus higher than those on the other datasets. This is mainly because combined6 is more reliable than the other two datasets. In other words, a PPIN that combines multiple source datasets may contain more true protein-protein interactions.
To further demonstrate the effectiveness of SE-DMTG on the PPINs of other species, we also carry out experiments on the human and mouse PPINs. All comparison results are listed in Tables 7, 8 and 9. SE-DMTG again achieves the highest F-measure and Jaccard in most cases. It is noteworthy that a higher F-measure means that protein complexes are identified more accurately, and a higher Jaccard means that a detection algorithm has a better matching ratio between detected protein complexes and real protein complexes. In summary, across the PPINs of different species, SE-DMTG performs best among the compared algorithms in terms of F-measure and Jaccard in most cases.

Biological significance of the detected protein complexes
Because the set of known protein complexes is incomplete, we also calculate the p-values of the detected protein complexes on the Cellular Component (CC) ontology by using the tool LAGO (http://go.princeton.edu/cgi-bin/LAGO), which performs functional enrichment analysis [78]. All parameters of LAGO are set to their defaults. Because CC includes information about protein complexes, it allows a better comparison of the performance of different algorithms. Generally speaking, each protein complex detected by a detection algorithm is associated with a p-value that reflects its GO annotations. If the p-value of a protein complex is less than 0.01, we consider the complex biologically significant. In fact, the p-values of detected protein complexes are closely related to their size [33].
Here, the functional enrichment of the protein complexes detected by different algorithms is reported in Table 10.
In Table 10, we summarize the results of DPClus, IPCA, CMC, COACH, SPICi, ClusterONE, WPNCA and SE-DMTG from functional enrichment tests with different p-value thresholds. As shown in Table 10, in most cases SE-DMTG detects more candidate protein complexes than other methods such as DPClus, CMC, SPICi and ClusterONE in all PPINs. Furthermore, in terms of the number, percentage and average p-value of significant complexes, the protein complexes detected by SE-DMTG compare favorably with those of the algorithms mentioned above. As Table 10 shows, although IPCA detects the largest number of significant protein complexes, its percentage and average p-value of significant detected protein complexes are slightly lower than those of SE-DMTG, COACH and WPNCA. The percentage and average p-value of significant protein complexes detected by SE-DMTG in the six PPINs are a bit lower than those of COACH and WPNCA, making SE-DMTG the third highest among all methods. The major reason is that the protein complexes detected by SE-DMTG are smaller than those detected by COACH and WPNCA, and smaller detected protein complexes have larger p-values. The relationship between the size and the p-value of detected protein complexes is discussed in more detail in the section "The relationship between the size of detected protein complexes and the p-value of detected protein complexes".

Examples of detected complexes
In Tables 11 and 12, we further present the computation results: 18 detected protein complexes with very low p-values. To further illustrate the comparison results obtained by SE-DMTG, we take the 391st known protein complex of the CGPK complexes, the 'RNase complex', as an example. As shown in Fig. 1a, the known protein complex has 11 proteins. The protein complex detected by SE-DMTG also consists of 11 proteins and matches all of them, so its OS is 100%, which is the highest among all algorithms. This result is shown in Fig. 1b. For the IPCA, DPClus, COACH, WPNCA, MCL and SPICi algorithms, the OS values are only 73%, 73%, 68%, 68%, 54% and 47%, respectively. For the remaining algorithms, the OS (Eq. (1)) is lower than 0.47 or they are unable to produce a detected result, so we do not show them in Fig. 1. This result means that SE-DMTG can detect protein complexes accurately, indicating that the new definition of a protein complex is also a good model for characterizing the topological structure of protein complexes. Additionally, this example helps explain why SE-DMTG achieves the highest F-measure and Jaccard while its percentage of significant detected protein complexes and its average p-value are slightly lower than those of COACH and WPNCA. In summary, the protein complexes detected by SE-DMTG have high biological significance. Based on the results of the p-value test, we conclude that SE-DMTG detects protein complexes quite accurately and achieves good functional enrichment compared with the other thirteen algorithms.

The relationship between the size of detected protein complexes and the p-value of detected protein complexes
To illustrate the relationship between the size of detected protein complexes and their p-values, we perform some statistical analysis. Because the sizes of standard complexes and detected protein complexes resemble a power-law distribution, we display only part of the distribution information in Fig. 2. According to Fig. 2a, most standard complexes are very small. As shown in Fig. 2b, standard complexes whose size is less than or equal to 7 account for 76.96% of all standard complexes. Meanwhile, our statistics show that the average size of the combined standard complexes is 6.38 and the average size of the protein complexes detected by SE-DMTG is 6.86, whereas the average sizes of the protein complexes detected by IPCA, COACH and WPNCA are 10.96, 10.20 and 27.12, respectively. The average size of the protein complexes detected by SE-DMTG is therefore close to that of the standard complexes. By contrast, Fig. 2c shows that IPCA, COACH and WPNCA detect a larger number of large protein complexes. Additionally, the size distribution of the protein complexes detected by SE-DMTG is similar to that of the standard complexes (Fig. 2a and c).
Next, Fig. 3 illustrates the relationship of the size of protein complexes with the percentage of significant detected protein complexes and the average p-value of detected protein complexes. From Fig. 3, it is obvious that the p-value (E) decreases gradually as the size of the detected protein complexes increases. For example, the p-value of standard complexes decreases gradually with increasing complex size in Fig. 3a. Similarly, for the protein complexes detected by IPCA in Fig. 3c, the p-value decreases gradually as the size of the detected protein complexes increases. This illustrates that large detected protein complexes have small p-values. However, from Fig. 2a and b, we know that most standard complexes and most protein complexes detected by SE-DMTG are small. The above analysis explains why SE-DMTG has a higher accuracy and matches the standard complexes better according to Tables 5, 6, 7, 8 and 9. However, as for the percentage of significant detected protein complexes and the average p-value of detected protein complexes, SE-DMTG is slightly lower than COACH and WPNCA, and it is the third highest among all methods according to Table 10. All in all, although the p-value has limitations in evaluating the functional significance of detected protein complexes, it still reflects the functional enrichment of detected protein complexes to a certain extent.

Table 6 presents 18 detected protein complexes which have low p-values. The first column and the fourth column show their ID and their p-value. The second column presents the size of the detected protein complexes. Gene ontology term (in the third column) shows the proteins of the detected complexes, in which the protein with emph style matches the gene ontology. Number annotated (in the fifth column) represents the number of genes from the detected protein complexes that are found within the annotation and within the aspect
Overall, considering the superior accuracy and matching ratio and their strong performance in the function enrichment test, we believe the protein complexes detected by SE-DMTG are more likely to be real protein complexes.

Computational complexity of SE-DMTG

Experimental setup
We implement SE-DMTG in Python and execute all the experiments on a 64-bit Windows system with 12 GB of memory and a 3.60 GHz Intel i7 CPU. All state-of-the-art methods are executed on the same machine, except SPICi, which is used through its web site.

Time complexity analysis
In this part, we analyze the time complexity of the SE-DMTG algorithm. It is difficult to give the exact computational complexity of SE-DMTG because it depends not only on the number of detected protein complexes but also on their size. Moreover, for each seed, we need to execute an iterative procedure until the current cluster no longer changes, so the number of iterations has a significant influence on the computational complexity of SE-DMTG. Thus, we only roughly analyze the time complexity. Let n and m denote the number of nodes and edges in graph G, respectively, and let k be the average number of neighbors of all the nodes. Then we have  *  log(n)).
In the step that generates detected protein complexes, the worst case is that we need to calculate the fitness of each protein, and its worst-case time complexity is also $4m^2/n + 4m + n$. In this step, we first analyze the time complexity when SE-DMTG iteratively adds proteins to the cluster SG from its neighbors. It has three basic phases: (1) obtain all candidate nodes which will be added to the cluster SG, whose time complexity where n_SG is the size of the cluster SG.  1 + n_SG)). Finally, we need to discard some redundant protein complexes, whose time complexity is $O(PCs^2)$, where PCs is the number of candidate identified protein complexes. All in all, the time complexity of the algorithm SE-DMTG is $O(2m + 4m^2/n + 4m + n + n \log(n) + N \cdot t \cdot \frac{m}{n} \cdot N_{SG} (N_{SG} - 1) + 3 n_{SG} (1 + n_{SG}) + \mathrm{len}(PCs)^2)$, where N, t and PCs are constant. In addition, we treat N_SG and n_SG as variables. To facilitate an intuitive understanding of these variables, Table 13 provides more details.

Conclusion
Many high-throughput experimental techniques and computational algorithms have been developed to identify protein complexes from the PPINs. However, most of these methods are based on the original network or use the topological property alone and are thus limited in terms of not only the quality of protein complex identification but also ignoring other useful biological information, such as functional properties. In our opinion, both topological and functional properties are meaningful and important for identifying protein complexes. We therefore combine common neighbor and functional properties to calculate edge weights and construct weighted PPINs. Moreover, we also propose a new local search heuristic graph clustering algorithm, SE-DMTG, to extract detected protein complexes with various densities and modularities based on a new model. Although models that consider density or modularity have been applied to study PPINs, our model is novel in considering both density and modularity simultaneously.
We evaluate the performance of the proposed SE-DMTG on three PPINs of species under some standard complex datasets and compare the results with those of thirteen competing algorithms. The experimental results show that SE-DMTG is competitive in identifying protein complexes and that adding the topological information and GO information increases the detection accuracy. Meanwhile, the experimental results reveal that SE-DMTG outperforms the current state-of-the-art algorithms overall in terms of some measures. Furthermore, we analyze the biological significance of the protein complexes detected by different methods. The results show that the protein complexes detected by SE-DMTG are biologically significant. With the wide application of supervised learning, we will try to design a new algorithm that combines a classification model with unsupervised clustering algorithms to improve the performance in the future. Additionally, SE-DMTG is also robust to false positives in experimental data because of the integration of functional properties.

Fig. 2 The distribution of the size of protein complexes in the PPIN. In panels a and c, the horizontal axis shows the different algorithms and the size of the protein complexes, and the vertical axis shows the number of protein complexes that fall in each size. Panel b shows the size distribution of the combined standard protein complexes consisting of CYC2008 and SGD complexes.

Fig. 3 Values of p-value (E) for different sizes of standard and detected protein complexes in the combined6 dataset. The horizontal axis is the size of protein complexes and the vertical axis is the average p-value (E) of protein complexes of this size. a CYC2008 standard protein complexes; b SGD standard protein complexes; c protein complexes detected by IPCA; d protein complexes detected by SE-DMTG; e protein complexes detected by COACH; f protein complexes detected by WPNCA
Furthermore, SE-DMTG may be extended naturally to other types of biological data fusion to study more comprehensive characteristics of the biological networks and to analyze other forms of complex networks, such as Internet networks, citation networks, ecological networks and social networks.

Preliminaries
Since the interactions among proteins in the PPINs are symmetric, these PPINs can be formulated as an undirected weighted graph G = (V, E, W), where V is a set of nodes representing the proteins of the PPINs, E is a set of undirected edges corresponding to the interactions, and W represents the likelihoods of the interactions between nodes. In this paper, we obtain the weights by using the topological information and the biological information. The symbols, abbreviations and their interpretation are shown in Table 1.

Algorithm framework
The SE-DMTG algorithm is developed to detect protein complexes based on GO annotations and the topological structure of PPINs. Furthermore, we propose a composite model for the identification of protein complexes. Algorithm 1 presents the main function of the proposed SE-DMTG. SE-DMTG operates in three phases. In the first phase, given a PPIN, we construct a weighted PPIN by using common neighbors and GO annotations as defined by Eqs. (7) and (8). In the second phase, SE-DMTG constructs a seed node queue based on the seed score function defined by Eq. (12) to form the initial cluster. In the third phase, based on the initial cluster from the previous phase, we provide a quantitative definition of protein complexes to formulate protein complex identification as an optimization problem defined by Eq. (17). Finally, we apply an iterative greedy search process to generate protein complexes (see Algorithm 2). False and redundant candidate protein complexes are filtered out to obtain the final identified protein complexes. Figure 4 shows a flowchart of SE-DMTG, which is composed of the following main steps:
1. Construct a weighted PPIN based on common neighbors and GO annotations.
2. Generate a seed queue and form an initial cluster.
3. Define the protein complex model.
4. Extend and correct the cluster to generate a locally optimal subgraph.
5. Obtain a list of identified protein complexes.
In step 1, the edge clustering coefficient probability is computed based on common neighbors via Eq. (7). The functional similarity between two proteins is calculated based on GO annotations according to Eq. (8). In step 2, we give each protein a score on the basis of both the weight degree (see Eq. (10)) and the neighborhood graph clustering coefficient (see Eq. (11)), and we sort the proteins by their score according to Eq. (12). In step 3, we introduce a new model to estimate the quantitative value of a cluster (see Eq. (17)). In step 4, we iteratively extend and correct the cluster to generate a protein complex from the weighted PPIN. This process involves four sub-steps: selecting the highest-scoring protein in the seed queue as the seed node to form the initial cluster; assessing the priority of boundary nodes (see the section on determining the priority of boundary nodes); iteratively adding neighbor nodes to the cluster and removing inner nodes from the cluster; and filtering out false candidate protein complexes whose size is less than or equal to two (see the section on extending and correcting the cluster to generate a locally optimal subgraph). In step 5, we discard some redundant candidate protein complexes and output a list of identified protein complexes. For more details of these processes, see the related sections.

Construction of a weighted PPIN based on common neighbors and GO annotations
Recent studies [30,35,36] have shown that the accuracy of protein complex detection can be significantly improved by taking network weights into account. In the following subsections, we introduce how to calculate the weight of the PPIN.

Common neighbors
The edge clustering coefficient [47] was first developed to describe how strongly neighbors are connected. However, Radicchi et al. [47] note that the edge clustering coefficient may not be suitable for use in PPINs because PPINs are disassortative networks. To overcome this limitation, Zhao et al. [48,49] propose a new method to calculate the possibility of protein-protein interactions. Following their work, we use the same method, namely common neighbors (CN), to calculate the weight of each edge. The existence probability of an edge (v, u) in a PPIN is defined by Eq. (7), where N(v) and N(u) are the neighborhood sets of v and u, respectively. In Eq. (7), |N(v) ∩ N(u)| denotes the number of common neighbors of the two proteins. CN is a measure that describes how closely proteins v and u are related. In this paper, we assume that the similarities of different interactions are independent of each other. The higher the value is, the larger the probability that proteins v and u belong to the same protein complex.

Fig. 4 The framework of SE-DMTG

Protein functional similarity computation
On the other hand, from a biological perspective, gene ontology (GO) [50] is currently one of the most comprehensive ontology databases in the bioinformatics community [51]. The database provides a series of GO terms to describe gene product features. Proteins constituting a complex are likely to have similar functions. A large functional similarity means higher confidence that two proteins share similar functions. In other words, if two interacting proteins v and u have more common GO annotations and their functions are more similar, then they are more likely to belong to the same protein complex. Additionally, proteins with similar functions tend to be co-expressed [52]. Note that when the two terminal nodes v and u of an edge (v, u) do not have common GO annotations, the weight of edge (v, u) may be regarded as noise and is set to 0.0. Here, we define a new measure, given in Eq. (8), to describe the similarity of two interacting proteins v and u based on a biological similarity function.
In this paper, SE-DMTG integrates both the topological and biological information of the PPIN by using CN and GO. CN captures the static topological information and GO assesses the functional similarity of proteins. To incorporate both measures into our method, we use their arithmetic mean as the edge weight in the PPINs. The weight of each edge between two proteins is calculated as follows: $w(v, u) = \frac{CN(v, u) + GO(v, u)}{2}$ (9). Here,
1. Neighbors shared by two proteins in the network are called the common neighbors (CN) of Eq. (7).
2. The functional similarity of two proteins is quantified in terms of the GO annotation (GO) in Eq. (8).
The above two properties express the interaction based on CN and GO annotations. Note that the value of w(v, u) has a range between 0.0 and 1.0 and is used for evaluating the reliability of protein pairs to construct a weighted PPIN. The weights of each edge in the PPIN are obtained by integrating both topological information and biological information. Edges whose weights are 0.0 are considered to be noise and are deleted from the PPIN.
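The weighting step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `combine_edge_weights` is hypothetical, the CN and GO scores are assumed to be precomputed by Eqs. (7) and (8) and passed in as dictionaries, and a GO score of 0.0 is assumed to stand for "no common GO annotation".

```python
def combine_edge_weights(cn_scores, go_scores):
    """Return {edge: weight} with weight = (CN + GO) / 2, dropping noise edges.

    cn_scores, go_scores: dicts keyed by an edge tuple (v, u). The scores are
    assumed to be precomputed by Eqs. (7) and (8) and to lie in [0.0, 1.0].
    """
    weights = {}
    for edge, cn in cn_scores.items():
        go = go_scores.get(edge, 0.0)
        # Assumption: a GO score of 0.0 means "no common GO annotation",
        # in which case the edge is treated as noise (weight 0.0).
        weight = 0.0 if go == 0.0 else (cn + go) / 2.0
        if weight > 0.0:  # edges whose weight is 0.0 are deleted from the PPIN
            weights[edge] = weight
    return weights


cn = {("A", "B"): 0.5, ("B", "C"): 0.4}
go = {("A", "B"): 1.0, ("B", "C"): 0.0}
print(combine_edge_weights(cn, go))  # only the edge ("A", "B") survives
```

Keeping the CN and GO scores as inputs leaves the two similarity measures interchangeable, so the averaging and noise-filtering logic can be checked on its own.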

Generation of a seed queue and formation of the initial cluster
Choosing high-quality protein seeds for expansion is critical. Each cluster starts from an initial cluster that consists of a single node, generally called the seed node. An inappropriate choice of a seed node will likely affect the process of detecting protein complexes. For example, a low-quality seed node may result in a false positive protein complex being detected. Furthermore, if a protein that belongs to multiple complexes is chosen as a seed node, the resulting identified complex may subsume the multiple complexes under an unrealistically large false protein complex that cannot match any real protein complex [36]. From a topological perspective, the central part of a protein complex often corresponds to a dense subgraph with a high clustering coefficient and more reliable weights in the PPINs [29-31, 46, 53]. According to the preliminaries section, we have given a confidence score $0 \le w_{v,u} \le 1.0$ to every edge (v, u) ∈ E. We utilize several measures to select seed nodes. For each node v in the PPIN, we define its weight degree, d_w(v), as the sum of all its edge weight values: $d_w(v) = \sum_{u \in N(v)} w(v, u)$ (10). For each node v, the neighborhood graph consists of v, all its neighbors and the edges among them. Furthermore, the neighborhood graph clustering coefficient (NGCC) is the sum of the weights of the edges divided by the total number of possible edges. Thus, for a node v, the NGCC is defined in Eq. (11) [54]: $NGCC(v) = \frac{\sum_{v,u \in V_v} w(v, u)}{(|V_v| \cdot (|V_v| - 1)) / 2}$ (11). Here, V_v is the degree of node v, $\sum_{v,u \in V_v} w(v, u)$ is the sum of the weights of the edges, and $(|V_v| \cdot (|V_v| - 1)) / 2$ is the total number of triangles that could pass through node v. The NGCC reflects the weighted degree of aggregation of proteins in the PPINs. Note that the NGCC is a measure of the closeness of the node v and its neighbors, which varies from 0.0 to 1.0.
We devise the following score function to sort all proteins in a PPIN. If a protein has a higher score according to Eq. (12), it is more likely to be used as the seed node, to be inside a protein complex, and to have high centrality in the complexes. Thus, the score of each protein v is defined as the product of its neighborhood graph clustering coefficient and its weight degree: $Score(v) = NGCC(v) \times d_w(v)$ (12). The seed score function takes both weight degree centrality and neighborhood graph density into consideration when prioritizing proteins as seeds. Here, we sort all proteins in the PPIN and use a queue (data structure) SQ to record the order. We select the protein with the highest score according to Eq. (12) as the seed node to grow a detected protein complex. Once a new detected protein complex is generated, all nodes in it are recorded in a list, and we choose the highest-scoring node in the queue SQ that has not been visited as the next seed node. Note that we calculate the score of each protein only once based on the PPIN, which is more biologically meaningful [30].
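The seed selection described above can be sketched as follows. This is a minimal sketch under stated assumptions: the graph is stored as a nested dictionary `adj[v][u] = w(v, u)` (a representation chosen here for illustration), the function names are hypothetical, and V_v is taken to be the node set of the neighborhood graph, that is, v together with its neighbors.

```python
from itertools import combinations


def weighted_degree(adj, v):
    """Weight degree of v: the sum of the weights of all edges at v."""
    return sum(adj[v].values())


def ngcc(adj, v):
    """Neighborhood graph clustering coefficient of v.

    Assumption: the neighborhood graph contains v and all its neighbors;
    the edge weights inside it are summed and divided by the number of
    possible edges among those nodes.
    """
    nodes = [v, *adj[v]]
    if len(nodes) < 2:
        return 0.0
    total = sum(adj[a].get(b, 0.0) for a, b in combinations(nodes, 2))
    possible = len(nodes) * (len(nodes) - 1) / 2
    return total / possible


def build_seed_queue(adj):
    """Score every protein once and return (queue, scores), best seed first."""
    scores = {v: ngcc(adj, v) * weighted_degree(adj, v) for v in adj}
    queue = sorted(adj, key=lambda v: scores[v], reverse=True)
    return queue, scores


# Toy network: a triangle a-b-c plus a weakly attached protein d.
adj = {
    "a": {"b": 1.0, "c": 1.0, "d": 0.5},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0},
    "d": {"a": 0.5},
}
queue, scores = build_seed_queue(adj)
print(queue)
```

In this toy network the triangle members b and c receive the highest score, while the pendant protein d is placed last in the queue.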

Definition of a protein complex model
As mentioned in the Background section, several protein complexes identification algorithms have been presented. Most existing algorithms make many assumptions to define a subgraph of possible protein complexes in the PPINs. However, in terms of the actual performance of these algorithms, the graphs with high density or high modularity in PPINs generally correspond to protein complexes [29,35]. In fact, a dense graph could have low modularity, and a graph with high modularity may have low density. Therefore, the density-based algorithms ignore protein complexes with low density and the modularity-based algorithms miss protein complexes with low modularity. Overall, these methods have limitations when identifying protein complexes with various densities and modularities [46]. To overcome these limitations, we define a new protein complex model to detect protein complexes by considering both density and modularity in the PPINs. We begin by presenting some related definitions.
According to the preliminaries section, for an undirected weighted subgraph SG, its density is denoted as D_SG: $D_{SG} = \frac{2 \sum_{u,v \in SG} w_{u,v}}{|SG| \cdot (|SG| - 1)}$, where $\sum_{u,v \in SG} w_{u,v}$ is the sum of the weights of the edges contained in subgraph SG and |SG| represents the size of the subgraph SG. The density of a graph measures how close the graph is to a clique, and the density takes values between 0.0 and 1.0. For the subgraph SG ⊆ G, its weighted in-degree, denoted as $d_w^{in}(SG)$, is the sum of the weights of all edges belonging to SG, and its weighted out-degree, denoted as $d_w^{out}(SG)$, is the sum of the weights of the edges connecting the nodes in SG to the nodes in the rest of graph G. $d_w^{in}(SG)$ and $d_w^{out}(SG)$ can be obtained as follows [46]: $d_w^{in}(SG) = \sum_{u,v \in SG} w_{u,v}$ and $d_w^{out}(SG) = \sum_{u \in SG, v \notin SG} w_{u,v}$. Clearly, the weighted degree d_w(SG) is equal to the sum of $d_w^{in}(SG)$ and $d_w^{out}(SG)$. The modularity M_SG of a subgraph SG ⊆ G is defined as follows: $M_{SG} = \frac{d_w^{in}(SG)}{d_w^{in}(SG) + d_w^{out}(SG)}$. Obviously, M_SG takes values from 0.0 to 1.0. If a subgraph has higher modularity, it has more connections within itself and fewer connections to the rest of the PPIN. A subgraph with a modularity of 1.0 has no connections with the rest of the PPIN. In this model, in the process of identifying protein complexes, we measure the quality of SG by considering its density (D_SG) and modularity (M_SG). D_SG describes the density of subgraph SG, M_SG describes the modularity of subgraph SG, and $\sqrt{D_{SG} \cdot M_{SG}}$ describes a subgraph with both high density and high modularity. Here, to make the value range of the combined term the same as that of the density and the modularity, i.e., [0.0, 1.0], the product D_SG * M_SG is normalized by taking the geometric mean of D_SG and M_SG. The fitness of a subgraph SG in an undirected weighted graph G, denoted as F(SG), is defined in Eq. (17). Generally, as the subgraph SG expands, its modularity increases and its density decreases.
Thus, by expanding from a node, we can obtain a subgraph with the local maximum fitness score and output the result as a protein complex. Thus, this new model can be used for identifying protein complexes with different topology, including high density but low modularity, high modularity but low density, and high density and high modularity. Therefore, our model can identify the protein complexes with various densities and modularities.
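The two quantities behind the model can be illustrated with a short Python sketch. Several points are assumptions made for illustration only: the nested-dictionary graph `adj[v][u] = w`, the function names, counting each internal edge once in the weighted in-degree, and taking the modularity as the share of the in-degree in the total weighted degree. The fitness function of Eq. (17) is not reproduced here; only the combined term sqrt(D_SG * M_SG) mentioned in the text is computed.

```python
import math


def internal_weight(adj, sg):
    """Total weight of edges with both endpoints in sg (each edge counted once)."""
    # every internal edge is seen from both of its endpoints, hence the / 2
    return sum(w for v in sg for u, w in adj[v].items() if u in sg) / 2.0


def boundary_weight(adj, sg):
    """Total weight of edges linking sg to the rest of the graph."""
    return sum(w for v in sg for u, w in adj[v].items() if u not in sg)


def density(adj, sg):
    """Weighted density in [0, 1]; equals 1.0 for a clique with unit weights."""
    n = len(sg)
    if n < 2:
        return 0.0
    return 2.0 * internal_weight(adj, sg) / (n * (n - 1))


def modularity(adj, sg):
    """Share of the weighted degree of sg that stays inside sg (assumed form)."""
    d_in = internal_weight(adj, sg)
    d_out = boundary_weight(adj, sg)
    return d_in / (d_in + d_out) if d_in + d_out > 0 else 0.0


adj = {
    "a": {"b": 1.0, "c": 1.0, "d": 0.5},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0},
    "d": {"a": 0.5},
}
for sg in ({"a", "b", "c"}, {"a", "b", "c", "d"}):
    d, m = density(adj, sg), modularity(adj, sg)
    print(sorted(sg), round(d, 3), round(m, 3), round(math.sqrt(d * m), 3))
```

On this toy graph, adding the weakly attached protein d raises the modularity to 1.0 but lowers the density, which is the trade-off described in the text.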

Extending and correcting the cluster to generate a locally optimal subgraph

Determining the priority of boundary nodes
An initial cluster (SG) starts as a single protein and then grows and shrinks gradually as proteins are added and removed one by one. Proteins are added from the neighbors of SG, denoted as Neighbor(SG), and proteins are removed from the inner nodes of SG, denoted as inner_nodes(SG). We first define these two concepts: if p ∈ Neighbor(SG), the neighbor node p shares at least one edge with a protein of cluster SG but does not belong to SG; if p ∈ inner_nodes(SG), the inner node p belongs to SG but is connected to at least one node which is a neighbor of SG. A key problem is to decide the priority with which proteins are added to and removed from SG. In general, if a protein v belongs to SG, it should have a strong connection with its cluster SG = (V_SG, E_SG). Therefore, if such a protein v is added to SG, it could increase the average of the weighted interactions within SG. By contrast, if a weakly connected protein v is removed from SG, this could also increase the average of the weighted interactions within SG. Here, we introduce a measure to assess the priority, denoted as weight_avg(SG), which is defined as: $weight_{avg}(SG) = \frac{\sum_{(v,u) \in E_{SG}} weight(v, u)}{|V_{SG}|}$ (18), where weight_avg(SG) is the average of the weighted interactions of all proteins within SG, |V_SG| is the number of proteins in SG and $\sum_{(v,u) \in E_{SG}} weight(v, u)$ represents the total weight of the interactions in SG. The priority of adding a node p to the cluster SG, where p ∈ Neighbor(SG), or deleting a node p from the cluster SG, where p ∈ inner_nodes(SG), is determined by the value of weight_avg(SG). We choose the boundary node with the highest weight_avg(SG) to add to SG or remove from SG to maximize the value of F(SG) (see Eq. (17)).
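The priority rule can be sketched as follows. This is an illustrative sketch, not the authors' code: the graph is a nested dictionary `adj[v][u] = w`, the function names are hypothetical, weight_avg is taken as the total internal edge weight divided by the number of proteins, and ties are broken alphabetically for reproducibility (the text does not specify tie-breaking).

```python
def weight_avg(adj, sg):
    """Average weighted interaction within sg: total internal weight / |V_SG|."""
    # every internal edge is seen from both of its endpoints, hence the / 2
    total = sum(w for v in sg for u, w in adj[v].items() if u in sg) / 2.0
    return total / len(sg)


def neighbors_of(adj, sg):
    """Nodes outside sg that share at least one edge with a member of sg."""
    return {u for v in sg for u in adj[v]} - set(sg)


def best_neighbor(adj, sg):
    """Neighbor p whose addition gives the highest weight_avg(SG + {p})."""
    candidates = sorted(neighbors_of(adj, sg))  # alphabetical tie-breaking (assumption)
    if not candidates:
        return None
    return max(candidates, key=lambda p: weight_avg(adj, set(sg) | {p}))


adj = {
    "a": {"b": 1.0, "c": 1.0, "d": 0.5},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0},
    "d": {"a": 0.5},
}
print(best_neighbor(adj, {"a"}))
```

Starting from the single protein a, the strongly connected neighbors b and c are preferred over the weakly connected d.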

Extending and correcting estimation
For a cluster SG, in the extending step, we first obtain all its neighbors, namely Neighbor(SG). The priority of all neighbors is determined by the value of weight_avg(SG) (see Eq. (18)). Whether the highest priority protein v is added to SG is determined by two tests: whether the fitness F(SG) of SG increases after v is added, and whether the actual number of edges between v and SG, denoted as |SG ∩ N(v)|, which is the number of proteins in SG connected with v, is greater than the expected number of edges, denoted as F(SG) * |SG|, where F(SG) is the fitness of SG and |SG| is the number of proteins in SG. Once the highest priority protein v is added to SG, SG is updated and v is removed from Neighbor(SG). Then the priorities of Neighbor(SG) and the fitness F(SG) of SG are recalculated, the next highest priority protein is tested, and so on. If the highest priority protein v fails either of the two tests, then SG cannot be further extended.
For a cluster SG, in the correcting step, we first obtain all inner nodes, namely inner_nodes(SG). The priority of all proteins in inner_nodes(SG) is determined by the value of weight_avg(SG) (see Eq. (18)). Whether the highest priority protein v is deleted from SG is determined by two tests: whether the fitness of the cluster SG − {v} is higher than that of SG, and whether the actual number of edges between v and SG − {v}, denoted as |(SG − {v}) ∩ N(v)|, which is the number of proteins in SG − {v} connected with v, is greater than the expected number of edges, denoted as F(SG) * |SG|, where F(SG) is the fitness of SG and |SG| is the number of proteins in SG. Once the highest priority protein v is removed from SG, the cluster SG is updated and v is removed from inner_nodes(SG). Then the priorities of inner_nodes(SG) and the fitness of the cluster SG − {v} are recalculated, the next highest priority protein is tested, and so on. If the highest priority protein v fails either of the two tests, then the cluster SG cannot be further corrected.
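The extending step can be sketched as a greedy loop. The sketch below rests on several assumptions: the graph is a nested dictionary `adj[v][u] = w`, all function names are hypothetical, and the fitness is passed in as a callable because Eq. (17) is not reproduced here. The demo uses `placeholder_fitness`, which is only the combined term sqrt(D_SG * M_SG) and should not be read as the fitness function of SE-DMTG. The correcting step would follow the same pattern with inner nodes and SG − {v}.

```python
import math


def internal_weight(adj, sg):
    """Total weight of edges inside sg (each edge counted once)."""
    return sum(w for v in sg for u, w in adj[v].items() if u in sg) / 2.0


def placeholder_fitness(adj, sg):
    """Stand-in for Eq. (17): sqrt(density * modularity). An assumption."""
    n = len(sg)
    if n < 2:
        return 0.0
    d_in = internal_weight(adj, sg)
    d_out = sum(w for v in sg for u, w in adj[v].items() if u not in sg)
    dens = 2.0 * d_in / (n * (n - 1))
    mod = d_in / (d_in + d_out) if d_in + d_out > 0 else 0.0
    return math.sqrt(dens * mod)


def extend_cluster(adj, seed, fitness):
    """Greedily add the highest priority neighbor while both tests pass."""
    sg = {seed}
    while True:
        candidates = sorted({u for v in sg for u in adj[v]} - sg)
        if not candidates:
            break
        # priority: weight_avg of the cluster after adding the candidate
        best = max(
            candidates,
            key=lambda p: internal_weight(adj, sg | {p}) / (len(sg) + 1),
        )
        current = fitness(adj, sg)
        actual_edges = len(sg & set(adj[best]))  # |SG ∩ N(v)|
        expected_edges = current * len(sg)       # F(SG) * |SG|
        if fitness(adj, sg | {best}) > current and actual_edges > expected_edges:
            sg.add(best)
        else:
            break  # the best candidate failed a test: stop extending
    return sg


adj = {
    "a": {"b": 1.0, "c": 1.0, "d": 0.5},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0},
    "d": {"a": 0.5},
}
print(sorted(extend_cluster(adj, "b", placeholder_fitness)))
```

With the placeholder fitness, growth from seed b stops at the triangle, because adding d would lower the fitness of the cluster.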

Obtaining a list of identified protein complexes.
On the basis of the quantitative description of protein complexes, we develop a novel clustering algorithm based on density and modularity with network topology and GO annotations, named SE-DMTG, to identify protein complexes in a weighted PPIN whose edge weights reflect the reliability of the edge in a protein complex according to topological and biological information.
The input of the SE-DMTG algorithm is a PPIN, which is described as a simple undirected graph G(V, E) with GO annotations. The SE-DMTG algorithm broadly consists of four phases. First, SE-DMTG constructs a weighted PPIN based on topological and biological information in lines 2-11. Second, SE-DMTG calculates the scores of all nodes and selects the node with the maximum score as the seed in lines 12-18. Third, starting from the seed node, a greedy procedure is used for adding nodes to or removing nodes from the cluster SG to obtain a subgraph with high graph fitness. The growth process is repeated from different seeds to form multiple, possibly overlapping subgraphs in lines 19-49. Once a new cluster is completed, all nodes in this cluster SG are recorded to prevent them from being used as seed nodes. Then, we select the next seed node from those remaining in the queue SQ to generate the next cluster SG in lines 41-45. Moreover, we discard candidate complexes whose size is less than 3 [35] and remove unreliable candidate complexes in lines 38-46. Finally, we discard redundant protein complexes in lines 50-55. A detailed description of the SE-DMTG algorithm is shown in Algorithm 1.
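The outer loop over the seed queue can be sketched as follows. This is a simplified illustration under stated assumptions: `detect_complexes` and its parameters are hypothetical names, the cluster growth itself is abstracted into a `grow` callable, and the removal of unreliable and redundant candidate complexes is deliberately left out.

```python
def detect_complexes(seed_queue, grow, min_size=3):
    """Grow one cluster per unused seed and keep clusters of size >= min_size.

    seed_queue: proteins sorted by decreasing seed score.
    grow: callable mapping a seed to its cluster (a set of proteins).
    """
    covered = set()    # proteins already placed in some cluster
    complexes = []
    for seed in seed_queue:
        if seed in covered:
            continue   # members of earlier clusters are not reused as seeds
        cluster = set(grow(seed))
        covered |= cluster | {seed}
        if len(cluster) >= min_size:  # candidates smaller than 3 are discarded
            complexes.append(cluster)
    return complexes


def toy_grow(seed):
    """Toy growth rule for the demo: a, b and c always form one cluster."""
    return {"a", "b", "c"} if seed in ("a", "b", "c") else {seed}


print(detect_complexes(["b", "c", "a", "d"], toy_grow))
```

Because clusters are grown independently from each unused seed, they may still overlap, which is why a separate redundancy filter follows in the algorithm.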
In the first step, we assign a weight to each edge based on common neighbor and gene ontology data (lines 2∼11).
In the second step, SE-DMTG calculates the score of each node (lines 12∼17). Furthermore, all the nodes in network G are queued into SQ in non-increasing order of Score(v) (line 18).
In the third step, we choose the node with the highest Score(v) that has not yet been visited as the seed (lines 19∼29). The key idea of this step is that any neighbor of the current subgraph SG that makes a positive contribution to F(SG) is added to SG, and any inner node whose removal makes a positive contribution to F(SG) is removed from SG (line 37). The iterative generation of a complex is described in Algorithm 2. Algorithm 2 has two subphases, in which we gradually add neighbors to cluster SG or remove inner nodes from cluster SG. The priority of candidate nodes is based on Eq. (18) and two conditions. More details are given in the section on extending and correcting the cluster to generate a locally optimal subgraph. Next, the step-by-step procedure of step 3 is given in Algorithm 2.
In the first phase, in lines 3∼25, after obtaining a seed protein, we first get the external boundary protein set that consists of the neighbors of SG, called Neighbor(SG), in lines 4∼5. Then, we calculate the graph fitness of SG at line 8. Next, we find the neighbor protein p in Neighbor(SG) with the highest priority, that is, the one that maximizes the value of weight_avg(SG + {p}), in lines 7∼14. Furthermore, we calculate the fitness of graph SG + {p} in line 15, and Expectation_edges is calculated as the graph fitness of SG × the size of SG in line 16. Meanwhile, we also calculate the value of Actually_edges, which is the size of the intersection of Neighbor(node_max) and SG, denoted as Neighbor(node_max) ∩ SG, in line 18. If adding the highest priority node node_max increases the value of F(SG) and Actually_edges is larger than Expectation_edges, then we add node_max to SG and remove it from Neighbor(SG) in lines 19∼24. We continue to check the next highest priority node in Neighbor(SG) and judge whether it can be added to SG in lines 6∼25. Otherwise, the iterative addition of the neighbors of SG is terminated when one of the two conditions is not satisfied in line 19 or when no remaining neighbor node can be added to SG in line 6.
Algorithm 1 (partial listing)
5: Create a set NS_v and include all the neighbors of v;
6: for each protein u ∈ NS_v and u is after v do
7:     Calculate the CN(v, u) by Eq. (7);
8:     Calculate the GO(v, u) by Eq. (8);
9:     Calculate the Weight_similarity[[v, u]] = w(v, u) according to Eq. (9); /* Calculating the weight of each edge. */
10: end for
11: end for
12: Step 2: Construct a seed queue SQ and select the initial cluster.
13: Initialize SQ = φ; /* Saving and recording the order of seed node. */ Seed_score = {}. /* Saving the score of each seed. */
14: for each protein v in V do
15:     Calculate the score of protein v by Eq. (12), written as Score(v);
16:     Seed_score[v] = Score(v);
17: end for
18: Sort all proteins to queue SQ = {s_1, s_2, ..., s_n} in descending order by their Score(v);
19: Step 3: Generate detected protein complexes.
20: Initialize F_average(NG) = 0.0, and count = 0; /* in order to compute the average fitness of all proteins' neighborhood graphs. */
21: for each protein v in SQ do
22:     Obtain a neighborhood graph which contains itself and its direct neighbors, denoted as NG(v);
23:     if NG(v) 2 then
24:         Calculate the fitness of NG(v) according to Eq. (17), written as F(NG(v));

In the second phase, SE-DMTG allows the removal of any inner nodes in cluster SG to maximize the value of F(SG) in lines 26∼57. We first find the inner nodes that have edges with nodes that are not in SG, denoted as Inner_node(SG), in lines 27∼34, and then we test whether each node in Inner_node(SG) can be removed from SG in lines 35∼57. We first find the highest priority node