Volume 11 Supplement 2
The 2009 International Conference on Bioinformatics & Computational Biology (BioComp 2009)
Identifying protein complexes from interaction networks based on clique percolation and distance restriction
- Jianxin Wang^{1, 2},
- Binbin Liu^{1},
- Min Li^{1} and
- Yi Pan^{1, 2}
DOI: 10.1186/1471-2164-11-S2-S10
© Wang et al; licensee BioMed Central Ltd. 2010
Published: 02 November 2010
Abstract
Background
Identifying protein complexes in large interaction networks is crucial for understanding the principles of cellular organization and for predicting protein functions, and it is one of the most important issues in the post-genomic era. In real protein-protein interaction networks, a protein may belong to multiple protein complexes. Identifying overlapping protein complexes from protein-protein interaction networks is therefore an important research topic.
Results
As an effective algorithm for identifying overlapping module structures, the clique percolation method (CPM) has been widely applied to social and biological networks. However, the recognition accuracy of CPM is low. Furthermore, CPM is poorly suited to identifying meso-scale protein complexes when it is applied to protein-protein interaction networks. In this paper, we propose a new topological model by extending the definition of the k-clique community used in CPM and introducing a distance restriction, and we develop a novel algorithm called CP-DR based on the new topological model for identifying protein complexes. In this new algorithm, the size of a protein complex is restricted by a distance constraint to overcome the shortcomings of CPM. CP-DR is applied to the protein interaction network of Saccharomyces cerevisiae and identifies many well-known complexes.
Conclusion
The proposed algorithm CP-DR, based on clique percolation and distance restriction, makes it possible to identify dense subgraphs in protein interaction networks, a large number of which correspond to known protein complexes. Compared with CPM, CP-DR achieves markedly higher sensitivity and f-measure.
Background
With the successful completion of the Human Genome Project, biomedical research has entered the post-genomic era. One of the most important challenges in this era is to systematically analyze and comprehensively understand how proteins accomplish life activities by interacting with each other [1]. Identifying protein complexes from large-scale protein-protein interaction networks plays an important role in predicting protein functions and understanding specific biological processes. In recent years, the development of large-scale interaction prediction techniques has produced a large amount of protein-protein interaction (PPI) data. Moreover, a large number of algorithms for detecting protein complexes from protein-protein interaction networks have emerged. According to whether they can identify overlapping protein complexes, these algorithms can be divided into two types: non-overlapping clustering algorithms and overlapping clustering algorithms.
The basic idea of non-overlapping clustering algorithms is that each protein belongs to one and only one protein complex in a large-scale protein-protein interaction network. King et al. proposed the Restricted Neighborhood Search Clustering (RNSC) algorithm, which searches for the best partition of a network by using a cost function [2]. Other typical non-overlapping clustering algorithms include the following: Hartuv and Shamir used minimum cuts to remove edges recursively and developed the divisive algorithm HCS for mining highly connected clusters in networks [3]; Girvan and Newman developed the divisive algorithm G-N based on edge betweenness [4]; and Newman proposed a fast agglomerative algorithm based on a greedy strategy [5]. More recently, several researchers have extended the G-N algorithm; for instance, Radicchi et al. gave a new self-contained algorithm [6] and Luo et al. developed the agglomerative algorithm MoNet [7]. In real protein-protein interaction networks, however, protein complexes usually overlap; that is, some proteins may belong to more than one complex simultaneously [8]. Therefore, research on algorithms for mining overlapping protein complexes is of greater significance [9].
In recent years, a variety of algorithms that extend the G-N algorithm have been employed to analyze the overlapping structures of large-scale complex networks, including protein-protein interaction networks. Representative algorithms are the Cluster-Overlap Newman Girvan Algorithm (CONGA) [10], the betweenness-based decomposition method (BCe) [11] and the fuzzy clustering algorithm [12]. Gregory [10] discussed the edge betweenness centrality measure and developed CONGA, which decomposes a network into an arbitrary number of overlapping structures, to discover overlapping communities in networks. By computing the split betweenness, CONGA determines when to split vertices, which vertex to split and how to split it. Because its time complexity, O(m^{3}), grows rapidly with the number of edges m in the network, CONGA is considerably inefficient. The algorithm BCe, which is based on edge betweenness and vertex betweenness, obtains overlapping structures by choosing similarity threshold values between pairs of vertices. In the fuzzy clustering algorithm [12], Zhang et al. analyze overlapping structures based on an external input parameter that indicates an upper bound on the number of communities; however, it is very difficult to choose this parameter so that it matches the number of complexes in a real protein-protein interaction network. Furthermore, according to the centrality-lethality rule that generally holds in protein-protein interaction networks, Li et al. [14] developed a graph splitting and reduction model and, based on it, proposed an algorithm called OMFinder for identifying overlapping functional modules. In OMFinder, the proteins are divided into high-degree and low-degree nodes, with the constraint that only high-degree nodes can belong to multiple protein complexes.
Compared with other approaches for detecting overlapping complexes, this algorithm can effectively discover many significant overlapping protein complexes with various topologies at a low discard rate. However, it treats the subgraphs obtained by decomposing the protein-protein interaction network, namely those containing high-degree proteins and those that include low-degree proteins, only in a simple manner.
Motivated by research on the overlapping structure of large-scale complex networks, Palla et al. recently developed a powerful algorithm based on clique percolation, named the Clique Percolation Method (CPM), for finding protein complexes and exploring the general characteristics of complex networks in biology [15]. In addition, CFinder, a software tool for detecting overlapping clusters, has been developed based on CPM [8]. As an effective algorithm for identifying overlapping module structures, CPM has been widely applied to social and biological networks [9, 16]; however, its recognition accuracy is low, and it is poorly suited to identifying meso-scale protein complexes in protein-protein interaction networks. The results of CPM depend strongly on the value of the clique percolation parameter k: smaller values of k yield excessively large dense subgraphs. To overcome these shortcomings, this paper proposes an algorithm called CP-DR (Clique Percolation Method based on Distance Restriction) for identifying protein complexes based on clique percolation and distance restriction. In this algorithm, the size of a protein complex is restricted by a distance constraint. The experimental results show that CP-DR detects a large number of protein complexes with specific biological significance and biological functions more effectively, more precisely and more comprehensively than CPM.
The Proposed Algorithm
Algorithm CPM
The result of CPM depends closely on the value of the clique percolation parameter k. Generally speaking, the larger the value of k, the smaller and denser the resulting k-clique communities, and the vertices inside each k-clique community are relatively densely linked. Although CPM is effective for analyzing the overlapping modular structure of social and biological networks, its drawback is that the number of protein complexes it identifies is limited, especially when a relatively large value of k is chosen. Conversely, small values of k usually produce large k-clique communities. We applied CPM to the yeast protein-protein interaction network and detected protein complexes that may overlap each other. With k=4, that is, taking the 4-clique as the basic topological unit, the largest identified protein complex contains 348 vertices and 2499 interactions. With k=3, an excessively large protein complex with 865 vertices and 4508 interactions is detected. As these examples show, the scale of a k-clique community can be far greater than that of a single k-clique or a sparse chain of k-cliques.
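The percolation rule described above can be illustrated with a minimal Python sketch. It is not the CFinder implementation: it enumerates k-cliques by brute force, links two k-cliques when they share k-1 vertices, and returns the union of each linked group as a community. The toy graph and the helper names (`make_adj`, `k_clique_communities`) are assumptions made for illustration only.

```python
from itertools import combinations


def make_adj(edges):
    """Build an adjacency map {vertex: set of neighbours} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def k_clique_communities(adj, k):
    """Return the k-clique communities of a small undirected graph.

    Brute force over all k-subsets of vertices, so only suitable for toy graphs.
    """
    # Every k-subset whose vertices are pairwise adjacent is a k-clique.
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(v in adj[u] for u, v in combinations(c, 2))]

    # Union-find over cliques: two k-cliques are adjacent if they share k-1 vertices.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # A community is the union of all cliques in one linked group.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())


# Toy graph: triangles abc and bcd share two vertices; triangle def shares only d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
adj = make_adj(edges)
communities_k3 = k_clique_communities(adj, 3)
communities_k4 = k_clique_communities(adj, 4)
```

On this toy graph, k=3 gives two communities that overlap in vertex d, while k=4 gives none because the graph contains no 4-clique, which mirrors the sensitivity to k discussed above.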
Algorithm CP-DR
In CPM, each k-clique community is considered a protein complex because of its dense internal links and sparse links to the rest of the protein-protein interaction network. In view of the relevant biological characteristics and topological properties, CPM should be improved so that it identifies meso-scale protein complexes with higher accuracy, more effectively and more comprehensively. To this end, we propose a new topological model by extending the definition of the k-clique community of CPM and introducing a distance restriction. The scale of the clusters identified by our approach is thus restricted by a distance constraint, and protein complexes are regarded as clusters that satisfy the distance limit.
Our new topological structure for identified clusters is based on the observation that a typical member of a cluster is linked to many other members, but not necessarily to all other vertices in the cluster. In other words, an identified cluster can be interpreted as a union of small complete (fully connected) subgraphs that share vertices. We define an identified cluster as the union of all maximal cliques that satisfy the distance restriction and that can be reached from each other through a series of adjacent maximal cliques (where two maximal cliques are said to be adjacent if they share N vertices). In this definition, the distance is represented by the diameter of the identified cluster (i.e., the largest length of a shortest path between any pair of vertices in the union of the maximal cliques).
In the following discussion, we denote by U and V basic cluster units, by C_{c(U, V)} the number of common vertices between U and V, and by C_{l(U, V)} the largest length of a shortest path between any pair of vertices in the union of U and V. Because our new topological model is based on the two conditions on C_{c(U, V)} and C_{l(U, V)}, our discussion focuses mainly on them.
Condition 1
In the definition of our new topological model, an identified cluster can be seen as the union of all maximal cliques that can be reached from each other through a series of adjacent maximal cliques (where two maximal cliques are said to be adjacent if they share N vertices). This condition can be expressed by the following formula:
C_{c(U, V)} ≥ N (1)
where N represents the number of common vertices required between basic cluster units U and V.
Condition 2
In our new topological model, an identified cluster should also satisfy the distance restriction. As mentioned above, the distance is represented by the diameter of the identified cluster. In line with the small-world property of protein interaction networks [27, 28], this condition can be defined by the following formula:
C_{l(U, V)} ≤ d (2)
where d represents the upper bound on the diameter of the identified cluster.
As described in the previous subsection, two k-cliques are merged in CPM if and only if they share k-1 vertices. Consequently, the larger the value of the clique percolation parameter k, the harder it is for k-cliques to merge, and the vast majority of predicted clusters are then single k-cliques. Our experimental analysis shows that the clusters predicted by this extension rule with a large value of k are incomplete. Conversely, the smaller the value of k, the more easily k-cliques merge and the larger the predicted clusters become. This is why CPM yields an excessively large cluster with 865 vertices and 4508 interactions when k=3. Generally speaking, the clusters predicted by this extension rule with a small value of k are excessively large and of low accuracy. To overcome this shortcoming of CPM, we define a new extension rule: two maximal cliques are merged if they share N vertices, where the number of common vertices N is defined by the following formula:
N = MIN(|U|,|V|) – 1 (3)
where |U| and |V| are the sizes of basic cluster units U and V.
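Formulas (1) and (3) together amount to a simple set test. The sketch below, in Python, represents basic cluster units as sets of protein identifiers; the function names `required_overlap` and `satisfies_condition_1` are hypothetical and are not taken from the authors' C++ implementation.

```python
def required_overlap(U, V):
    """N from formula (3): one less than the size of the smaller unit."""
    return min(len(U), len(V)) - 1


def satisfies_condition_1(U, V):
    """Formula (1): distinct units U and V share at least N vertices."""
    return U != V and len(U & V) >= required_overlap(U, V)


# A 4-clique and a triangle sharing two vertices: N = min(4, 3) - 1 = 2.
U = {"p1", "p2", "p3", "p4"}
V = {"p3", "p4", "p5"}
W = {"p4", "p6", "p7"}  # shares only one vertex with U

can_merge_UV = satisfies_condition_1(U, V)
can_merge_UW = satisfies_condition_1(U, W)
```

Unlike the fixed k-1 rule of CPM, the threshold here adapts to the smaller of the two units, so cliques of different sizes can be compared directly.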
In [18], Li et al. analyzed the topology of complexes in the protein interaction network of Saccharomyces cerevisiae. Of the 216 gold standard protein complexes, 118 are connected (a protein complex is connected if there is a path between every pair of vertices in the complex). They found that 94.91% of the connected complexes and 82.66% of the non-connected complexes have a diameter bounded by 2. Furthermore, 99.15% of the connected complexes and 93.88% of the non-connected complexes have an average shortest path length bounded by 2. This matches the observation that protein interaction networks have the small-world property [27, 28]. These statistics show that the length of the shortest path between each pair of vertices in most of the complexes is bounded by 2. We therefore believe that the diameter and the average shortest path length are important topological characteristics for detecting protein complexes. Accordingly, in our topological model we choose d=2 as the distance restriction; that is, the shortest-path distance between any pair of vertices in an identified cluster must be no more than 2.
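The distance restriction can be checked with a breadth-first search from every vertex of the candidate cluster. The Python sketch below assumes that shortest paths are measured inside the subgraph induced by the union of U and V (one reading of "in the union" above) and that a disconnected union fails the test; `induced_diameter` and `satisfies_condition_2` are hypothetical names.

```python
from collections import deque


def induced_diameter(adj, vertices):
    """Largest shortest-path length inside the subgraph induced by `vertices`.

    Returns float('inf') if the induced subgraph is disconnected.
    """
    vertices = set(vertices)
    worst = 0
    for source in vertices:
        # Breadth-first search restricted to the chosen vertices.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u] & vertices:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if len(dist) < len(vertices):
            return float("inf")
        worst = max(worst, max(dist.values()))
    return worst


def satisfies_condition_2(adj, U, V, d=2):
    """Formula (2): the union of U and V has diameter at most d."""
    return induced_diameter(adj, set(U) | set(V)) <= d


# Path graph 1-2-3-4 as a toy example.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
ok_short = satisfies_condition_2(path, {1, 2}, {2, 3})   # diameter 2
ok_long = satisfies_condition_2(path, {1, 2}, {3, 4})    # diameter 3
```

With d=2 the first pair passes and the second fails, since vertices 1 and 4 are three steps apart.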
As shown in Fig. 3, CP-DR contains five major steps. The input to CP-DR is the protein-protein interaction information. First, an undirected simple graph G(V, E) with proteins as vertices and protein interactions as edges is created from this information. Second, we search for all maximal cliques of size no less than 3 in G. Third, each maximal clique is initialized as a basic cluster unit. Fourth, for every pair of distinct basic cluster units (U, V) in the collection, if U and V share N common vertices and the union of U and V satisfies the distance restriction (diameter no more than 2), the union of U and V is saved; a basic cluster unit is deleted if and only if it has been merged and all of its comparisons have been completed. Finally, all identified clusters in S are exported.
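The five steps can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' C++ code, and it rests on several assumptions: maximal cliques are enumerated with Bron-Kerbosch with pivoting (the text does not name a method), the merging step is a single pass over all pairs of basic cluster units, unions are not compared again, and the diameter is measured in the induced subgraph. All function names are hypothetical. Because two cliques that share at least one vertex always have a union of diameter at most 2, the distance test never rejects a pair in this single-pass reading; it is kept as an explicit function so that it could be reused if unions were compared again in later rounds, which the sketch does not attempt.

```python
from collections import deque
from itertools import combinations


def build_graph(interactions):
    """Step 1: undirected simple graph as {protein: set of neighbours}."""
    adj = {}
    for u, v in interactions:
        if u == v:
            continue  # ignore self-interactions
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj, min_size=3):
    """Step 2: Bron-Kerbosch with pivoting, keeping cliques of size >= min_size."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            if len(r) >= min_size:
                found.append(frozenset(r))
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def diameter(adj, nodes):
    """Largest shortest-path length in the subgraph induced by `nodes`."""
    worst = 0
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u] & nodes:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if len(dist) < len(nodes):
            return float("inf")
        worst = max(worst, max(dist.values()))
    return worst


def cp_dr(interactions, d=2):
    """Steps 1-5 under the single-pass assumption stated in the lead-in."""
    adj = build_graph(interactions)
    units = maximal_cliques(adj)            # step 3: basic cluster units
    clusters, merged = set(), set()
    for U, V in combinations(units, 2):     # step 4: pairwise comparison
        n_required = min(len(U), len(V)) - 1
        union = U | V
        if len(U & V) >= n_required and diameter(adj, union) <= d:
            clusters.add(union)
            merged.update((U, V))
    # Units that never merged are kept as clusters of their own.
    clusters.update(u for u in units if u not in merged)
    return sorted(sorted(c) for c in clusters)  # step 5: export


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f"), ("f", "g")]
clusters = cp_dr(edges)
```

On the toy graph the triangles abc and bcd merge into one cluster, the triangle def stays separate but overlaps it in vertex d, and vertex g is left out because it belongs to no clique of size 3 or more.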
In step 1 of CP-DR, transforming the protein-protein interaction information into an undirected simple graph takes O(m) time. Enumerating all maximal cliques of size no less than 3 in step 2 is an NP-hard problem for which no polynomial-time algorithm is known; it has an upper bound of O(nmu). In step 3, initializing each maximal clique as a basic cluster unit takes O(u) time. The core step of CP-DR has an upper bound of O(u^{2}s^{3}). Finally, exporting all identified clusters takes O(u) time. This implies an upper bound of O(nmu+u^{2}s^{3}) on the time complexity of CP-DR, where n is the number of nodes, m is the number of edges, u is the number of maximal cliques of size no less than 3 in the graph and s is the size of the largest maximal clique.
Results and Discussions
To evaluate the suitability and validity of the proposed algorithm for identifying overlapping protein complexes in protein-protein interaction networks, we implemented CP-DR in C++ and downloaded the overlapping protein complex identification tool CFinder from http://angel.elte.hu/clustering/. The protein interaction network of Saccharomyces cerevisiae was downloaded from the MIPS (Munich Information Center for Protein Sequences) database. We removed all self-connecting interactions and repeated interactions. The final network includes 4546 yeast proteins and 12319 interactions. The average clustering coefficient of the final network is 0.4, the network diameter is 13, and the average vertex distance is 4.42. We applied the proposed algorithm CP-DR to this network.
In the following subsections, we will compare the predicted clusters with the known complexes, analyze the sensitivity, specificity and f-measure of the algorithm CP-DR, calculate the overlapping rate of the predicted clusters, and evaluate the significance of the predicted clusters. We will also compare the algorithm CP-DR to the clique percolation method CPM in their performance on these measures.
Comparison with the known complexes
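The overlapping score OS(Pc, Kc) between a predicted overlapping structure Pc and a known complex Kc is calculated by the following formula:
OS(Pc, Kc) = i^{2} / (|V_{Pc}| × |V_{Kc}|)
where i is the size of the intersection set of the predicted overlapping structure and the known complex, |V_{Pc}| is the size of the predicted overlapping structure and |V_{Kc}| is the size of the known complex.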
A known complex Kc that has no proteins in a predicted overlapping structure Pc has OS(Pc, Kc) = 0, and a known complex Kc that perfectly matches a predicted overlapping structure Pc has OS(Pc, Kc) = 1. A known complex and a predicted overlapping structure are considered a match if their overlapping score is equal to or larger than a specific threshold. Generally speaking, the more known complexes an algorithm matches, the stronger its identification ability.
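A short Python sketch of the overlapping score and the matching rule follows. It assumes the commonly used form OS(Pc, Kc) = i^2 / (|V_Pc| × |V_Kc|), which gives 0 for disjoint sets and 1 for identical sets as described above; the function names, the placeholder protein identifiers and the default threshold of 0.2 are illustrative choices.

```python
def overlap_score(predicted, known):
    """OS(Pc, Kc) = i^2 / (|V_Pc| * |V_Kc|), with i the number of shared proteins."""
    predicted, known = set(predicted), set(known)
    i = len(predicted & known)
    return i * i / (len(predicted) * len(known))


def matched_known(predicted_clusters, known_complexes, threshold=0.2):
    """Known complexes matched by at least one predicted cluster."""
    return [k for k in known_complexes
            if any(overlap_score(p, k) >= threshold for p in predicted_clusters)]


# Placeholder identifiers: a 13-protein prediction containing a 12-protein complex.
predicted = {f"p{n}" for n in range(1, 14)}
known = {f"p{n}" for n in range(1, 13)}
score = overlap_score(predicted, known)  # 144 / 156
```

A predicted cluster that contains a whole known complex plus one extra protein still scores high (about 0.923 here), whereas a large cluster that swallows a small complex is penalized through the |V_Pc| term.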
Table 1. Examples of protein complexes identified by algorithm CPM and algorithm CP-DR.
Sequence | Known complex | CPM (k=3) size | CPM (k=3) OS(Pc, Kc) | CP-DR size | CP-DR OS(Pc, Kc) |
---|---|---|---|---|---|
Complex1 | YDR226c YER165w YKR002w YMR061w YOL123w YGL044c | 13* | 0.089* | 6 | 0.833 |
Complex2 | YKR002w YMR061w YLR115w YAL013c YLR277c YNL317w YJR093c YPR107c YDR301w | 13* | 0.089* | 9 | 0.854 |
Complex3 | YPR041w YMR036c YBR079c YNL244c YOR361w YMR146c YPL105c YDR429c | 8 | 0.443 | 6 | 0.795 |
Complex4 | YFL088c YKR068c YLR268w YIL004c YML077w YDR407c YOR115c YMR218c YBR254c YDR472w YGR166w YDR246w | 18 | 0.150 | 13 | 0.923 |
* CPM (k=3) reports Complex1 and Complex2 as a single cluster, so one size and one score are shared by both rows.
As Table 1 shows, the protein complexes identified by CP-DR are smaller than those identified by CPM, while their best matches to known protein complexes are significantly better than those of CPM. The known protein complexes Complex1 and Complex2 are identified as a single protein complex by CPM, whereas CP-DR detects the two overlapping protein complexes separately, each with a high overlapping score.
Specificity and Sensitivity
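Sensitivity (Sn), specificity (Sp) and f-measure are defined as follows:
Sn = TP / (TP + FN)
Sp = TP / (TP + FP)
F-measure = 2 × Sn × Sp / (Sn + Sp)
where TP (true positive) is the number of predicted clusters matched by known complexes with OS(Pc, Kc) no less than the threshold, FN (false negative) is the number of known complexes not matched by any predicted cluster, and FP (false positive) equals the total number of predicted clusters minus TP. According to the assumption in [20], a predicted complex and a known complex are considered matched if OS(Pc, Kc) ≥ 0.2. Here, we also use 0.2 as the matching threshold.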
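The three measures are simple ratios. The Python sketch below assumes the usual definitions Sn = TP / (TP + FN), Sp = TP / (TP + FP) and F-measure = 2 × Sn × Sp / (Sn + Sp); the function names and the TP, FN and FP counts in the example are hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of true positives among true positives and false negatives."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Fraction of true positives among all positive predictions."""
    return tp / (tp + fp)


def f_measure(sn, sp):
    """Harmonic mean of sensitivity and specificity."""
    return 2 * sn * sp / (sn + sp)


# Hypothetical counts, for illustration only.
sn = sensitivity(tp=8, fn=2)   # 0.8
sp = specificity(tp=8, fp=8)   # 0.5
f = f_measure(sn, sp)

# Recomputing the f-measure from the Sn and Sp values reported for CP-DR.
f_cp_dr = f_measure(0.872787611, 0.391952310)
```

Recomputing the f-measure from the reported Sn and Sp values of CP-DR gives about 0.54097, which agrees with the reported f-measure and suggests that the harmonic-mean definition is the one used.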
Table 2. Comparison of algorithm CP-DR and algorithm CPM in sensitivity, specificity and f-measure.
Algorithm | Parameter | Sn (Sensitivity) | Sp (Specificity) | F-measure |
---|---|---|---|---|
CP-DR | | 0.872787611 | 0.391952310 | 0.540966747 |
CPM | k=3 | 0.213592233 | 0.247191011 | 0.229166667 |
CPM | k=4 | 0.155339806 | 0.524590164 | 0.239700375 |
CPM | k=5 | 0.092592593 | 0.722222222 | 0.164141415 |
As Table 2 shows, the sensitivity of CP-DR is greater than 0.87. This means that the number of detected complexes matched by known complexes with OS(Pc, Kc) ≥ 0.2 (TP) is significantly greater than the number of unidentified known complexes (FN) at the same threshold, since higher sensitivity corresponds to larger TP and smaller FN. It also indicates that the complexes detected by CP-DR are more reliable. The greatest sensitivity of CPM over the tested parameters is 0.213592233, only about a quarter of the sensitivity of CP-DR. Specificity is the fraction of true-positive predictions out of all positive predictions. The specificity of CP-DR is greater than 0.39, while the specificity of CPM is greater than 0.2 for all tested values of k. In particular, the specificity of CPM is significantly greater than that of CP-DR when the clique percolation parameter k ≥ 4. Although the specificity of CPM is then higher than that of CP-DR, the number of complexes identified by CPM is far smaller than the number of known complexes. In addition, because the MIPS reference set is incomplete, some predicted complexes that may be true complexes are regarded as false positives (FP) if they do not match any known complex. Nevertheless, it is still reasonable to consider a method more effective if it identifies more known complexes. Table 2 also shows that the f-measure of CP-DR is more than twice that of CPM. These results illustrate that CP-DR performs better in protein complex identification.
Overlapping Rate Analysis
Definition 1
Overlapping Rate: in an undirected graph G, the average number of times each vertex v occurs in all of the induced subgraphs.
where k_{v} is the number of occurrences of vertex v in all predicted complexes, N_{i} is the size of the i-th predicted protein complex, and S is the collection of all identified protein complexes.
By analyzing the protein complexes detected by CP-DR and CPM, we found that the vast majority of proteins belong to only one or two complexes. It is rare for a protein to be contained in three or more protein complexes.
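One way to read Definition 1 is sketched below in Python. The sketch assumes that the overlapping rate is the sum of k_v over all covered vertices (equivalently, the sum of the complex sizes N_i) divided by the number of distinct vertices that occur in at least one complex. This is an interpretation of the verbal definition, not necessarily the exact formula used in the paper, and the function names are hypothetical.

```python
from collections import Counter


def membership_counts(complexes):
    """k_v: number of predicted complexes that contain each vertex v."""
    return Counter(v for c in complexes for v in set(c))


def overlapping_rate(complexes):
    """Average k_v over vertices covered by at least one complex (assumed reading)."""
    counts = membership_counts(complexes)
    return sum(counts.values()) / len(counts)


def membership_histogram(complexes):
    """How many vertices belong to exactly 1, 2, 3, ... complexes."""
    return Counter(membership_counts(complexes).values())


toy = [{"a", "b", "c", "d"}, {"d", "e", "f"}]
rate = overlapping_rate(toy)        # (4 + 3) / 6
hist = membership_histogram(toy)    # five vertices in one complex, one in two
```

The histogram gives the per-protein view discussed in the text, that is, how many proteins belong to one, two, or more complexes.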
Table 3. Average overlapping rate of protein complexes identified by algorithm CP-DR and algorithm CPM.
Measure | CP-DR | CPM(k=3) | CPM(k=4) | CPM(k=5) |
---|---|---|---|---|
Overlapping Rate | 2.103 | 1.192 | 1.115 | 1.093 |
OR> | 33.613% | 13.843% | 10.526% | 9.685% |
Function Enrichment Analysis
The P-value of a predicted complex with respect to a functional group is calculated from the hypergeometric distribution:
P = 1 - \sum_{i=0}^{k-1} \binom{F}{i} \binom{N-F}{C-i} / \binom{N}{C}
where N is the total number of vertices in the network, C is the size of the predicted complex, F is the size of a functional group, and k is the number of proteins of the functional group in the predicted complex. The closer the P-value is to 0, the less likely it is that the protein complex possesses the function by chance, and the more likely it is that the complex has specific biological significance. Generally speaking, the main function of a protein complex corresponds to its minimum P-value. Therefore, the functions of uncharacterized proteins can be predicted by assigning to them the function with the minimum P-value of the identified complex that contains them. The functional classification of proteins used in this paper was collected from the MIPS Functional Catalog (FunCat) database. FunCat [26] is a tree-structured annotation scheme for the functional description of proteins.
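A minimal Python sketch of the enrichment test follows. It assumes the standard hypergeometric tail P = 1 - \sum_{i=0}^{k-1} \binom{F}{i} \binom{N-F}{C-i} / \binom{N}{C}, which is consistent with the symbols N, C, F and k defined above; the function name and the small numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(n_total, c_size, f_size, k):
    """Probability that a random set of c_size proteins (out of n_total)
    contains at least k members of a functional group of size f_size."""
    below = sum(comb(f_size, i) * comb(n_total - f_size, c_size - i)
                for i in range(k))
    return 1 - below / comb(n_total, c_size)


# Toy numbers: 10 proteins in the network, a complex of 3,
# a functional group of 4, all 3 complex members annotated with it.
p = enrichment_p_value(n_total=10, c_size=3, f_size=4, k=3)  # 4/120
```

For very small P-values the subtraction from 1 loses floating-point precision; summing the upper tail (i from k upward) directly is a numerically safer variant of the same quantity.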
A total of 1896 protein complexes predicted by CP-DR match known functional categories with a P-value less than 0.01, and 1823 match with a P-value less than 0.001. In contrast, only 128 protein complexes identified by CPM with the clique percolation parameter k=3 match with a P-value less than 0.01, and 121 match with a P-value less than 0.001. This function enrichment analysis shows that CP-DR is far more capable than CPM of identifying protein complexes with specific biological significance.
Complexes | ORF | Protein functional categories |
---|---|---|
Table 1 (13) | YLR268w | 20.09.07.03, 20.09.07.05, 20.09.07.27 |
Table 1 (13) | YKR068c | 20.09.07.03 |
Table 1 (13) | YML077w | 20.09.07.03 |
Table 1 (13) | YFL038c | 14.10, 20.09.07.03 |
Table 1 (13) | YGR166w | 01.05.25, 20.09.07.03 |
Table 1 (13) | YDR108w | 10.03.02, 20.09.07.03, 43.01.03.09 |
Table 1 (13) | YBR254c | 20.09.07.03 |
Table 1 (13) | YDR246w | 20.09.07.03 |
Table 1 (13) | YDR407c | 20.09.07.03 |
Table 1 (13) | YDR472w | 20.09.07.03 |
Table 1 (13) | YMR218c | 20.09.07.03 |
Table 1 (13) | YOR115c | 20.09.07.03 |
Table 1 (13) | YIL004c | 20.09.07.03, 20.09.07.27 |
Figure 8 (complex197) | YBL084c | 10.03.01.01.11, 14.07.05, 14.10, 14.13.01.01, 16.01, 16.19.03 |
Figure 8 (complex197) | YDL008w | 10.03.01.01.11, 14.07.05, 14.10, 14.13.01.01, 16.01, 16.19.03 |
Figure 8 (complex197) | YDR118w | 10.03.01.01.11, 14.07.05, 14.10, 14.13.01.01, 16.01, 16.19.03 |
Figure 8 (complex197) | YFR036w | 10.03.01.01.11, 14.07.05, 14.10, 14.13.01.01, 16.01, 16.19.03 |
Figure 8 (complex197) | YGL240w | 10.03.01.01.11, 14.07, 14.13.01.01, 16.01 |
Figure 8 (complex197) | YHR166c | 10.03.01.01.11, 14.07.05, 14.10, 14.13.01.01, 16.01, 16.19.03, 42.04 |
Figure 8 (complex197) | YKL022c | 10.03.01.01.11, 14.07.05, 14.10, 14.13.01.01, 16.01, 16.19.03 |
Figure 8 (complex197) | YLR127c | 10.03.01.01.11, 14.07.05, 14.10, 14.13.01.01, 16.01, 16.19.03 |
Figure 8 (complex197) | YNL172w | 10.03.01.01.11, 14.07.05, 14.10, 14.13.01.01, 16.01, 16.19.03 |
Figure 8 (complex197) | YOR249c | 10.01.09.05, 10.03.01.01.11, 14.07.05, 14.10, 14.13.01.01, 16.01, 16.19.03 |
Figure 8 (complex197) | YLR102c | 10.03.01.01.11, 14.07.05, 14.10, 14.13.01.01, 16.01, 16.19.03 |
Figure 8 (complex197) | YIR025w | 10.03.01.01.11, 14.07.05, 14.10, 14.13.01.01, 16.01, 16.19.03 |
Conclusions
It is believed that identifying protein complexes helps to explain certain biological processes and to predict the functions of proteins. In this paper, we extended the definition of the k-clique community of CPM, introduced a distance restriction, proposed a new topological model for protein complexes and developed an algorithm, CP-DR, to identify protein complexes in large protein interaction networks based on the proposed topological model. Interaction networks are represented by undirected simple graphs, and we generate predicted clusters in the networks by using clique percolation and distance restriction. CP-DR can generate overlapping protein complexes, which is consistent with the fact that many of the known protein complexes overlap. Interesting questions for further research include how many functions a protein can have, how many processes a protein can participate in, and how heavily two protein complexes should overlap with each other.
We applied CP-DR to the protein interaction network of Saccharomyces cerevisiae. Many well-known complexes were detected in the protein interaction network. We tested the sensitivity, specificity and f-measure of our algorithm. The results show that our algorithm is suitable and efficient for protein interaction networks. We also predicted functions for uncharacterized proteins and new functions for known proteins by minimizing the P-values of the predicted clusters. Our algorithm can thus be used to identify new protein complexes in protein interaction networks of various species and to provide references for biologists in their research on protein complexes.
Methods
The protein interaction network of Saccharomyces cerevisiae was downloaded from the MIPS (Munich Information Center for Protein Sequences) database. We removed all self-connecting interactions and repeated interactions. The final network includes 4546 yeast proteins and 12319 interactions. We also collected from the MIPS database the protein complexes annotated for Saccharomyces cerevisiae. We discarded those consisting of only one protein, and the remaining 216 manually annotated complexes are considered the gold standard data. The largest complex contains 81 proteins, the smallest complex contains 2 proteins, and the average size of all the complexes is 6.31. We downloaded the overlapping protein complex identification tool CFinder from http://angel.elte.hu/clustering/. The proposed algorithm CP-DR has been implemented in C++.
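The cleaning step (removing self-connecting and repeated interactions) can be sketched in Python as follows. The sketch assumes that interactions are given as pairs of ORF names with consistent spelling and that an interaction listed in both directions counts as a repeat; the function name and the sample pairs are illustrative and are not taken from the MIPS file format.

```python
def clean_interactions(pairs):
    """Drop self-interactions and duplicates, treating (a, b) and (b, a) as equal."""
    seen = set()
    cleaned = []
    for a, b in pairs:
        if a == b:
            continue  # self-connecting interaction
        key = (a, b) if a <= b else (b, a)  # canonical order for undirected edges
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


raw = [("YKR002w", "YMR061w"),
       ("YMR061w", "YKR002w"),   # same interaction, reversed
       ("YLR115w", "YLR115w"),   # self-interaction
       ("YKR002w", "YLR115w")]
edges = clean_interactions(raw)
proteins = {p for edge in edges for p in edge}
```

The resulting edge list defines an undirected simple graph, and the numbers of proteins and interactions can then be read from `len(proteins)` and `len(edges)`.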
Declarations
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61003124 and 61073036, the National Basic Research 973 Program of China No.2008CB317107, the Ph.D. Programs Foundation of Ministry of Education of China No. 20090162120073, the U.S. National Science Foundation under Grants CCF-0514750, CCF-0646102, and CNS-0831634, and the Program for Changjiang Scholars and Innovative Research Team in University No. IRT0661. Publication of this supplement was made possible with support from the International Society of Intelligent Biological Medicine (ISIBM).
This article has been published as part of BMC Genomics Volume 11 Supplement 2, 2010: Proceedings of the 2009 International Conference on Bioinformatics & Computational Biology (BioComp 2009). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S2.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.