 Research
 Open Access
 Published:
ClusterM: a scalable algorithm for computational prediction of conserved protein complexes across multiple protein interaction networks
BMC Genomics volume 21, Article number: 615 (2020)
Abstract
Background
The current computational methods on identifying conserved protein complexes across multiple ProteinProtein Interaction (PPI) networks suffer from the lack of explicit modeling of the desired topological properties within conserved protein complexes as well as their scalability.
Results
To overcome those issues, we propose a scalable algorithm—ClusterM—for identifying conserved protein complexes across multiple PPI networks through the integration of network topology and protein sequence similarity information. ClusterM overcomes the computational barrier that existed in previous methods, where the complexity escalates exponentially when handling an increasing number of PPI networks; and it is able to detect conserved protein complexes with both topological separability and cohesive protein sequence conservation. On two independent compendiums of PPI networks from Saccharomyces cerevisiae (Sce, yeast), Drosophila melanogaster (Dme, fruit fly), Caenorhabditis elegans (Cel, worm), and Homo sapiens (Hsa, human), we demonstrate that ClusterM outperforms other stateoftheart algorithms by a significant margin and is able to identify de novo conserved protein complexes across four species that are missed by existing algorithms.
Conclusions
ClusterM can better capture the desired topological property of a typical conserved protein complex, which is densely connected within the complex while being wellseparated from the rest of the networks. Furthermore, our experiments have shown that ClusterM is highly scalable and efficient when analyzing multiple PPI networks.
Background
Advanced highthroughput technologies for measuring protein interactions [1, 2] have provided researchers with rich information about proteinprotein interactions (PPI) in various species [3–5]. In order to translate such information into meaningful biological knowledge about the underlying cellular functions and evolutionary mechanisms, one arising computational challenge is how to integrate these PPI data with other available data—such as sequence data—to identify conserved protein complexes that have similar cellular functions across multiple species.
Intuitively, conserved protein complexes should have the following properties with respect to their network topology and sequence homology. Within each species, proteins within a protein complex is densely connected to each other while being loosely connected to and well separated from the rest of the PPI networks [6]. Across different species, conserved protein complexes should contain many orthologs with high sequence similarity. Therefore, it appears reasonable to expect that the problem of identifying such complexes could be effectively addressed by a comparative network analysis approach.
In recent years, efforts have been made to develop computational techniques for the comparison of PPI networks across different species through global network alignment [7–12]. Most of the existing global network alignment algorithms aim to identify the onetoone mapping with the maximum total similarity of aligned proteins across networks. However, this approach may not directly lead to accurate identification of conserved protein complexes, where an important focus is on grouping proteins that work together towards similar functionalities across species. Furthermore, proteins in conserved protein complexes often have manytomany orthologous relationships. Local network alignment [13–18] searches for conserved subnetworks across species, which better resembles the task of identifying conserved protein complexes. However, many of them focus on the topological criteria based on conserved edges or other network motifs with specific topological structures, often motivated by conjectured evolutionary or functional models [14, 18, 19]. To the best of our knowledge, none of the existing local alignment algorithms explicitly considers the characteristic topological properties of protein complexes, in which proteins within the complex highly interact with each other but rarely interact with the rest of the network. Therefore, directly applying existing local network alignment algorithms may not result in accurate detection of conserved protein complexes with the maximum coverage of the given PPI networks. Furthermore, both local and global network alignment problems essentially reduce to the (sub)graph isomorphism problem and the alignment results tend to be sensitive to topological errors in the PPI networks. This is certainly problematic, since currently available PPI networks may contain a significant number of false positive interactions while many true interactions are still missing. In comparison, detecting conserved protein complexes is generally more robust to similar topological errors, as we focus on interaction density within complexes and their topological separability (i.e., whether they are well separated from the rest of the PPI networks). In other words, the focus lies on groups of proteins that may potentially belong to the same complex rather than individual proteins, which makes the overall prediction less sensitive to errors in the PPI networks.
In addition to the lack of explicit modeling of the desired topological properties within conserved protein complexes, most of the existing network alignment and clustering algorithms [13–15, 17, 18] do not scale well with the number of species and the network size. For example, it is prohibitive for the network alignment algorithms in [13–15, 17, 18] to handle more than three PPI networks due to the exponential growth of the alignment graph with the number of species and the network size. Finally, to the best of our knowledge, currently, there is neither a gold standard for conserved protein complexes that contains protein complexes from multiple species nor commonly accepted metrics for assessing the performance of algorithms for predicting conserved protein complexes. As a consequence, it has been practically difficult to effectively evaluate the capability of such algorithms to unveil “true” conserved protein complexes, just based on the fact that the identified conserved protein complexes may overlap with wellknown protein complexes in another species [13, 14, 17, 18].
To fill these critical gaps in a conserved protein complex identification, we propose a scalable algorithm—ClusterM—that explicitly characterizes the desired topological separability of protein complexes and also incorporates sequence similarity of proteins across the given PPI networks. ClusterM consists of three major steps. The first step is to find a set of protein spines (sets of proteins, one from each network) across different PPI networks. In the second step, wellseparated subnetworks around proteins in each protein spine are identified. The final step is to look for conserved subnetworks within those identified wellseparated subnetworks, which have both cohesive proteinprotein interaction similarity and sequence similarity.
Last but not least, in order to evaluate and compare the performance of ClusterM and other existing algorithms, we have curated a new yeasthuman reference conserved protein complex dataset based on yeast and human goldstandard complexes and propose effective evaluation metrics based on the existing measures for assessing protein complex identification algorithms for individual PPI networks. Experimental results based on comparative analysis of yeast and human PPI networks show that ClusterM clearly outperforms the current stateoftheart algorithms.
Results
Identifying conserved complexes in yeast and human networks
We first evaluated ClusterM on the yeast and human PPI networks based on two constructed datasets: DIPYH and IntActYH. In the DIPYH dataset, yeast (Sce) and human (Hsa) PPI networks are obtained from Database of Interacting Proteins (DIP) version 20150101 [3] and the protein sequence similarity between proteins across networks are computed by BLAST [20]. The human PPI network has 4,278 proteins and 6,446 interactions, and the yeast PPI network has 5,138 proteins and 22,835 interactions. Similarly, the IntActYH dataset contains yeast and human PPI networks extracted from the IntAct database version 20150120 [4] and the corresponding protein sequence similarity across networks. The human PPI network has 23,246 proteins and 106,031 interactions, and the yeast PPI network has 6,392 proteins and 78,287 interactions. We compared ClusterM with the stateoftheart algorithms—AlignNemo [17], AlignMCL [21], MaWISh [14], NetworkBLAST [22], and NetworkBLASTM [23]. The selection of parameters of different methods is discussed in the Methods section. We assessed the quality of the identified conserved protein complexes based on the yeasthuman reference conserved protein complexes generated from handcurated yeast and human protein complex data obtained from CYC2008 [24] and the Comprehensive Resource of Mammalian protein complexes (CORUM) [25]. The detailed description of the reference complexes can be found in the Methods section.
In this study, we define a composite score for performance evaluation, which consists of (i) the fraction of the reference conserved protein complexes matched by at least one identified conserved protein complex, (ii) the accuracy score [6], and (iii) the maximum matching ratio for conserved complexes (MMRC). The description of these three evaluation metrics is detailed in the Methods section.
As shown in Fig. 1a, ClusterM outperformed all competing algorithms by a large margin on every evaluation metric for both datasets. Figure 1b and c illustrate two conserved protein complexes identified by ClusterM that cannot be fully recovered by other methods. Clearly, ClusterM utilizes the network topology to facilitate the detection of conserved protein complexes without considering the restriction shared by most of the algorithms (AlignNemo, AlignMCL, NetworkBLAST, and MaWISh), which is only proteins with similar sequences across species can exist in the predicted conserved complexes. NetworkBLASTM does not have such a restriction but failed to characterize the property of the conserved complexes.
We discuss two specific conserved protein complexes in Figs. 1b and c in detail to highlight the advantages of ClusterM over other competing methods. In Fig. 1b, there are a relatively larger number of unidentified proteins in the predicted conserved protein complex detected by AlignNemo. The precision of the identified complex is also small as it includes a larger number of irrelevant proteins such as Ing2, P06897, and ZNHIT1 in the conserved protein complex in human PPI network and VPS71 and HTA1 in yeast PPI network. However, ClusterM correctly predicted a larger number of proteins in the NuA4 histone acetyltransferase protein complex with a higher precision. One possible explanation for the significant difference is that ClusterM can correctly identify the proteins without a sequence similarity by effectively utilizing the sequence information and topological structure of PPI networks. On the contrary, AlignNemo failed to identify the proteins in the candidate complexes if they do not have corresponding homology with high sequence similarity. This illustration clearly shows the distinctive advantages of ClusterM over the competing methods to identify conserved protein complex across different PPI networks. In Fig. 1c, AlignMCL also failed to identified the proteins without homology information. Moreover, the predicted conserved complex in human PPI network is illconnected. That is, only POP4 and POP5 are connected and RPP21 and RPP30 are not connected. This illustration shows that AlignMCL could have a limitation to effective utilization of the topological separability of the conserved protein complexes and AlignMCL can fail to identify the inserted (or deleted) proteins in the conserved complex if there are insufficient or noisy sequence similarity information. In Fig. 2, we show other representative examples of conserved protein complexes detected by ClusterM, which cannot be correctly identified by other algorithms. We have also applied Gene Ontology (GO) enrichment analysis to every identified conserved complexes and the results indicate that ClusterM achieves remarkably higher coverage (i.e., total number of proteins in the predicted complexes) than all competing methods and more than 90% of the predicted complexes identified by ClusterM are significantly enriched in certain GO terms (detailed results and discussions can be found in Supplementary Materials).
Robustness analysis
To investigate the degree to which the competing algorithms are sensitive to small changes in the input so that meaningful results can be derived even with noisy or incomplete data, we performed a robustness analysis of the algorithms considered in this study. For each algorithm, we optimized the parameters based on the original DIPYH dataset (yeast and human PPI from DIP database and the protein sequence similarity across species) and tested the performance on a perturbed DIPYH dataset, in which we rewired 10% edges in both yeast and human PPI networks through the MP procedure [26] to instill topological noise. Furthermore, we treated the protein sequence similarity between yeast and human as a bipartite network, and thereby perturbed 10% of the homologous relationships (i.e., a sequence similarity) also using the MP procedure [26]. To remove the randomness in the MP procedure, we generated 10 perturbed networks for each test case and reported the averaged scores.
Figure 3 shows the evaluation results—i.e., fraction, accuracy, and MMRC scores—based on four datasets: the noisefree dataset, the dataset with only topological noise, the dataset with only homology noise, and the dataset with both topological and homology noise. Each table corresponds to each dataset in Fig. 3. We first computed the composite score by summing over the fraction, accuracy and MMRC scores. Then, we computed the mean and standard deviation of the four composite scores. The mean of the composite score for ClusterM, MaWISh, NetworkBLAST (NB), AlignMCL, NetworkBLASTM (NBM), and AlignNemo were 0.890, 0.588, 0.552, 0.486, 0.405, and 0.368, respectively. The standard deviation of the composite score for ClusterM, MaWISh, NB, AlignMCL, NBM, and AlignNemo were 0.066, 0.075, 0.073, 0.079, 0.047, and 0.053, respectively. We can see that ClusterM achieves the largest mean and the third smallest standard deviation among all compared algorithms, which implies that ClusterM is relatively less sensitive to small changes in the input data as well as the parameter selection. Although AlignNemo achieved the smallest standard deviation, it showed the smallest composite score as well and there is a clear gap to the composite score of ClusterM. MaWISh attained the second best composite score for the noisefree dataset, but for noisy datasets, its performance was unstable. This unstable behavior may be due to its dependence on the choice of the seven parameters. Even for relatively small changes in the input, those seven parameters may have to be simultaneously (and significantly) changed to obtain the best performance, which may make it practically difficult to find the optimal parameters for large datasets in a robust manner. NB and NBM showed similar performance in term of the changes of the composite scores, and they turned out to be more sensitive to noise compared to ClusterM but less sensitive when compared to MaWISh. These results show that ClusterM is the most userfriendly algorithm among the compared methods, considering that it is robust to noise and that it does not require the joint optimization of multiple parameters to achieve good performance.
Conserved protein complex identification for multiple networks
In this section, we present and discuss a specific example that demonstrates the potential of ClusterM in identifying conserved protein complexes in multiple (more than two) PPI networks. We have constructed two PPI datasets for benchmarking protein complex identification across four PPI networks: DIPPPIs and IntActPPIs, based on the corresponding PPI networks and protein amino acids sequences for human (Hsa), yeast (Sce), fly (Dme), and worm (Cel) from DIP (version 20150101) and IntAct (version 20150120). The detailed description to construct PPI networks can be found in the Method section. Table 1 provides a basic statistical summary and the abbreviation of PPI networks. In Fig. 4a, we show the conserved proteasome core complex identified by ClusterM in IntActPPIs. This complex was completely missed by all other algorithms that were considered in this study. Proteins in light blue, green, pink, and purple represent proteins in yeast, human, fly, and worm, respectively. We further tested the statistical significance of each of the four protein complexes in the respective organisms based on the GO term annotated to proteasome core complex (GO:0005839). The pvalues associated to the corresponding yeast, human, fly, and worm complexes are 5.58E41, 2.80E48, 4.88E16, and 5.48E9, respectively. In Fig. 4b, the spliceosomal complex (GO:0005681) is significantly enriched in the yeast, human, fly, and worm conserved complex with the pvalues 2.37E08, 4.16E12, 6.03E11, and 2.97E08, respectively. Note that red solid lines are PPIs missed in IntAct database but we can identify the interactions in STRING database [27]. The enrichment analysis of these conserved protein complexes, which could be detected solely by ClusterM, clearly shows that ClusterM can effectively mine biologically meaningful protein complexes in multiple largescale PPI networks.
Highlevel GO term consistency
To examine the biological significance of the conserved protein complexes identified by ClusterM and other existing algorithms, we calculated the coverage and the mean normalized entropy (MNE) score [7] based on highlevel GO annotations for the results obtained by ClusterM, MaWISh, AlignMCL, and NetworkBLASTM. Note that the lower MNE indicates that the proteins in the predicted conserved complex are more consistent in terms of GO annotations. We selected the three algorithms—MaWISh, AlignMCL, and NetworkBLASTM—as they show the best performance among all compared algorithms except for ClusterM. For the blanks in Tables 2 and 3, MaWISh and AlignMCL cannot deal with more than 3 networks. We applied these four algorithms to identify conserved protein complexes for every combination of PPI networks in DIPPPIs, where the obtained results are presented in Table 2. The evaluation results based on IntActPPIs datasets can be found in Table 3. Both tables reveal that the conserved protein complexes identified by ClusterM consistently achieve lower MNE and higher coverage scores, where it means that ClusterM can predict a larger number of proteins with a high functional consistency.
Discussion
In this paper, we proposed a scalable algorithm—ClusterM—that can identify conserved protein complexes by integrating protein sequence information and topological structure of the PPI networks. As demonstrated by our results, ClusterM can better capture the desired topological property of a typical conserved protein complex, which is densely connected within the complex while being wellseparated from the rest. Experimental results based on realworld PPI networks and protein complexes show that ClusterM significantly outperforms other stateoftheart algorithms in the task of identifying conserved protein complexes. Additionally, the conserved protein complexes identified by ClusterM have been shown to boast better highlevel GO term consistency compared to other competing algorithms.
ClusterM is an enhanced approach that directly considers the characteristic topological structure of conserved protein complexes, which are typical densely connected within the complex and wellseparated from the rest of the network. Unlike ClusterM, existing stateoftheart algorithms (e.g., NetworkBLAST, NetworkBLASTM, and MaWISh) focus on the interaction density of the conserved protein complex but do not explicitly consider the separability of conserved complexes from the rest of the PPI network. Furthermore, compared to algorithms such as NetworkBLAST and MaWISh, which only consider proteins with homology correspondence (determined based on protein sequence similarity) across PPI networks to reduce the overall computational complexity, ClusterM does not impose such restriction and integrates the topology and homology information in a very flexible manner. Finally, another important advantage of ClusterM is that the algorithm can easily handle multiple (more than two) PPI networks and yield biologically meaningful results. While NetworkBLASTM can also deal with multiple PPI networks, NetworkBLASTM is not capable of handling largescale networks.
Conclusions
We propose a scalable algorithm—ClusterM—that explicitly characterizes the desired topological separability of protein complexes and also incorporates sequence similarity of proteins across the given PPI networks. Thanks to the computational framework used by ClusterM, it can easily handle multiple PPI networks at the same time. We have extensively compared ClusterM with other stateoftheart algorithms on various of PPI networks. The experiments show its outperformance over other methods. In addition, ClusterM shows its potential on analyzing four PPI networks and identifying conversed protein complexes that have not been identify before.
Materials and methods
The ClusterM algorithm
Topological separability by conductance
One of the major innovations in ClusterM as well as its core strength is that it explicitly considers topological separability when searching for conserved protein complexes. In this paper, we adopt the definition of conductance to measure the topological separability of a subnetwork. Let G=(V,E) represent a PPI network, where V denotes the set of proteins in G and E is the interaction set. A is the corresponding adjacency matrix of G, where the element A_{ij}=1 denotes the protein i interacts with the protein j and A_{ij}=0 otherwise. The degree matrix D of G is a diagonal matrix with D_{ii}=d_{i}, where d_{i} is the number of interactions connecting to the protein i.
For a subnetwork S as a potential protein complex, the conductance of S in G is defined as
where \(E(S, \bar {S})\) denotes the set of edges between S and the rest of the network \(\bar {S}\), and \(vol(S) = \sum _{i\in S} d_{i}\) is the number of interactions in S. As vol(S) is typically much smaller than the total number of interactions in G: vol(S)≪vol(V), indicating \(vol(S) = \textup {min}\left \{ vol(S), vol(\bar {S}) \right \}\), we have
where V_{S} is the vertex set containing all vertexes in the subnetwork S.
The algorithm
ClusterM builds on the intuition that conserved protein complexes should simultaneously possess the following two properties. First, topologically, conserved protein complexes in each PPI network should be well separated from the rest of the network and proteins within the complexes should be densely connected in order to give rise to a unique and specific biological form and function. Second, across species, there should exist many homologous proteins in the conserved complexes, which can be practically reflected by high overall protein sequence similarity. Given k PPI networks \(\mathbf {G}=\{\mathcal {G}_{1}, \mathcal {G}_{2},... \mathcal {G}_{k}\}\), where \(\mathcal {G}_{j}(\mathcal {V}_{j}, \mathcal {E}_{j})\) is the jth network with \(\mathcal {V}_{j}\) and \(\mathcal {E}_{j}\) denoting the corresponding proteins and interactions respectively, we use a binary adjacency matrix A_{j} to represent \(\mathcal {G}_{j}\) and use a diagonal matrix D_{j} to represent the degree matrix of \(\mathcal {G}_{j}\) with the number of interactions of each protein on its diagonal.
To identify conserved protein complexes with the aforementioned properties, our ClusterM algorithm takes three major steps:
1. Initial seeds with protein spines from multiple network alignment: In the first step, h homologous seeds can be identified as protein spines \(\mathcal {U} = \left \{u^{1}, u^{2},..., u^{h}\right \}\) by using a multiple network alignment method, where \(u^{i} = \left \{ \left (v_{1}^{i}, v_{2}^{i},..., v_{k}^{i}\right) \mid v_{1}^{i}\in \mathcal {V}_{1}, v_{2}^{i}\in \mathcal {V}_{2},..., v_{k}^{i}\in \mathcal {V}_{k}\right \}\) constitutes the ith protein spine detected through network alignment. In this work, we adopt SMETANA [28], which has been shown to be accurate and scalable. When detecting these conserved protein spines, SMETANA [28] takes both protein interaction and sequence information into account. By iterating through all protein spines in \(\mathcal {U}\), ClusterM adopts a divideandconquer strategy to identify potential conserved protein complexes from each protein spine seed across the given networks.
2. Minimumconductance set for topological separability (Task 1): First, for protein \(v_{j}^{i}\) in protein spine u^{i}, a minimumconductance protein set \(\mathcal {\hat {H}}_{j}^{i}\) including \(v_{j}^{i}\), well separated from the rest of the network \(\mathcal {G}_{j}\), is identified based on a novel local optimization algorithm. Initially, we acquire a set \(\mathcal {H}_{j}^{i}\) consisting of m proteins obtained in terms of the ranking of the personalized PageRank vector with respect to \(v_{j}^{i}\) in \(\mathcal {G}_{j}\) [29]. We further refine the results to yield \(\mathcal {\hat {H}}_{j}^{i}\) based on the definition of the conductance (2) by solving the following optimization problem:
where x is a binary vector with x_{i}=1 indicating that protein i is assigned into \(\mathcal {\hat {H}}_{j}^{i}\) and x_{i}=0 otherwise; \(A^{\mathcal {H}_{j}^{i}}\) and \(D^{\mathcal {H}_{j}^{i}}\) are adjacency and degree matrices of the induced subnetwork with respect to the protein set \(\mathcal {H}_{j}^{i}\). The problem can be solved by transforming it into a mixed integer program (MIP) [30]. After algebraic manipulations, (3) can be transformed into the following equivalent MIP formulation:
where \(d^{\mathcal {H}_{j}^{i}}_{i}\) is the ith elements on the diagonal of \(D^{\mathcal {H}_{j}^{i}}\). After using standard linearization techniques [30] to linearize the terms such as zx_{i} and x_{i}x_{j}, the optimization problem can be solved exactly by existing MIP solvers. We emphasize that we here obtain the exact minimumconductance set around the selected seed, which is critical to identify potential protein complexes with high topological separability from the rest of the networks. Because ClusterM searches for the minimumconductance set locally around each involved protein, and since the size of each protein set \(\mathcal {H}_{j}^{i}\) is much smaller than the size of the entire PPI network in such a divideandconquer strategy, we can efficiently obtain the minimumconductance set \(\mathcal {\hat {H}}_{j}^{i}\) in \(\mathcal {H}_{j}^{i}\) based on (4) [30].
Here, \(\mathcal {\hat {H}}_{j}^{i}\) is the subnetwork including protein \(v_{j}^{i}\) that is well separated from the other proteins in the PPI network \(\mathcal {G}_{j}\). The procedure of identifying \(\mathcal {\hat {H}}_{j}^{i}\) explicitly characterizes the external separability of the subnetwork \(\mathcal {\hat {H}}_{j}^{i}\), which is the main advantage of ClusterM over other existing network comparative approaches [13–15, 17, 18] that allows the algorithm to effectively capture the desired topological property of conserved protein complexes.
3. Conserved protein complexes with desired topological and homologous properties (Task 2): The next task is to collect \(\mathbf {H}^{i} = \left \{ \mathcal {\hat {H}}_{1}^{i},..., \mathcal {\hat {H}}_{k}^{i} \right \}\) for the corresponding protein spine u^{i} with \(\mathbf {S}^{i} = \{ s(p,q) \mid p \in \mathcal {\hat {H}}_{m}^{i}, q \in \mathcal {\hat {H}}_{n}^{i}, \ m\ne n, \ \forall p,q,m,n \}\) denoting potential homologous correspondence within H^{i}. The homologous correspondence between proteins p and q can be approximated by the protein sequence similarity s(p,q)∈[0,1] [31]:
where blast(p,q) is the bit score of the sequence similarity obtained by the local sequence alignment tool BLAST [20]. We note that better homologous correspondence by incorporating functional annotations may further improve the performance of conserved protein complex identification. However, we do not explore such directions to avoid biased performance evaluation due to repeated usage of functional annotations in our algorithm and evaluation metrics.
To detect conserved complexes with high interaction density as well as high sequence similarity between proteins across different species, we propose to optimize the following cost function for protein spine u^{i}:
where the binary value \(A^{\mathcal {\hat {H}}_{j}^{i}}(a,b)\) indicates whether there is an interaction between proteins a and b in the subnetwork induced by the protein set \(\mathcal {\hat {H}}_{j}^{i}\), and δ_{a} is an indicator function with δ_{a}=1 if protein a appears in the identified conserved protein complex with respect to u^{i} and δ_{a}=0 otherwise. The first term is essentially the summation of the interaction density [32] of each subnetwork, which characterizes the topological cohesiveness of each subnetwork. The second term characterizes the homology correspondence represented by the summation of the protein sequence similarities across subnetworks divided by the total number of proteins in the identified conserved protein complex. The coefficient λ balances the contributions from network topology and protein homology information. We optimize the cost function (6) in a greedy manner by first recruiting all the proteins in H^{i}, and greedily removing proteins based on their contributions to the cost function until further deletion does not reduce the cost function any more. The greedy algorithm outputs the conserved protein complex \(\mathcal {C}^{i}\) with respect to the protein spine u^{i}.
Figure 5 illustrates how ClusterM handles a protein spine in a pair of networks to search for conserved protein complex locally. Starting from the protein spine in the red dashed line, ClusterM first identifies the proteins in blue based on the personalized PageRank vectors. It further refines the results for separability by MIP (3) and recognize the proteins in the gray shade to be the subnetworks that are well separated from the rest of the corresponding networks. Finally, ClusterM finds the conserved protein complex in golden color based on the cost function (6).
In the postprocessing step of ClusterM, we remove duplicated conserved protein complexes, and delete the ith conserved complex \(\mathcal {C}^{i}\) if the summation of the sequence similarity score (\(\sum _{(p,q) \in \mathcal {C}^{i}} s(p,q)\)) within \(\mathcal {C}^{i}\) is lower than β (set to β=0.1 in this work).
Scalability and parallelism
The major computation in ClusterM involves SMETANA for identifying homologous seeds, approximating the PageRank vector near a protein, solving the MIP (3), and the greedy algorithm to optimize (6). The scalability of SMETANA has been demonstrated in [28]. The computational complexity for approximating the PageRank vector near a protein is proportional to the number of involved local neighbors [29]. There is no scalability issue for solving the MIP (3) because we only consider m local proteins near the protein spine for individual PPI networks. Both the time and space complexities of the greedy algorithm based on (6) are O(k^{2}m^{2}), where k is the number of species. Therefore, due to the divideandconquer strategy, ClusterM has better scalability for handling multiple species.
For each protein spine u^{i}, the identification of the wellseparated subnetworks \(\mathcal {\hat {H}}_{1}^{i},..., \mathcal {\hat {H}}_{k}^{i}\) are independent to each other. Furthermore, the protein spines and the identification of the potential conserved protein complexes are also independent. Therefore, the computation of \(\mathcal {\hat {H}}_{1}^{i},..., \mathcal {\hat {H}}_{k}^{i}\) and the detection of conserved protein complexes can be easily parallelized to further reduce the total computational time for ClusterM.
Datasets
Construction of DIPPPIs and IntActPPIs
We download PPI networks and protein amino acids sequences for human, yeast, fly, and worm from DIP (version 20150101) and IntAct (version 20150120). To obtain a consistent protein symbol for PPI networks and conserved protein complexes, we change all protein names utilized in this study to UniProt protein symbols. The sequence similarity for each protein pair can be obtained using a local sequence alignment tool BLAST and we select the highest BLAST bit score for each protein pair as their sequence similarity score. Then, we normalize the BLAST bit score based on the equation (5) to obtain protein homologous correspondence and threshold the sequence similarity score at 0.1.
The yeasthuman reference conserved protein complexes
We align yeast complexes in CYC2008 [24] and human complexes in CORUM (February 2012) based on GO terms to obtain the gold standard for yeast and human conserved protein complexes. CYC2008 provides a specific GO term associated to every collected yeast complex. For human complex in CORUM, we first download the mapping of human genes and proteins to GO terms (version 20150329) [33] and annotate proteins in CORUM with GO terms that also appear in CYC2008. If a human complex contains at least half the number of proteins annotated to a GO term in CYC2008, we align the human complex and the yeast complex with the same GO term annotation.
Highlevel GO terms
When analyzing biological consistency, we only consider the highlevel GO terms, which suggest specific biological functions. A GO term is defined as highlevel if its information content is larger than two. The definition of the information content of a GO term g is IC=−log(g/root) [34], where “root” is the corresponding root GO term (either biological process, molecular function, or cellular component) of g and the operation · counts the number of proteins annotated to a specified GO term. Additionally, we remove GO terms “inferred from electronic annotation”, “inferred from protein interactions”, and “inferred from sequence or structural similarity”, because we utilize proteinprotein interactions and the protein sequence similarities in our algorithm.
Metrics for conserved protein complex prediction for pairwise PPI networks
The conserved protein complexes from pairwise PPI network alignment contain proteins from two species. Therefore, we need to examine the correspondence between the reference and the identified conserved complexes species by species. For all competing algorithms, we remove the identified conserved protein complexes if they contain fewer than three proteins for any species.
Fraction of the matched reference conserved protein complexes
For two species, given a reference conserved complex A={A_{1},A_{2}} and an identified conserved complex B={B_{1},B_{2}}, where A_{1} and B_{1} are complexes in the first species and A_{2} and B_{2} are complexes in the second species, we consider B matches A if the neighborhood affinity scores for complexes w(A_{1},B_{1})≥0.25 and w(A_{2},B_{2})≥0.25. The neighborhood affinity score for complexes can be calculated by
where X and Y are two protein sets. Therefore, we can compute the fraction of the number of matched reference conserved complexes.
Assuming the reference conserved protein complexes set is R and the identified conserved protein complexes set is P, then the fraction of the number of matched reference conserved complexes can be computed as
where C={S  w(S_{1},T_{1})≥0.25 and w(S_{2},T_{2})≥0.25, ∀S∈R, T∈P}.
Accuracy score
Suppose that we have n reference conserved complex \(A^{j} = \left \{A^{j}_{1}, A^{j}_{2}\right \}, j=1,2,...,n\), and m identified complex \(B^{i} = \left \{ B^{i}_{1}, B^{i}_{2}\right \}, i=1,2,...,m\), and each has complexes \(A^{j}_{1}\) and \(B^{i}_{1}\) from the first species and another complexes \(A^{j}_{2}\) and \(B^{i}_{2}\) in the second species. Let t_{ij} denotes the number of proteins that exist in both the reference complex A^{j} and the identified complex B^{i}, and w_{j} represent the number of proteins in the jth reference complex. That is, \(t_{ij} = \left  A^{j}_{1} \bigcap B^{i}_{1} \right  + \left  A^{j}_{2} \bigcap B^{i}_{2} \right \) and \(w_{j} = \left  A^{j}_{1} \bigcup A^{j}_{2} \right \). Then, complexwise sensitivity (Sn) and positivepredictive value (PPV) can be defined as
The geometric accuracy (Acc) score is the geometric mean of Sn and PPV: \(\textup {Acc}=\sqrt {\textup {Sn} \times \textup {PPV}}\).
The maximum matching ratio for conserved protein complexes (MMRC)
We can quantify the overlap between the reference and the identified conserved complexes by the maximum matching ratio for conserved complexes (MMRC). MMRC is derived based on MMR [6]. The maximum matching ratio [6] is the maximum sum of weights of edges in a bipartite graph, where the two sets of nodes represent reference complexes C and identified complexes S. The bipartite graph is represented by a weighted matrix B_{n×m}, where each weight B_{ij} is the neighborhood affinity score w(c_{i},s_{j}) introduced earlier for the corresponding edge between complexes c_{i} and s_{j}. For efficiency, we only use B_{ij}≥0.25. The MMR is the solution to the following maximal matching problem.
where σ is an indicator function with \(\sigma _{c_{i}, s_{j}} = 1\) when the edge between complexes c_{i} and s_{j} is selected and \(\sigma _{c_{i}, s_{j}} = 0\) otherwise.
MMRC can be obtained by modifying the overlapping weights w(c_{i},s_{j}) between reference and identified conserved complexes. The overlapping weight between reference conserved complex A and identified conserved complex B is defined as
Metrics for GO term consistency and coverage
We measure the functional consistency for the proteins in the conserved complexes by computing the mean normalized entropy (MNE) [7, 10] scores. We use the highlevel GO term set F to annotate each protein in a conserved complex R_{i}. The union of GO terms used for R_{i} is F_{i}={f_{1},f_{2},...,f_{d}}. The normalized entropy (NE) of R_{i} is computed as
where p_{i} is the fraction of R_{i} with respect to the GO term f_{i}. The MNE is the mean value over all NE (R_{i}). For the coverage, we simply count the number of proteins in all conserved complexes, each of which consists of at least three proteins from each network.
Parameter selection for each method
Among the tested algorithms, ClusterM and NetworkBLASTM have only a single tuning parameter, while NetworkBLAST and MaWISh have 5 and 7 tuning parameters, respectively. In order to obtain the best performing parameters for fair comparison of the selected algorithms, we have selected the best performing parameter(s) for each algorithm by grid search. Within at most N combinations of k parameters (p_{1},p_{2},...,p_{k}), we sample \(n=\left \lfloor \sqrt [k]{N} \right \rfloor \) values for each parameter. The sample values are uniformly distributed between [min_{i},max_{i}] for each parameter p_{i}. For all the competing algorithms, we set N=100 and report the results with the best performing parameters.
Implementation details
The MATLAB code and all used data are available at http://www.ece.tamu.edu/~xqian/ClusterM/. ClusterM is computationally efficient for handling multiple genomescale PPI networks from different species. The memory consumption of ClusterM depends on the size of the subnetworks to be identified and the runtime of ClusterM can be significantly reduced by parallelization. It takes about 40 minutes for ClusterM to handle four PPI networks in IntActPPIs with 50,876 proteins in total on a laptop computer (16 GB memory and Intel i7 2.9 GHz cpu) with 2core parallelization.
Availability of data and materials
The MATLAB code and all used data are available at http://www.ece.tamu.edu/~xqian/ClusterM/.
Abbreviations
 PPI:

ProteinProtein Interaction
 Sce :

Saccharomyces cerevisiae
 Dme :

Drosophila melanogaster
 Cel :

Caenorhabditis elegans
 Hsa :

Homo sapiens
 MMRC:

The maximum matching ratio for conserved complexes
 MNE:

The mean normalized entropy
 GO:

Gene Ontology
 NB:

NetworkBLAST
 NBM:

NetworkBLASTM
References
 1
Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive twohybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA. 2001; 98(8):4569–74.
 2
WH D, M M, AC G. Affinitypurification coupled to mass spectrometry: basic principles and strategies. Proteomics. 2012; 12(10):1576–90.
 3
Salwinski L, Miller C, Smith A, Pettit F, JU JB, Eisenberg D. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004; 32:449–51.
 4
Kerrien S, Aranda B, Breuza L, et al.The intact molecular interaction database in 2012. Nucleic Acids Res. 2012; 40(D1):841–6.
 5
Stark C, Breitkreutz B, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: A general repository for interaction datasets. Nucleic Acids Res. 2006; 34:535–9.
 6
Nepusz T, Yu H, Paccanaro A. Detecting overlapping protein complexes in proteinprotein interaction networks. Nat Methods. 2012; 9(5):471–472.
 7
Liao C, Lu K, Baym M, Singh R, Berger B. IsoRankN: Spectral methods for global alignment of multiple protein networks. Bioinformatics. 2009; 25:253–8.
 8
Aladag AE, Erten C. Spinal: scalable protein interaction network alignment. Bioinformatics. 2013; 29(7):917–24.
 9
Hasan MM, Kahveci T. Indexing a proteinprotein interaction network expedites network alignment. BMC Bioinformatics. 2015;16:326.
 10
Singh R, Xu J, Berger B. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc Natl Acad Sci USA. 2008; 105(35):12763–8.
 11
Clark C, Kalita J. A multiobjective memetic algorithm for ppi network alignment. Bioinformatics. 2015; 31(12):1988–98.
 12
Jeong H, Yoon BJ. Accurate multiple network alignment through contextsensitive random walk. BMC Syst Biol. 2015; 9 Suppl 1:7.
 13
Sharan R, Suthram S, Kelley RM, et al.Conserved patterns of protein interaction in multiple species. Proc Natl Acac Sci USA. 2005; 102:1974–9.
 14
Koyuturk M, Grama A, Szpankowski W. Pairwise Local Alignment of Protein Interaction Networks Guided by Models of Evolution. In: RECOMB 2005. Berlin, Heidelberg: Springer: 2005. p. 48–65.
 15
Kalaev M, Bafna V, Sharan R. Fast and accurate alignment of multiple protein networks. J Comput Biol. 2009; 16(8):989–99.
 16
Yoon BJ, Qian X, Sahraeian SME. Comparative analysis of biological networks: hidden markov model and markov chainbased approach. IEEE Signal Proc Mag. 2012; 29(1):22–34.
 17
Ciriello G, Mina M, Guzzi PH, Cannataro M, Guerra C. Alignnemo: A local network alignment method to integrate homology and topology. PLoS ONE. 2012; 7(6):38107.
 18
Hirsh E, Sharan R. Identification of conserved protein complexes based on a model of protein network evolution. Bioinformatics. 2007; 23(2):170–6.
 19
Berg J. Structure and evolution of protein interaction networks: a statistical model for link dynamics and gene duplications. BMC Evol Biol. 2004; 4(51). https://doi.org/10.1186/14712148451.
 20
Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning dna sequences. J Comput Biol. 2000; 7(12):203–14.
 21
Mina M, Guzzi PH. Alignmcl: Comparative analysis of protein interaction networks through markov clustering. In: Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference On. IEEE: 2012. p. 174–81. https://doi.org/10.1109/BIBMW.2012.6470300.
 22
Sharan R, Suthram S, Kelley R, Kuhn T, McCuine S, Uetz P, Sittler T, Karp R, Ideker T. Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci USA. 2005; 102(6):1974–9.
 23
Kalaev M, Bafna V, Sharan R. Fast and accurate alignment of multiple protein networks. In: Proc of the 10th Annu Int Conf Res Comput Mol Bio (RECOMB 2008): 2008. https://doi.org/10.1007/9783540788393_21.
 24
Pu S, Wong J, Turner B, et al.Uptodate catalogues of yeast protein complexes. Nucleic Acids Res. 2009; 37(3):825–31.
 25
Ruepp A, Brauner B, DungerKaltenbach I, et al.Corum: The comprehensive resource of mammalian protein complexes. Nucleic Acids Res. 2008; 36:646–50.
 26
Maslov S, Sneppen K. Specificity and stability in topology of protein networks. Science. 2002; 296:910–3.
 27
Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, HuertaCepas J, Simonovic M, Roth A, Santos A, Tsafou KP, et al.String v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 2014; 43(D1):447–52.
 28
Sahraeian SME, Yoon BJ. Smetana: Accurate and scalable algorithm for probabilistic alignment of largescale biological networks. PLoS ONE. 2013; 8(7):67995.
 29
Andersen R, Chung F, Lang K. Local Graph Partitioning Using PageRank Vectors. In: 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06): 2006. p. 475–86. https://doi.org/10.1109/FOCS.2006.44.
 30
Fan N, Pardalos P. Multiway clustering and biclustering by the ratio cut and normalized cut in graphs. J Comb Optim. 2010; 23(2):224–51.
 31
Wang Y, Qian X. Joint clustering of protein interaction networks through markov random walk. BMC Syst Biol. 2014; 8(suppl 1):9.
 32
Corneil DG, Perl Y. Clustering and domination in perfect graphs. Discret Appl Math. 1984; 9(1):27–39.
 33
Ashburner M, Ball C, Blake J, et al.Gene ontology: Tool for the unification of biology. the gene ontology consortium. Nat Genet. 2000; 25(1):25–9.
 34
Shih YK, Parthasarathy S. Identifying functional modules in interaction networks through overlapping markov clustering. Bioinformatics. 2012; 28(18):1473–9.
Funding
This work was supported by the National Science Foundation (NSF) Grants 1447235, 1553281, and 1812641, as well as the United States Department of Agriculture National Institute of Food and Agriculture competitive grant USDANIFASCRI20175118126834 through the National Center of Excellence for Melon at the Vegetable and Fruit Improvement Center of Texas A&M University. This work was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF2019R1G1A1004803). Publication costs are funded by NSF Grant 1812641.
Author information
Affiliations
Contributions
Conceived the method: YW, HJ, XQ, BJY. Developed the algorithm and performed the simulations: YW, HJ. Analyzed the results and wrote the paper: WY, HJ, XQ, BJY. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
All the authors approve the publication of the presented work.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wang, Y., Jeong, H., Yoon, BJ. et al. ClusterM: a scalable algorithm for computational prediction of conserved protein complexes across multiple protein interaction networks. BMC Genomics 21, 615 (2020). https://doi.org/10.1186/s12864020070101
Published:
Keywords
 Comparative network analysis
 Multiple network alignment and clustering
 Conserved module identification