The ClusterM algorithm
Topological separability by conductance
One of the major innovations in ClusterM as well as its core strength is that it explicitly considers topological separability when searching for conserved protein complexes. In this paper, we adopt the definition of conductance to measure the topological separability of a subnetwork. Let G=(V,E) represent a PPI network, where V denotes the set of proteins in G and E is the interaction set. A is the corresponding adjacency matrix of G, where the element Aij=1 denotes the protein i interacts with the protein j and Aij=0 otherwise. The degree matrix D of G is a diagonal matrix with Dii=di, where di is the number of interactions connecting to the protein i.
For a subnetwork S as a potential protein complex, the conductance of S in G is defined as
$$ \phi(S) = \frac{|E(S, \bar{S})|}{\textup{min}\left \{ vol(S), vol(\bar{S}) \right \} }, S \cup \bar{S} = V, $$
(1)
where \(E(S, \bar {S})\) denotes the set of edges between S and the rest of the network \(\bar {S}\), and \(vol(S) = \sum _{i\in S} d_{i}\) is the number of interactions in S. As vol(S) is typically much smaller than the total number of interactions in G: vol(S)≪vol(V), indicating \(vol(S) = \textup {min}\left \{ vol(S), vol(\bar {S}) \right \}\), we have
$$ \phi(S) = \frac{|E(S, \bar{S})|}{vol(S)}=\frac{\sum_{i,j\in V_{S}} D_{ij} - A_{ij}}{\sum_{i\in V_{S}} D_{ii}}, $$
(2)
where VS is the vertex set containing all vertexes in the subnetwork S.
The algorithm
ClusterM builds on the intuition that conserved protein complexes should simultaneously possess the following two properties. First, topologically, conserved protein complexes in each PPI network should be well separated from the rest of the network and proteins within the complexes should be densely connected in order to give rise to a unique and specific biological form and function. Second, across species, there should exist many homologous proteins in the conserved complexes, which can be practically reflected by high overall protein sequence similarity. Given k PPI networks \(\mathbf {G}=\{\mathcal {G}_{1}, \mathcal {G}_{2},... \mathcal {G}_{k}\}\), where \(\mathcal {G}_{j}(\mathcal {V}_{j}, \mathcal {E}_{j})\) is the jth network with \(\mathcal {V}_{j}\) and \(\mathcal {E}_{j}\) denoting the corresponding proteins and interactions respectively, we use a binary adjacency matrix Aj to represent \(\mathcal {G}_{j}\) and use a diagonal matrix Dj to represent the degree matrix of \(\mathcal {G}_{j}\) with the number of interactions of each protein on its diagonal.
To identify conserved protein complexes with the aforementioned properties, our ClusterM algorithm takes three major steps:
1. Initial seeds with protein spines from multiple network alignment: In the first step, h homologous seeds can be identified as protein spines \(\mathcal {U} = \left \{u^{1}, u^{2},..., u^{h}\right \}\) by using a multiple network alignment method, where \(u^{i} = \left \{ \left (v_{1}^{i}, v_{2}^{i},..., v_{k}^{i}\right) \mid v_{1}^{i}\in \mathcal {V}_{1}, v_{2}^{i}\in \mathcal {V}_{2},..., v_{k}^{i}\in \mathcal {V}_{k}\right \}\) constitutes the ith protein spine detected through network alignment. In this work, we adopt SMETANA [28], which has been shown to be accurate and scalable. When detecting these conserved protein spines, SMETANA [28] takes both protein interaction and sequence information into account. By iterating through all protein spines in \(\mathcal {U}\), ClusterM adopts a divide-and-conquer strategy to identify potential conserved protein complexes from each protein spine seed across the given networks.
2. Minimum-conductance set for topological separability (Task 1): First, for protein \(v_{j}^{i}\) in protein spine ui, a minimum-conductance protein set \(\mathcal {\hat {H}}_{j}^{i}\) including \(v_{j}^{i}\), well separated from the rest of the network \(\mathcal {G}_{j}\), is identified based on a novel local optimization algorithm. Initially, we acquire a set \(\mathcal {H}_{j}^{i}\) consisting of m proteins obtained in terms of the ranking of the personalized PageRank vector with respect to \(v_{j}^{i}\) in \(\mathcal {G}_{j}\) [29]. We further refine the results to yield \(\mathcal {\hat {H}}_{j}^{i}\) based on the definition of the conductance (2) by solving the following optimization problem:
$$ \begin{aligned} \textup{min:} & \ \frac{\mathbf{x}^{T}\left(D^{\mathcal{H}_{j}^{i}}-A^{\mathcal{H}_{j}^{i}}\right)\mathbf{x}}{\mathbf{x}^{T}D^{\mathcal{H}_{j}^{i}}\mathbf{x}}\\ s.t. & \ \mathbf{x}_{v_{j}^{i}} = 1, \ \mathbf{x}_{i} \in \{0,1\}, \end{aligned} $$
(3)
where x is a binary vector with xi=1 indicating that protein i is assigned into \(\mathcal {\hat {H}}_{j}^{i}\) and xi=0 otherwise; \(A^{\mathcal {H}_{j}^{i}}\) and \(D^{\mathcal {H}_{j}^{i}}\) are adjacency and degree matrices of the induced subnetwork with respect to the protein set \(\mathcal {H}_{j}^{i}\). The problem can be solved by transforming it into a mixed integer program (MIP) [30]. After algebraic manipulations, (3) can be transformed into the following equivalent MIP formulation:
$$ \begin{aligned} \textup{min:}& \ \ z\\ s.t. & \ \ z\sum_{i} \mathbf{x}_{i}d^{\mathcal{H}_{j}^{i}}_{i} - \sum_{i} \sum_{j} (D^{\mathcal{H}_{j}^{i}} - A^{\mathcal{H}_{j}^{i}})\mathbf{x}_{i}\mathbf{x}_{j} \geq 0, \\ & \ \ \mathbf{x}_{v_{j}^{i}} = 1, \mathbf{x}_{i} \in \{0,1\}, \end{aligned} $$
(4)
where \(d^{\mathcal {H}_{j}^{i}}_{i}\) is the ith elements on the diagonal of \(D^{\mathcal {H}_{j}^{i}}\). After using standard linearization techniques [30] to linearize the terms such as zxi and xixj, the optimization problem can be solved exactly by existing MIP solvers. We emphasize that we here obtain the exact minimum-conductance set around the selected seed, which is critical to identify potential protein complexes with high topological separability from the rest of the networks. Because ClusterM searches for the minimum-conductance set locally around each involved protein, and since the size of each protein set \(\mathcal {H}_{j}^{i}\) is much smaller than the size of the entire PPI network in such a divide-and-conquer strategy, we can efficiently obtain the minimum-conductance set \(\mathcal {\hat {H}}_{j}^{i}\) in \(\mathcal {H}_{j}^{i}\) based on (4) [30].
Here, \(\mathcal {\hat {H}}_{j}^{i}\) is the subnetwork including protein \(v_{j}^{i}\) that is well separated from the other proteins in the PPI network \(\mathcal {G}_{j}\). The procedure of identifying \(\mathcal {\hat {H}}_{j}^{i}\) explicitly characterizes the external separability of the subnetwork \(\mathcal {\hat {H}}_{j}^{i}\), which is the main advantage of ClusterM over other existing network comparative approaches [13–15, 17, 18] that allows the algorithm to effectively capture the desired topological property of conserved protein complexes.
3. Conserved protein complexes with desired topological and homologous properties (Task 2): The next task is to collect \(\mathbf {H}^{i} = \left \{ \mathcal {\hat {H}}_{1}^{i},..., \mathcal {\hat {H}}_{k}^{i} \right \}\) for the corresponding protein spine ui with \(\mathbf {S}^{i} = \{ s(p,q) \mid p \in \mathcal {\hat {H}}_{m}^{i}, q \in \mathcal {\hat {H}}_{n}^{i}, \ m\ne n, \ \forall p,q,m,n \}\) denoting potential homologous correspondence within Hi. The homologous correspondence between proteins p and q can be approximated by the protein sequence similarity s(p,q)∈[0,1] [31]:
$$ s(p, q) = \frac{blast(p, q)}{\sqrt{blast(p, p) \times blast(q, q)}}, $$
(5)
where blast(p,q) is the bit score of the sequence similarity obtained by the local sequence alignment tool BLAST [20]. We note that better homologous correspondence by incorporating functional annotations may further improve the performance of conserved protein complex identification. However, we do not explore such directions to avoid biased performance evaluation due to repeated usage of functional annotations in our algorithm and evaluation metrics.
To detect conserved complexes with high interaction density as well as high sequence similarity between proteins across different species, we propose to optimize the following cost function for protein spine ui:
$$ {}\begin{aligned} \mathcal{F}^{i}= -\left (\sum_{j=1}^{k}\frac{\sum\limits_{a,b\in \mathcal{\hat{H}}_{j}^{i}}{A^{\mathcal{\hat{H}}_{j}^{i}}}(a,b)\delta_{a} \delta_{b} }{\sum\limits_{c\in \mathcal{\hat{H}}_{j}^{i}} \delta_{c}}+\lambda \frac{\sum\limits_{j\ne l}\sum\limits_{a\in \mathcal{\hat{H}}_{j}^{i}}\sum\limits_{b\in\mathcal{\hat{H}}_{l}^{i}}s(a,b)\delta_{a} \delta_{b} }{\sum\limits_{j=1}^{k}\sum\limits_{c\in \mathcal{\hat{H}}_{j}^{i}} \delta_{c}} \right), \end{aligned} $$
(6)
where the binary value \(A^{\mathcal {\hat {H}}_{j}^{i}}(a,b)\) indicates whether there is an interaction between proteins a and b in the subnetwork induced by the protein set \(\mathcal {\hat {H}}_{j}^{i}\), and δa is an indicator function with δa=1 if protein a appears in the identified conserved protein complex with respect to ui and δa=0 otherwise. The first term is essentially the summation of the interaction density [32] of each subnetwork, which characterizes the topological cohesiveness of each subnetwork. The second term characterizes the homology correspondence represented by the summation of the protein sequence similarities across subnetworks divided by the total number of proteins in the identified conserved protein complex. The coefficient λ balances the contributions from network topology and protein homology information. We optimize the cost function (6) in a greedy manner by first recruiting all the proteins in Hi, and greedily removing proteins based on their contributions to the cost function until further deletion does not reduce the cost function any more. The greedy algorithm outputs the conserved protein complex \(\mathcal {C}^{i}\) with respect to the protein spine ui.
Figure 5 illustrates how ClusterM handles a protein spine in a pair of networks to search for conserved protein complex locally. Starting from the protein spine in the red dashed line, ClusterM first identifies the proteins in blue based on the personalized PageRank vectors. It further refines the results for separability by MIP (3) and recognize the proteins in the gray shade to be the subnetworks that are well separated from the rest of the corresponding networks. Finally, ClusterM finds the conserved protein complex in golden color based on the cost function (6).
In the post-processing step of ClusterM, we remove duplicated conserved protein complexes, and delete the ith conserved complex \(\mathcal {C}^{i}\) if the summation of the sequence similarity score (\(\sum _{(p,q) \in \mathcal {C}^{i}} s(p,q)\)) within \(\mathcal {C}^{i}\) is lower than β (set to β=0.1 in this work).
Scalability and parallelism
The major computation in ClusterM involves SMETANA for identifying homologous seeds, approximating the PageRank vector near a protein, solving the MIP (3), and the greedy algorithm to optimize (6). The scalability of SMETANA has been demonstrated in [28]. The computational complexity for approximating the PageRank vector near a protein is proportional to the number of involved local neighbors [29]. There is no scalability issue for solving the MIP (3) because we only consider m local proteins near the protein spine for individual PPI networks. Both the time and space complexities of the greedy algorithm based on (6) are O(k2m2), where k is the number of species. Therefore, due to the divide-and-conquer strategy, ClusterM has better scalability for handling multiple species.
For each protein spine ui, the identification of the well-separated subnetworks \(\mathcal {\hat {H}}_{1}^{i},..., \mathcal {\hat {H}}_{k}^{i}\) are independent to each other. Furthermore, the protein spines and the identification of the potential conserved protein complexes are also independent. Therefore, the computation of \(\mathcal {\hat {H}}_{1}^{i},..., \mathcal {\hat {H}}_{k}^{i}\) and the detection of conserved protein complexes can be easily parallelized to further reduce the total computational time for ClusterM.
Datasets
Construction of DIPPPIs and IntActPPIs
We download PPI networks and protein amino acids sequences for human, yeast, fly, and worm from DIP (version 20150101) and IntAct (version 20150120). To obtain a consistent protein symbol for PPI networks and conserved protein complexes, we change all protein names utilized in this study to UniProt protein symbols. The sequence similarity for each protein pair can be obtained using a local sequence alignment tool BLAST and we select the highest BLAST bit score for each protein pair as their sequence similarity score. Then, we normalize the BLAST bit score based on the equation (5) to obtain protein homologous correspondence and threshold the sequence similarity score at 0.1.
The yeast-human reference conserved protein complexes
We align yeast complexes in CYC2008 [24] and human complexes in CORUM (February 2012) based on GO terms to obtain the gold standard for yeast and human conserved protein complexes. CYC2008 provides a specific GO term associated to every collected yeast complex. For human complex in CORUM, we first download the mapping of human genes and proteins to GO terms (version 20150329) [33] and annotate proteins in CORUM with GO terms that also appear in CYC2008. If a human complex contains at least half the number of proteins annotated to a GO term in CYC2008, we align the human complex and the yeast complex with the same GO term annotation.
High-level GO terms
When analyzing biological consistency, we only consider the high-level GO terms, which suggest specific biological functions. A GO term is defined as high-level if its information content is larger than two. The definition of the information content of a GO term g is IC=−log(|g|/|root|) [34], where “root” is the corresponding root GO term (either biological process, molecular function, or cellular component) of g and the operation |·| counts the number of proteins annotated to a specified GO term. Additionally, we remove GO terms “inferred from electronic annotation”, “inferred from protein interactions”, and “inferred from sequence or structural similarity”, because we utilize protein-protein interactions and the protein sequence similarities in our algorithm.
Metrics for conserved protein complex prediction for pairwise PPI networks
The conserved protein complexes from pairwise PPI network alignment contain proteins from two species. Therefore, we need to examine the correspondence between the reference and the identified conserved complexes species by species. For all competing algorithms, we remove the identified conserved protein complexes if they contain fewer than three proteins for any species.
Fraction of the matched reference conserved protein complexes
For two species, given a reference conserved complex A={A1,A2} and an identified conserved complex B={B1,B2}, where A1 and B1 are complexes in the first species and A2 and B2 are complexes in the second species, we consider B matches A if the neighborhood affinity scores for complexes w(A1,B1)≥0.25 and w(A2,B2)≥0.25. The neighborhood affinity score for complexes can be calculated by
$$ w(X,Y) = \frac{\left | X \cap Y\right |}{\left | X \right |} \cdot \frac{\left | X \cap Y\right |}{\left | Y \right |} = \frac{\left | X \cap Y\right |^{2}}{\left | X \right |\left | Y \right |}, $$
(7)
where X and Y are two protein sets. Therefore, we can compute the fraction of the number of matched reference conserved complexes.
Assuming the reference conserved protein complexes set is R and the identified conserved protein complexes set is P, then the fraction of the number of matched reference conserved complexes can be computed as
$$ \textup{frac} = \frac{|C|}{|R|}, $$
(8)
where C={S | w(S1,T1)≥0.25 and w(S2,T2)≥0.25, ∀S∈R, T∈P}.
Accuracy score
Suppose that we have n reference conserved complex \(A^{j} = \left \{A^{j}_{1}, A^{j}_{2}\right \}, j=1,2,...,n\), and m identified complex \(B^{i} = \left \{ B^{i}_{1}, B^{i}_{2}\right \}, i=1,2,...,m\), and each has complexes \(A^{j}_{1}\) and \(B^{i}_{1}\) from the first species and another complexes \(A^{j}_{2}\) and \(B^{i}_{2}\) in the second species. Let tij denotes the number of proteins that exist in both the reference complex Aj and the identified complex Bi, and wj represent the number of proteins in the jth reference complex. That is, \(t_{ij} = \left | A^{j}_{1} \bigcap B^{i}_{1} \right | + \left | A^{j}_{2} \bigcap B^{i}_{2} \right |\) and \(w_{j} = \left | A^{j}_{1} \bigcup A^{j}_{2} \right |\). Then, complex-wise sensitivity (Sn) and positive-predictive value (PPV) can be defined as
$$ \textup{Sn} = \frac{\sum_{j=1}^{n} \underset{i=1,..., m}{max}t_{ij}}{\sum_{j=1}^{n} w_{j}}; \ \ \ \ \textup{PPV} = \frac{\sum_{i=1}^{m} \underset{j=1,..., n}{max}t_{ij}}{\sum_{i=1}^{m}\sum_{j=1}^{n}t_{ij}}. $$
(9)
The geometric accuracy (Acc) score is the geometric mean of Sn and PPV: \(\textup {Acc}=\sqrt {\textup {Sn} \times \textup {PPV}}\).
The maximum matching ratio for conserved protein complexes (MMRC)
We can quantify the overlap between the reference and the identified conserved complexes by the maximum matching ratio for conserved complexes (MMRC). MMRC is derived based on MMR [6]. The maximum matching ratio [6] is the maximum sum of weights of edges in a bipartite graph, where the two sets of nodes represent reference complexes C and identified complexes S. The bipartite graph is represented by a weighted matrix Bn×m, where each weight Bij is the neighborhood affinity score w(ci,sj) introduced earlier for the corresponding edge between complexes ci and sj. For efficiency, we only use Bij≥0.25. The MMR is the solution to the following maximal matching problem.
$$ \begin{aligned} \textup{max:} &\ \ \frac{1}{|C|}\sum_{i=1}^{n}\sum_{j=1}^{m} B_{ij} \sigma_{c_{i}, s_{j}}\\ s.t. & \ \ \sum_{j=1}^{m} \sigma_{c_{i}, s_{j}} \leq 1 \\ & \ \ \sum_{i=1}^{n} \sigma_{c_{i}, s_{j}} \leq 1, \end{aligned} $$
(10)
where σ is an indicator function with \(\sigma _{c_{i}, s_{j}} = 1\) when the edge between complexes ci and sj is selected and \(\sigma _{c_{i}, s_{j}} = 0\) otherwise.
MMRC can be obtained by modifying the overlapping weights w(ci,sj) between reference and identified conserved complexes. The overlapping weight between reference conserved complex A and identified conserved complex B is defined as
$$ o(A,B) = \frac{2w(A_{1},B_{1})\times w(A_{2},B_{2})}{w(A_{1},B_{1})+w(A_{2},B_{2})}. $$
(11)
Metrics for GO term consistency and coverage
We measure the functional consistency for the proteins in the conserved complexes by computing the mean normalized entropy (MNE) [7, 10] scores. We use the high-level GO term set F to annotate each protein in a conserved complex Ri. The union of GO terms used for Ri is Fi={f1,f2,...,fd}. The normalized entropy (NE) of Ri is computed as
$$ \textup{NE}(R_{i}) = \textup{NE}(p_{1}, p_{2},..., p_{d}) = -\frac{1}{\textup{log}(d)} \sum_{j} p_{j} \cdot \textup{log}\left(p_{j} \right), $$
(12)
where pi is the fraction of Ri with respect to the GO term fi. The MNE is the mean value over all NE (Ri). For the coverage, we simply count the number of proteins in all conserved complexes, each of which consists of at least three proteins from each network.
Parameter selection for each method
Among the tested algorithms, ClusterM and NetworkBLAST-M have only a single tuning parameter, while NetworkBLAST and MaWISh have 5 and 7 tuning parameters, respectively. In order to obtain the best performing parameters for fair comparison of the selected algorithms, we have selected the best performing parameter(s) for each algorithm by grid search. Within at most N combinations of k parameters (p1,p2,...,pk), we sample \(n=\left \lfloor \sqrt [k]{N} \right \rfloor \) values for each parameter. The sample values are uniformly distributed between [mini,maxi] for each parameter pi. For all the competing algorithms, we set N=100 and report the results with the best performing parameters.
Implementation details
The MATLAB code and all used data are available at http://www.ece.tamu.edu/~xqian/ClusterM/. ClusterM is computationally efficient for handling multiple genome-scale PPI networks from different species. The memory consumption of ClusterM depends on the size of the subnetworks to be identified and the runtime of ClusterM can be significantly reduced by parallelization. It takes about 40 minutes for ClusterM to handle four PPI networks in IntActPPIs with 50,876 proteins in total on a laptop computer (16 GB memory and Intel i7 2.9 GHz cpu) with 2-core parallelization.