### The ClusterM algorithm

#### Topological separability by conductance

One of the major innovations in ClusterM as well as its core strength is that it explicitly considers topological separability when searching for conserved protein complexes. In this paper, we adopt the definition of conductance to measure the topological separability of a subnetwork. Let *G*=(*V*,*E*) represent a PPI network, where *V* denotes the set of proteins in *G* and *E* is the interaction set. *A* is the corresponding adjacency matrix of *G*, where the element *A*_{ij}=1 denotes the protein *i* interacts with the protein *j* and *A*_{ij}=0 otherwise. The degree matrix *D* of *G* is a diagonal matrix with *D*_{ii}=*d*_{i}, where *d*_{i} is the number of interactions connecting to the protein *i*.

For a subnetwork *S* as a potential protein complex, the conductance of *S* in *G* is defined as

$$ \phi(S) = \frac{|E(S, \bar{S})|}{\textup{min}\left \{ vol(S), vol(\bar{S}) \right \} }, S \cup \bar{S} = V, $$

(1)

where \(E(S, \bar {S})\) denotes the set of edges between *S* and the rest of the network \(\bar {S}\), and \(vol(S) = \sum _{i\in S} d_{i}\) is the number of interactions in *S*. As *v**o**l*(*S*) is typically much smaller than the total number of interactions in *G*: *v**o**l*(*S*)≪*v**o**l*(*V*), indicating \(vol(S) = \textup {min}\left \{ vol(S), vol(\bar {S}) \right \}\), we have

$$ \phi(S) = \frac{|E(S, \bar{S})|}{vol(S)}=\frac{\sum_{i,j\in V_{S}} D_{ij} - A_{ij}}{\sum_{i\in V_{S}} D_{ii}}, $$

(2)

where *V*_{S} is the vertex set containing all vertexes in the subnetwork *S*.

#### The algorithm

ClusterM builds on the intuition that conserved protein complexes should simultaneously possess the following two properties. First, topologically, conserved protein complexes in each PPI network should be well separated from the rest of the network and proteins within the complexes should be densely connected in order to give rise to a unique and specific biological form and function. Second, across species, there should exist many homologous proteins in the conserved complexes, which can be practically reflected by high overall protein sequence similarity. Given *k* PPI networks \(\mathbf {G}=\{\mathcal {G}_{1}, \mathcal {G}_{2},... \mathcal {G}_{k}\}\), where \(\mathcal {G}_{j}(\mathcal {V}_{j}, \mathcal {E}_{j})\) is the *j*th network with \(\mathcal {V}_{j}\) and \(\mathcal {E}_{j}\) denoting the corresponding proteins and interactions respectively, we use a binary adjacency matrix *A*_{j} to represent \(\mathcal {G}_{j}\) and use a diagonal matrix *D*_{j} to represent the degree matrix of \(\mathcal {G}_{j}\) with the number of interactions of each protein on its diagonal.

To identify conserved protein complexes with the aforementioned properties, our ClusterM algorithm takes three major steps:

**1. Initial seeds with protein spines from multiple network alignment:** In the first step, *h* homologous seeds can be identified as protein spines \(\mathcal {U} = \left \{u^{1}, u^{2},..., u^{h}\right \}\) by using a multiple network alignment method, where \(u^{i} = \left \{ \left (v_{1}^{i}, v_{2}^{i},..., v_{k}^{i}\right) \mid v_{1}^{i}\in \mathcal {V}_{1}, v_{2}^{i}\in \mathcal {V}_{2},..., v_{k}^{i}\in \mathcal {V}_{k}\right \}\) constitutes the *i*th protein spine detected through network alignment. In this work, we adopt SMETANA [28], which has been shown to be accurate and scalable. When detecting these conserved protein spines, SMETANA [28] takes both protein interaction and sequence information into account. By iterating through all protein spines in \(\mathcal {U}\), ClusterM adopts a divide-and-conquer strategy to identify potential conserved protein complexes from each protein spine seed across the given networks.

**2. Minimum-conductance set for topological separability (Task 1):** First, for protein \(v_{j}^{i}\) in protein spine *u*^{i}, a minimum-conductance protein set \(\mathcal {\hat {H}}_{j}^{i}\) including \(v_{j}^{i}\), well separated from the rest of the network \(\mathcal {G}_{j}\), is identified based on a novel local optimization algorithm. Initially, we acquire a set \(\mathcal {H}_{j}^{i}\) consisting of *m* proteins obtained in terms of the ranking of the personalized PageRank vector with respect to \(v_{j}^{i}\) in \(\mathcal {G}_{j}\) [29]. We further refine the results to yield \(\mathcal {\hat {H}}_{j}^{i}\) based on the definition of the conductance (2) by solving the following optimization problem:

$$ \begin{aligned} \textup{min:} & \ \frac{\mathbf{x}^{T}\left(D^{\mathcal{H}_{j}^{i}}-A^{\mathcal{H}_{j}^{i}}\right)\mathbf{x}}{\mathbf{x}^{T}D^{\mathcal{H}_{j}^{i}}\mathbf{x}}\\ s.t. & \ \mathbf{x}_{v_{j}^{i}} = 1, \ \mathbf{x}_{i} \in \{0,1\}, \end{aligned} $$

(3)

where **x** is a binary vector with **x**_{i}=1 indicating that protein *i* is assigned into \(\mathcal {\hat {H}}_{j}^{i}\) and **x**_{i}=0 otherwise; \(A^{\mathcal {H}_{j}^{i}}\) and \(D^{\mathcal {H}_{j}^{i}}\) are adjacency and degree matrices of the induced subnetwork with respect to the protein set \(\mathcal {H}_{j}^{i}\). The problem can be solved by transforming it into a mixed integer program (MIP) [30]. After algebraic manipulations, (3) can be transformed into the following equivalent MIP formulation:

$$ \begin{aligned} \textup{min:}& \ \ z\\ s.t. & \ \ z\sum_{i} \mathbf{x}_{i}d^{\mathcal{H}_{j}^{i}}_{i} - \sum_{i} \sum_{j} (D^{\mathcal{H}_{j}^{i}} - A^{\mathcal{H}_{j}^{i}})\mathbf{x}_{i}\mathbf{x}_{j} \geq 0, \\ & \ \ \mathbf{x}_{v_{j}^{i}} = 1, \mathbf{x}_{i} \in \{0,1\}, \end{aligned} $$

(4)

where \(d^{\mathcal {H}_{j}^{i}}_{i}\) is the *i*th elements on the diagonal of \(D^{\mathcal {H}_{j}^{i}}\). After using standard linearization techniques [30] to linearize the terms such as *z***x**_{i} and **x**_{i}**x**_{j}, the optimization problem can be solved exactly by existing MIP solvers. We emphasize that we here obtain the exact minimum-conductance set around the selected seed, which is critical to identify potential protein complexes with high topological separability from the rest of the networks. Because ClusterM searches for the minimum-conductance set locally around each involved protein, and since the size of each protein set \(\mathcal {H}_{j}^{i}\) is much smaller than the size of the entire PPI network in such a divide-and-conquer strategy, we can efficiently obtain the minimum-conductance set \(\mathcal {\hat {H}}_{j}^{i}\) in \(\mathcal {H}_{j}^{i}\) based on (4) [30].

Here, \(\mathcal {\hat {H}}_{j}^{i}\) is the subnetwork including protein \(v_{j}^{i}\) that is well separated from the other proteins in the PPI network \(\mathcal {G}_{j}\). The procedure of identifying \(\mathcal {\hat {H}}_{j}^{i}\) explicitly characterizes the external separability of the subnetwork \(\mathcal {\hat {H}}_{j}^{i}\), which is the main advantage of ClusterM over other existing network comparative approaches [13–15, 17, 18] that allows the algorithm to effectively capture the desired topological property of conserved protein complexes.

**3. Conserved protein complexes with desired topological and homologous properties (Task 2):** The next task is to collect \(\mathbf {H}^{i} = \left \{ \mathcal {\hat {H}}_{1}^{i},..., \mathcal {\hat {H}}_{k}^{i} \right \}\) for the corresponding protein spine *u*^{i} with \(\mathbf {S}^{i} = \{ s(p,q) \mid p \in \mathcal {\hat {H}}_{m}^{i}, q \in \mathcal {\hat {H}}_{n}^{i}, \ m\ne n, \ \forall p,q,m,n \}\) denoting potential homologous correspondence within **H**^{i}. The homologous correspondence between proteins *p* and *q* can be approximated by the protein sequence similarity *s*(*p*,*q*)∈[0,1] [31]:

$$ s(p, q) = \frac{blast(p, q)}{\sqrt{blast(p, p) \times blast(q, q)}}, $$

(5)

where *b**l**a**s**t*(*p*,*q*) is the bit score of the sequence similarity obtained by the local sequence alignment tool BLAST [20]. We note that better homologous correspondence by incorporating functional annotations may further improve the performance of conserved protein complex identification. However, we do not explore such directions to avoid biased performance evaluation due to repeated usage of functional annotations in our algorithm and evaluation metrics.

To detect conserved complexes with high interaction density as well as high sequence similarity between proteins across different species, we propose to optimize the following cost function for protein spine *u*^{i}:

$$ {}\begin{aligned} \mathcal{F}^{i}= -\left (\sum_{j=1}^{k}\frac{\sum\limits_{a,b\in \mathcal{\hat{H}}_{j}^{i}}{A^{\mathcal{\hat{H}}_{j}^{i}}}(a,b)\delta_{a} \delta_{b} }{\sum\limits_{c\in \mathcal{\hat{H}}_{j}^{i}} \delta_{c}}+\lambda \frac{\sum\limits_{j\ne l}\sum\limits_{a\in \mathcal{\hat{H}}_{j}^{i}}\sum\limits_{b\in\mathcal{\hat{H}}_{l}^{i}}s(a,b)\delta_{a} \delta_{b} }{\sum\limits_{j=1}^{k}\sum\limits_{c\in \mathcal{\hat{H}}_{j}^{i}} \delta_{c}} \right), \end{aligned} $$

(6)

where the binary value \(A^{\mathcal {\hat {H}}_{j}^{i}}(a,b)\) indicates whether there is an interaction between proteins *a* and *b* in the subnetwork induced by the protein set \(\mathcal {\hat {H}}_{j}^{i}\), and *δ*_{a} is an indicator function with *δ*_{a}=1 if protein *a* appears in the identified conserved protein complex with respect to *u*^{i} and *δ*_{a}=0 otherwise. The first term is essentially the summation of the interaction density [32] of each subnetwork, which characterizes the topological cohesiveness of each subnetwork. The second term characterizes the homology correspondence represented by the summation of the protein sequence similarities across subnetworks divided by the total number of proteins in the identified conserved protein complex. The coefficient *λ* balances the contributions from network topology and protein homology information. We optimize the cost function (6) in a greedy manner by first recruiting all the proteins in **H**^{i}, and greedily removing proteins based on their contributions to the cost function until further deletion does not reduce the cost function any more. The greedy algorithm outputs the conserved protein complex \(\mathcal {C}^{i}\) with respect to the protein spine *u*^{i}.

Figure 5 illustrates how ClusterM handles a protein spine in a pair of networks to search for conserved protein complex locally. Starting from the protein spine in the red dashed line, ClusterM first identifies the proteins in blue based on the personalized PageRank vectors. It further refines the results for separability by MIP (3) and recognize the proteins in the gray shade to be the subnetworks that are well separated from the rest of the corresponding networks. Finally, ClusterM finds the conserved protein complex in golden color based on the cost function (6).

In the post-processing step of ClusterM, we remove duplicated conserved protein complexes, and delete the *i*th conserved complex \(\mathcal {C}^{i}\) if the summation of the sequence similarity score (\(\sum _{(p,q) \in \mathcal {C}^{i}} s(p,q)\)) within \(\mathcal {C}^{i}\) is lower than *β* (set to *β*=0.1 in this work).

#### Scalability and parallelism

The major computation in ClusterM involves SMETANA for identifying homologous seeds, approximating the PageRank vector near a protein, solving the MIP (3), and the greedy algorithm to optimize (6). The scalability of SMETANA has been demonstrated in [28]. The computational complexity for approximating the PageRank vector near a protein is proportional to the number of involved local neighbors [29]. There is no scalability issue for solving the MIP (3) because we only consider *m* local proteins near the protein spine for individual PPI networks. Both the time and space complexities of the greedy algorithm based on (6) are *O*(*k*^{2}*m*^{2}), where *k* is the number of species. Therefore, due to the divide-and-conquer strategy, ClusterM has better scalability for handling multiple species.

For each protein spine *u*^{i}, the identification of the well-separated subnetworks \(\mathcal {\hat {H}}_{1}^{i},..., \mathcal {\hat {H}}_{k}^{i}\) are independent to each other. Furthermore, the protein spines and the identification of the potential conserved protein complexes are also independent. Therefore, the computation of \(\mathcal {\hat {H}}_{1}^{i},..., \mathcal {\hat {H}}_{k}^{i}\) and the detection of conserved protein complexes can be easily parallelized to further reduce the total computational time for ClusterM.

### Datasets

#### Construction of DIPPPIs and IntActPPIs

We download PPI networks and protein amino acids sequences for human, yeast, fly, and worm from DIP (version 20150101) and IntAct (version 20150120). To obtain a consistent protein symbol for PPI networks and conserved protein complexes, we change all protein names utilized in this study to UniProt protein symbols. The sequence similarity for each protein pair can be obtained using a local sequence alignment tool BLAST and we select the highest BLAST bit score for each protein pair as their sequence similarity score. Then, we normalize the BLAST bit score based on the equation (5) to obtain protein homologous correspondence and threshold the sequence similarity score at 0.1.

#### The yeast-human reference conserved protein complexes

We align yeast complexes in CYC2008 [24] and human complexes in CORUM (February 2012) based on GO terms to obtain the gold standard for yeast and human conserved protein complexes. CYC2008 provides a specific GO term associated to every collected yeast complex. For human complex in CORUM, we first download the mapping of human genes and proteins to GO terms (version 20150329) [33] and annotate proteins in CORUM with GO terms that also appear in CYC2008. If a human complex contains at least half the number of proteins annotated to a GO term in CYC2008, we align the human complex and the yeast complex with the same GO term annotation.

#### High-level GO terms

When analyzing biological consistency, we only consider the high-level GO terms, which suggest specific biological functions. A GO term is defined as high-level if its information content is larger than two. The definition of the information content of a GO term *g* is *I**C*=−log(|*g*|/|*r**o**o**t*|) [34], where “root” is the corresponding root GO term (either biological process, molecular function, or cellular component) of *g* and the operation |·| counts the number of proteins annotated to a specified GO term. Additionally, we remove GO terms “inferred from electronic annotation”, “inferred from protein interactions”, and “inferred from sequence or structural similarity”, because we utilize protein-protein interactions and the protein sequence similarities in our algorithm.

### Metrics for conserved protein complex prediction for pairwise PPI networks

The conserved protein complexes from pairwise PPI network alignment contain proteins from two species. Therefore, we need to examine the correspondence between the reference and the identified conserved complexes species by species. For all competing algorithms, we remove the identified conserved protein complexes if they contain fewer than three proteins for any species.

#### Fraction of the matched reference conserved protein complexes

For two species, given a reference conserved complex *A*={*A*_{1},*A*_{2}} and an identified conserved complex *B*={*B*_{1},*B*_{2}}, where *A*_{1} and *B*_{1} are complexes in the first species and *A*_{2} and *B*_{2} are complexes in the second species, we consider *B* matches *A* if the neighborhood affinity scores for complexes *w*(*A*_{1},*B*_{1})≥0.25 and *w*(*A*_{2},*B*_{2})≥0.25. The neighborhood affinity score for complexes can be calculated by

$$ w(X,Y) = \frac{\left | X \cap Y\right |}{\left | X \right |} \cdot \frac{\left | X \cap Y\right |}{\left | Y \right |} = \frac{\left | X \cap Y\right |^{2}}{\left | X \right |\left | Y \right |}, $$

(7)

where *X* and *Y* are two protein sets. Therefore, we can compute the fraction of the number of matched reference conserved complexes.

Assuming the reference conserved protein complexes set is *R* and the identified conserved protein complexes set is *P*, then the fraction of the number of matched reference conserved complexes can be computed as

$$ \textup{frac} = \frac{|C|}{|R|}, $$

(8)

where *C*={*S* | *w*(*S*_{1},*T*_{1})≥0.25 and *w*(*S*_{2},*T*_{2})≥0.25, ∀*S*∈*R*, *T*∈*P*}.

#### Accuracy score

Suppose that we have *n* reference conserved complex \(A^{j} = \left \{A^{j}_{1}, A^{j}_{2}\right \}, j=1,2,...,n\), and *m* identified complex \(B^{i} = \left \{ B^{i}_{1}, B^{i}_{2}\right \}, i=1,2,...,m\), and each has complexes \(A^{j}_{1}\) and \(B^{i}_{1}\) from the first species and another complexes \(A^{j}_{2}\) and \(B^{i}_{2}\) in the second species. Let *t*_{ij} denotes the number of proteins that exist in both the reference complex *A*^{j} and the identified complex *B*^{i}, and *w*_{j} represent the number of proteins in the *j*th reference complex. That is, \(t_{ij} = \left | A^{j}_{1} \bigcap B^{i}_{1} \right | + \left | A^{j}_{2} \bigcap B^{i}_{2} \right |\) and \(w_{j} = \left | A^{j}_{1} \bigcup A^{j}_{2} \right |\). Then, complex-wise sensitivity (Sn) and positive-predictive value (PPV) can be defined as

$$ \textup{Sn} = \frac{\sum_{j=1}^{n} \underset{i=1,..., m}{max}t_{ij}}{\sum_{j=1}^{n} w_{j}}; \ \ \ \ \textup{PPV} = \frac{\sum_{i=1}^{m} \underset{j=1,..., n}{max}t_{ij}}{\sum_{i=1}^{m}\sum_{j=1}^{n}t_{ij}}. $$

(9)

The geometric accuracy (Acc) score is the geometric mean of Sn and PPV: \(\textup {Acc}=\sqrt {\textup {Sn} \times \textup {PPV}}\).

#### The maximum matching ratio for conserved protein complexes (MMRC)

We can quantify the overlap between the reference and the identified conserved complexes by the maximum matching ratio for conserved complexes (MMRC). MMRC is derived based on MMR [6]. The maximum matching ratio [6] is the maximum sum of weights of edges in a bipartite graph, where the two sets of nodes represent reference complexes *C* and identified complexes *S*. The bipartite graph is represented by a weighted matrix *B*_{n×m}, where each weight *B*_{ij} is the neighborhood affinity score *w*(*c*_{i},*s*_{j}) introduced earlier for the corresponding edge between complexes *c*_{i} and *s*_{j}. For efficiency, we only use *B*_{ij}≥0.25. The MMR is the solution to the following maximal matching problem.

$$ \begin{aligned} \textup{max:} &\ \ \frac{1}{|C|}\sum_{i=1}^{n}\sum_{j=1}^{m} B_{ij} \sigma_{c_{i}, s_{j}}\\ s.t. & \ \ \sum_{j=1}^{m} \sigma_{c_{i}, s_{j}} \leq 1 \\ & \ \ \sum_{i=1}^{n} \sigma_{c_{i}, s_{j}} \leq 1, \end{aligned} $$

(10)

where *σ* is an indicator function with \(\sigma _{c_{i}, s_{j}} = 1\) when the edge between complexes *c*_{i} and *s*_{j} is selected and \(\sigma _{c_{i}, s_{j}} = 0\) otherwise.

MMRC can be obtained by modifying the overlapping weights *w*(*c*_{i},*s*_{j}) between reference and identified conserved complexes. The overlapping weight between reference conserved complex *A* and identified conserved complex *B* is defined as

$$ o(A,B) = \frac{2w(A_{1},B_{1})\times w(A_{2},B_{2})}{w(A_{1},B_{1})+w(A_{2},B_{2})}. $$

(11)

### Metrics for GO term consistency and coverage

We measure the functional consistency for the proteins in the conserved complexes by computing the mean normalized entropy (MNE) [7, 10] scores. We use the high-level GO term set *F* to annotate each protein in a conserved complex *R*_{i}. The union of GO terms used for *R*_{i} is *F*_{i}={*f*_{1},*f*_{2},...,*f*_{d}}. The normalized entropy (NE) of *R*_{i} is computed as

$$ \textup{NE}(R_{i}) = \textup{NE}(p_{1}, p_{2},..., p_{d}) = -\frac{1}{\textup{log}(d)} \sum_{j} p_{j} \cdot \textup{log}\left(p_{j} \right), $$

(12)

where *p*_{i} is the fraction of *R*_{i} with respect to the GO term *f*_{i}. The MNE is the mean value over all NE (*R*_{i}). For the coverage, we simply count the number of proteins in all conserved complexes, each of which consists of at least three proteins from each network.

### Parameter selection for each method

Among the tested algorithms, ClusterM and NetworkBLAST-M have only a single tuning parameter, while NetworkBLAST and MaWISh have 5 and 7 tuning parameters, respectively. In order to obtain the best performing parameters for fair comparison of the selected algorithms, we have selected the best performing parameter(s) for each algorithm by grid search. Within at most *N* combinations of *k* parameters (*p*_{1},*p*_{2},...,*p*_{k}), we sample \(n=\left \lfloor \sqrt [k]{N} \right \rfloor \) values for each parameter. The sample values are uniformly distributed between [*m**i**n*_{i},*m**a**x*_{i}] for each parameter *p*_{i}. For all the competing algorithms, we set *N*=100 and report the results with the best performing parameters.

### Implementation details

The MATLAB code and all used data are available at http://www.ece.tamu.edu/~xqian/ClusterM/. ClusterM is computationally efficient for handling multiple genome-scale PPI networks from different species. The memory consumption of ClusterM depends on the size of the subnetworks to be identified and the runtime of ClusterM can be significantly reduced by parallelization. It takes about 40 minutes for ClusterM to handle four PPI networks in IntActPPIs with 50,876 proteins in total on a laptop computer (16 GB memory and Intel i7 2.9 GHz cpu) with 2-core parallelization.