An integrated model
Sequence information is one of important factors in charactering biological function of genes, RNA and proteins[33]. For example, proteins of a typical family not only share common sequence regions, but also play similar roles in biological processes, molecular function and cellular component. As only a small fraction of a protein sequence is in the functional region, a sequence-based similarity measure is insufficient for the annotation of protein function [34]. PPI network topology can provide complementary information for the prediction of protein function. As used in many other network aligners such as IsoRank, Fuse [35] and Magna, both topology and sequence information are integrated in one similarity measure to search for functionally conserved proteins across species. There are two basic assumptions underlying this methodology: 1) a sequence similarity implies functional conservation; 2) functions are encoded in topology structure of PPI networks.
Sequence-based similarity
Intuitively guided by an assumption that structures determine functions, most of existing network aligners use both amino acid seqeuences and network topology to predict protein functions. Here, we performed an all-against-all sequence comparison using BLASTP [36] on all protein sequences. These protein pairs with significant conserved regions are taken into consideration for further filtrations. Note that e-value is an input parameter to control the coverage of network alignment. Let Ω denote the candidates of homology proteins. Given a protein pair u and v, the sequence similarity s (u,v) can be calculated in the following formula, s h(u, v)=\(\frac {\varepsilon (u, v)-\varepsilon _{min}(u, v)}{\bigtriangleup \varepsilon }\). Here, ε(u,v) can be log(evalue) or bitscore of the protein pair u and v, and △ε is the largest difference between any two pairs of homolog in Ω,△ε= εmax(u,v)−εmin(u,v), which servers as a normalization factor. The most similar one is 1, the least 0.
Topology-based similarity
As protein functions are also encoded in the topology of PPI networks, topological structure can guide us to find functionally conserved proteins. To find the topologically similar protein pairs, a similarity measure is necessary for evaluating the topological similarity for each pair of nodes. The mathematical question is how to calculate a similarity of a pair of nodes, which are from two different networks [37]. In the aligner of IsoRank, it was calculated based on the principle that if two nodes are aligned, then their neighbors should be aligned as well. Our method works on a principle that if two nodes are aligned, then the local induced-subgraphs should be similar.
Given a network G=(V,E),V={v1,v2,...,vn}, we design a 5-tuple-feature vector (γ,σ,τ,η,θ) for each node in V to represent local connections of its corresponding node. Without loss of generality, we denote the adjacent matrix of G as M n×n. Since M is real and symmetric, there must exist a major normalized eigenvector K=(k1,k2...k n). In another words, K is the normalized eigenvector of the largest eigenvalue. Then, ki,1≤i≤n represents the reputation of the node v i. The greater the reputation is, the more important the node is. Therefore, we use k i as the first element of the 5-tuple-feature vector (i.e. γ) to character the node v i. Let us denote the neighbor of v as N v. Then, we use |Nv| as the second element of the 5-tuple-feature vector (i.e. σ), the sum of the reputation of these nodes \(\sum _{x\in N_{v}}k_{x}\) as the third element (i.e. τ). Let us denote these nodes that are 2-step away from v as N\(_{v}^{2}\). It notes that all nodes in \(N_{v}^{2}\) are not directly connected to v. Then, we use \(|N_{v}^{2} |\) as the fourth element (i.e. η). The last element η is calculated by the formula \(\frac {1}{2}\sum _{x\in N_{v}^{2}}k_{x}p_{xv}\). Here, we denote the number of the shortest paths from x to v as p xv. As shown in Fig. 1a, there are two networks G1 and G2. Based on the definition stated above, the 5-tuple-feature vector of a1,a2,a3,a4,a5 in G1 are (1,3,2.63,1,0.16),(0.88,3,2.33,1,0.75),(0.33,1,0.88,2,1),(0.75,2,2,1,0.88),(1,3,2.63,1,0.16), respectively. They are the same for b1,b2,b3,b4,b5 in G2. The vector of each element of all nodes should be normalized in the following step as shown in Fig. 1b. With the normalized 5-tuple-feature vector, the node similarity of any two nodes st(u,v) can be calculated with the Gaussian function \(s_{t}(u, v)= exp(-\frac {1}{2}x^{2})\), where x represents the Euclidean distance between the 5-tuple-feature vector of node u and v. For instance, as shown in Fig. 1a, the vector of ai and bi are the same. Therefore, the diagonal of the similarity matrix is (1,1,1,1,1).
Simulated annealing
To find an optimal network alignment, we applied a linear model to integrate both sequence and topology information. The alignment score can be formulated as \(f(\mathbbm {A})=\sum _{m\in \mathbbm {A}}s_{m}\), where \(\mathbbm {A}\) and m is refer to a global alignment and a matchset, respectively. Suppose m={m1,m2,...,mv}, the alignment score of the matchset is \(s_{m}= \sum _{i=m_{1}}^{m_{v-1}} \sum _{j=i}^{m_{v}} \alpha s_{h}(i,j) + (1-\alpha)s_{t}(i,j)\). By default, α=0.5. User can increase α when he consider the sequence similarity is more important and decrease α when he consider the topological similarity is more important. Therefore, the problem of global network alignment can be modeled as an optimization problem, which is to search for an optimal alignment \(\mathbbm {A}^{*}\), such that \(\mathbbm {A}^{*}=arg \max \limits _{\mathbbm {A}}f(\mathbbm {A})=\sum _{m\in \mathbbm {A}}s_{m}\).
To solve this problem, we used a simulated annealing algorithm [38] to search for an approximately optimal solution. Simulated annealing is a commonly used approach in the discovering of network alignment solutions, as it can rapidly converge in a favorable time complexity [39]. As shown in the pseudocode of simulated annealing, the alignment A was firstly initialized to an empty set \(\varnothing \). Then we repeatedly perturb the current alignment A with a Metropolis scheme P(Δf)=\(e^{\frac {\Delta f}{(Ti*s)}}\) as the equilibrium distribution till the alignment score converges.