Similarity measure of gene expression patterns
Calculating the similarity between gene expression patterns (co-expression degree) by using gene expression data has an important guiding function in understanding the relationship between the corresponding proteins of the gene, and can help to identify whether different proteins have same or similar functions and whether they can be composed as protein complexes or functional modules. At present, there are multiple similarity measurement methods for different data types. Methods such as Euclidean distance, Cosine similarity and Pearson correlation coefficient are usually used to calculate the similarity of gene expression patterns.
(1) Euclidean distance
Euclidean distance is often used to measure the similarity of a pair of gene expression data, that is, a n-dimensional vector. If given the genes u and v, the Euclidean distance between u and v is shown in formula 1:
$$ {d_{euc}}(u,v) = {\left({\sum\limits_{j = 1}^{n} {{{({u_{j}} - {v_{j}})}^{2}}}} \right)^{1/2}} $$
(1)
In above formula, uj and vj are the expression components of gene u and gene v in dimension j.
But Euclidean distance is not suitable for calculating similarity between gene expression patterns with different dimensions. Therefore, it must be standardized to meet the requirements as mean equal zero and variance equal one when using Euclidean distance to measure the similarity of gene expression data.
(2) Cosine similarity, formula 2 as follow:
$$ {}\cos (\theta) = \frac{{A \cdot B}}{{\left\| A \right\|\left\| B \right\|}} = \frac{{{\sum\nolimits}_{i = 1}^{n} {{A_{i}} \times {B_{i}}} }}{{\sqrt {{\sum\nolimits}_{i = 1}^{n} {{{({A_{i}})}^{2}}}} \times \sqrt {{\sum\nolimits}_{i = 1}^{n} {{{({B_{i}})}^{2}}}} }} $$
(2)
The larger the cosine value, the greater the similarity of gene expression patterns. When the cosine similarity is one, the gene expression patterns are completely consistent.
(3) Pearson correlation coefficient:
PCC is also an extensive used method for calculating the similarity of gene expression data. Given a gene u and a gene v, the calculation formula of the Pearson correlation coefficient between the two genes is shown in formula 3:
$$ {r_{pea}}(u,v) = \frac{{\sum\limits_{j = 1}^{n} {({u_{j}} - \overline u)({v_{j}} - \overline v)} }}{{\sqrt {\sum\limits_{j = 1}^{n} {{{({u_{j}} - \overline u)}^{2}}}} \sqrt {\sum\limits_{j = 1}^{n} {{{({v_{j}} - \overline v)}^{2}}}} }} $$
(3)
In above formula, the definition of \({\overline u}\) and \({\overline v}\) are as follow:
$$\overline u = \frac{1}{n}\sum\limits_{j = 1}^{n} {{u_{j}}},\qquad \overline v = \frac{1}{n}\sum\limits_{j = 1}^{n} {{v_{j}}} $$
Since the Pearson correlation coefficient is sensitive to outlier data, false positive data is likely occur in the results, giving higher similarity values to dissimilar gene pairs, which will cause errors in the results. To avoid that, this paper measures the similarity of gene pairs by calculating the Jackknife correlation coefficient. Given n gene expression data samples under different conditions, the expression value of gene u under condition j is expressed as uj, given gene u and gene v, the Jackknife correlation coefficient GEC between the two genes can be obtained by the following formula 4:
$$ GEC(u,v) = min\{ {r_{pea}}({u^{{\mathrm{(}}j{\mathrm{)}}}},{v^{{\mathrm{(}}j{\mathrm{)}}}}):j = 1,2,...,n\} $$
(4)
In the above formula, rpea(·,·) is defined in formula 3, the definition of u(j) and v(j) :
$$\begin{aligned} {u^{(j)}} &= {({u_{1}},...,{u_{j - 1}},{u_{j + 1}},...,{u_{n}})^{T}},\\ {v^{(j)}} &= {({v_{1}},...,{v_{j - 1}},{v_{j + 1}},...,{v_{n}})^{T}} \end{aligned} $$
In above formula, j=1,2,…,n.
Network reconstruction
Wang X [19] proposed the small world and scale-free network characteristics of complex networks such as PPI networks. Goldberg D S [20] et al. proposed the concept of edge-based mutual clustering coefficient based on the small world network characteristics of the PPI network to quantify the network structure. After calculating the MCC values of all edges in the network, setting a threshold and selecting a reliable structure which above the set threshold. Samanta MP [21] et al. found through experiments that if the number of adjacent junctions where two proteins act together is large, they have a close functional relationship. Segura J et al. [22] proposed a new method of using neighborhood cohesion to infer the interaction between protein interaction networks. Experimental results show that this method has good performance and can effectively predict PPI network interaction pairs. Based on those, we use topology coefficient PTC as a quantitative representation of PPI network topological structure feature PTC is obtained by parameter α adjustment with topological coefficient T(u,v) which representing the number of neighboring nodes of a node and a clustering factor Cn which representing the sharing of interaction nodes with other nodes. The calculation formula of PTC is shown in formula 5.
Combining the similarity of the PTC representing the network topology with gene expression patterns, the weight w(u,v) of the protein interaction pair in the PPI network is re-assigned and defined as the product of T(u,v) and GEC(u,v), as shown in formula 6:
$$ PTC(u,v) = \alpha {C_{n}} + (1 - \alpha)T(u,v) $$
(5)
$$ \omega (u,v) = PTC(u,v)*GEC(u,v) $$
(6)
The weight w(u) of node u is presented by the sum of node u and its edge in the PPI network, the formula is as follow:
$$ \omega (u) = \sum\limits_{(u,v) \in E} {\omega (u,v)} $$
(7)
In the networks, the clustering factor indicates the strength of the connecting edges between the neighboring nodes of a node, and the topology factor indicates the strength of the neighboring nodes of the node. The clustering factor and the topological factor are assigned weights through parameters and combined, then the topological structure of the network can be fully expressed. PTC measures the density of adjacent nodes between a node and its neighboring nodes, and the value of the coefficient ranges from 0 to 1.The larger the PTC value, the more likely the neighboring nodes of the node will appear in the same cluster. GEC represents the corresponding gene expression similarity of protein interaction pair, that is, gene expression correlation measures the correlation between two proteins, and its value is between -1 and 1,the higher the GEC value, the higher the degree of protein co-expression, the greater the probability of appearing in the same functional module. Therefore, we weight the protein interaction by combining the topological structure of the PPI network and the correlation of gene expression, and the network distance between two nodes is a re-weighting of the topological distance in the network. Comprehensively consider PTC and GEC to calculate the probability that a node and its neighbor nodes appear in a cluster.
After integrating the topological coefficient PTC of the PPI network and the gene expression correlation GEC to calculate the w of all nodes in the graph, sorting w value of all nodes, and then choosing the highest weight as starting point.
Algorithm description
Figure 1 shows the ECTG process, ECTG decomposes the PPI network into closely connected subgraphs to detect functional modules. The process is mainly divided into four steps. The first step is to construct a PPI network diagram with attributes based on the PPI network and gene expression data. The second step is to construct a weighted attribute PPI graph using PTC and GEC, given the attributed PPI network graph obtained in the first step, ECTG determines the weight of each edge in the graph according to the topological coefficient and the similarity of gene expression. In the third step, given a weighted graph, EA maximizes the connection weight to produce a compact graph clusters. In the fourth step, given graph clusters, a breadth-first search strategy is adopted, and searching subgraphs in each graph cluster according to the homogeneity of the attribute values of the connected nodes. The vertices of these subgraphs have similar attribute values and are relatively dense, and have a good correspondence with protein complexes in real life.
ECTG searches PPI pairs with higher values in each subgraph, and then continuously absorbs seed nodes to form modules. After ECTG has calculated all the values of w in the PPI network, the breadth-first search method BFS (breadth-first search) is used to extend the seeds, and form a protein complex finally. BFS can be divided into two stages, the first step: select an edge with the maximum w value wmax first, and then incorporate the two end points vi and vj connecting the edge into the seed node set of a protein complex; the second step: on the basis of wmax, search for all adjacent nodes of vi and vj and extend all the nodes whose w value is greater than the threshold λ into the protein complex. The extended node definition is shown in formula 8:
$$ e(seed:{v_{k}}) = \left\{ \begin{array}{l} e \cup {v_{m}}{\mathrm{ }}\qquad if\ {\mathrm{ }}{w_{km}} \ge \lambda \\ e \cup \Phi {\mathrm{ }}\qquad otherwise \end{array} \right. $$
(8)
In the above formula, vk represents the node in the seed set, and vm represents the node adjacent to the node vk. Only points whose w value is greater than the threshold can be merged into the set. The second stage of the search process will continue until no new nodes are added to the seed set. When a cluster completes the above search, ECTG will use the protein in the seed set to form a protein complex. Until all nodes are traversed, ECTG stops absorbing nodes. Due to the high probability of appearing small-scale modules using the above search strategy, ECTG will delete those modules that have been identified as containing less than 3 nodes. In order to reduce the redundancy of proteins in the recognition module, ECTG calculates the overlap score between any module and all others. The definition of overlap score is shown in formula 9:
$$ O{v_{r}} = \max \frac{{\left| {e \cap P{C_{I}}} \right|}}{{\left| {e \cup P{C_{I}}} \right|}} $$
(9)
where e and PCI respectively refer to the module obtained after a search and any other modules in the result set. ECTG then uses a threshold OvMax to exclude those modules whose overlap score is higher than the threshold. In order to explain the ECTG method in more detail, we give its pseudo code, as shown in Algorithm 1.
The input information of ECTG includes: PPI network, gene expression data, parameter α used to control the weight of topological coefficients, used to filter out threshold λ that do not meet similarity, and used to filter the nodes with higher repeated nodes between the obtained modules.