Protein functional module identification method combining topological features and gene expression data

Background The study of protein complexes and protein functional modules has become an important method to further understand the mechanism and organization of life activities. The clustering algorithms used to analyze the information contained in protein-protein interaction network are effective ways to explore the characteristics of protein functional modules. Results This paper conducts an intensive study on the problems of low recognition efficiency and noise in the overlapping structure of protein functional modules, based on topological characteristics of PPI network. Developing a protein function module recognition method ECTG based on Topological Features and Gene expression data for Protein Complex Identification. Conclusions The algorithm can effectively remove the noise data reflected by calculating the topological structure characteristic values in the PPI network through the similarity of gene expression patterns, and also properly use the information hidden in the gene expression data. The experimental results show that the ECTG algorithm can detect protein functional modules better.

for development in the method of using two information sources for analysis.
We find that topological structure and attribute information are very effective in identifying protein complexes by analyzing the existing mainstream PPI network methods for identifying protein functional modules [12,13], even though there are not much approaches take both information into consideration. Moreover, many algorithms for detecting protein functional modules use some special optimized attributes to find clusters, obviously, the process of detecting protein functional modules can be regarded as an optimization problem [14,15]. Therefore, this paper proposes a new protein complex recognition algorithm. ECTG(Evolutionary Clustering Algorithm Based on Topological Features and Gene expression data for Protein Complex Identification). This method is based on evolutionary algorithm (EA), which effectively fuses protein topology and gene expression data. It has an advantage of dispensing with working under linear constraints like a typical numerical optimization problem. It can also find multiple solutions and be executed in parallel, so it can solve big data source problem quickly and efficiently. In order to verify the performance of ECTG, we conducted experiments on three real PPI network data sets [16][17][18]: DIP, Krogan, and Gavin. The used compound standard set was the CYC2008 data set. The experimental results show that the algorithm proposed in this paper has more obvious advantages in multiple indicators.

Similarity measure of gene expression patterns
Calculating the similarity between gene expression patterns (co-expression degree) by using gene expression data has an important guiding function in understanding the relationship between the corresponding proteins of the gene, and can help to identify whether different proteins have same or similar functions and whether they can be composed as protein complexes or functional modules. At present, there are multiple similarity measurement methods for different data types. Methods such as Euclidean distance, Cosine similarity and Pearson correlation coefficient are usually used to calculate the similarity of gene expression patterns.

(1) Euclidean distance
Euclidean distance is often used to measure the similarity of a pair of gene expression data, that is, a n-dimensional vector. If given the genes u and v, the Euclidean distance between u and v is shown in formula 1: In above formula, u j and v j are the expression components of gene u and gene v in dimension j.
But Euclidean distance is not suitable for calculating similarity between gene expression patterns with different dimensions. Therefore, it must be standardized to meet the requirements as mean equal zero and variance equal one when using Euclidean distance to measure the similarity of gene expression data.
(2) Cosine similarity, formula 2 as follow: The larger the cosine value, the greater the similarity of gene expression patterns. When the cosine similarity is one, the gene expression patterns are completely consistent.

(3) Pearson correlation coefficient:
PCC is also an extensive used method for calculating the similarity of gene expression data. Given a gene u and a gene v, the calculation formula of the Pearson correlation coefficient between the two genes is shown in formula 3: In above formula, the definition of u and v are as follow: Since the Pearson correlation coefficient is sensitive to outlier data, false positive data is likely occur in the results, giving higher similarity values to dissimilar gene pairs, which will cause errors in the results. To avoid that, this paper measures the similarity of gene pairs by calculating the Jackknife correlation coefficient. Given n gene expression data samples under different conditions, the expression value of gene u under condition j is expressed as u j , given gene u and gene v, the Jackknife correlation coefficient GEC between the two genes can be obtained by the following formula 4: In the above formula, r pea (·, ·) is defined in formula 3, the definition of u (j) and v (j) : In above formula, j = 1, 2, . . . , n.

Network reconstruction
Wang X [19] proposed the small world and scale-free network characteristics of complex networks such as PPI networks. Goldberg D S [20] et al. proposed the concept of edge-based mutual clustering coefficient based on the small world network characteristics of the PPI network to quantify the network structure. After calculating the MCC values of all edges in the network, setting a threshold and selecting a reliable structure which above the set threshold. Samanta MP [21] et al. found through experiments that if the number of adjacent junctions where two proteins act together is large, they have a close functional relationship. Segura J et al. [22] proposed a new method of using neighborhood cohesion to infer the interaction between protein interaction networks. Experimental results show that this method has good performance and can effectively predict PPI network interaction pairs. Based on those, we use topology coefficient PTC as a quantitative representation of PPI network topological structure feature PTC is obtained by parameter α adjustment with topological coefficient T(u, v) which representing the number of neighboring nodes of a node and a clustering factor C n which representing the sharing of interaction nodes with other nodes. The calculation formula of PTC is shown in formula 5.
Combining the similarity of the PTC representing the network topology with gene expression patterns, the weight w(u, v) of the protein interaction pair in the PPI network is re-assigned and defined as the product of T(u, v) and GEC (u, v) , as shown in formula 6: The weight w(u) of node u is presented by the sum of node u and its edge in the PPI network, the formula is as follow: In the networks, the clustering factor indicates the strength of the connecting edges between the neighboring nodes of a node, and the topology factor indicates the strength of the neighboring nodes of the node. The clustering factor and the topological factor are assigned weights through parameters and combined, then the topological structure of the network can be fully expressed. PTC measures the density of adjacent nodes between a node and its neighboring nodes, and the value of the coefficient ranges from 0 to 1.The larger the PTC value, the more likely the neighboring nodes of the node will appear in the same cluster. GEC represents the corresponding gene expression similarity of protein interaction pair, that is, gene expression correlation measures the correlation between two proteins, and its value is between -1 and 1,the higher the GEC value, the higher the degree of protein co-expression, the greater the probability of appearing in the same functional module. Therefore, we weight the protein interaction by combining the topological structure of the PPI network and the correlation of gene expression, and the network distance between two nodes is a re-weighting of the topological distance in the network. Comprehensively consider PTC and GEC to calculate the probability that a node and its neighbor nodes appear in a cluster. After integrating the topological coefficient PTC of the PPI network and the gene expression correlation GEC to calculate the w of all nodes in the graph, sorting w value of all nodes, and then choosing the highest weight as starting point. Figure 1 shows the ECTG process, ECTG decomposes the PPI network into closely connected subgraphs to detect functional modules. The process is mainly divided into four steps. The first step is to construct a PPI network diagram with attributes based on the PPI network and gene expression data. The second step is to construct a weighted attribute PPI graph using PTC and GEC, given the attributed PPI network graph obtained in the first step, ECTG determines the weight of each edge in the graph according to the topological coefficient and the similarity of gene expression. In the third step, given a weighted graph, EA maximizes the connection weight to produce a compact graph clusters. In the fourth step, given graph clusters, a breadth-first search strategy is adopted, and searching subgraphs in each graph cluster according to the homogeneity of the attribute values of the connected nodes. The vertices of these subgraphs have similar attribute values and are relatively dense, and have a good correspondence with protein complexes in real life.

Algorithm description
ECTG searches PPI pairs with higher values in each subgraph, and then continuously absorbs seed nodes to form modules. After ECTG has calculated all the values of w in the PPI network, the breadth-first search method BFS (breadth-first search) is used to extend the seeds, and form a protein complex finally. BFS can be divided into two stages, the first step: select an edge with the maximum w value w max first, and then incorporate the two end points v i and v j connecting the edge into the seed node set of a protein complex; the second step: on the basis of w max , search for all adjacent nodes of v i and v j and extend all the nodes whose w value is greater than the threshold λ into the protein complex. The extended node definition is shown in formula 8:

Fig. 1 Schematic overview of our proposed ECTG model
In the above formula, v k represents the node in the seed set, and v m represents the node adjacent to the node v k . Only points whose w value is greater than the threshold can be merged into the set. The second stage of the search process will continue until no new nodes are added to the seed set. When a cluster completes the above search, ECTG will use the protein in the seed set to form a protein complex. Until all nodes are traversed, ECTG stops absorbing nodes. Due to the high probability of appearing small-scale modules using the above search strategy, ECTG will delete those modules that have been identified as containing less than 3 nodes. In order to reduce the redundancy of proteins in the recognition module, ECTG calculates the overlap score between any module and all others. The definition of overlap score is shown in formula 9: where e and PC I respectively refer to the module obtained after a search and any other modules in the result set. ECTG then uses a threshold OvMax to exclude those modules whose overlap score is higher than the threshold. In order to explain the ECTG method in more detail, we give its pseudo code, as shown in Algorithm 1.
The input information of ECTG includes: PPI network, gene expression data, parameter α used to control the weight of topological coefficients, used to filter out threshold λ that do not meet similarity, and used to filter the nodes with higher repeated nodes between the obtained modules. 4: compute the weight of v, w(v); 5: for each cluster c i do 6: for each vertex v i do 7: find w max ; 8: create a new protein complex e; 9: create a new link list P visiting ; 10: P visiting = P visiting ∪ v i ; 11: P visiting = P visiting ∪ v j ; 12: while P visiting > 0 do 13: v k =head of P visiting ; 14: P visiting -v k ; 15: e = e ∪ v k ; 16: search v m : neighbors of v k ; 17: if ω km ≥ λ then 18: 19: if Ov r ≤ OvMax then 20: PC=PC ∪ e; 21: return PC;

Experimental data set
The experimental process is to link the PPI network and gene expression, and apply the ECTG algorithm to the Saccharomyces cerevisiae data set, which is downloaded from the 2013 version of the DIP database. The network contains 4579 points and 20845 edges after process. And the Krogan and Gavin data sets, the specific information is shown in Table 1. Obviously, there are great differences of the datasets in the number of proteins and proteinprotein interactions. This can increase the credibility of the results obtained by ECTG algorithm and prove to have better generalization ability of propose algorithm. The gene expression data is selected from the publications of Rintala et al. [23], this gene expression data is the data sequence of yeast response to sudden hypoxia [17], that is, the glucose-limited cultivation analysis after the transition from fully aerobic (20.9% O 2 or restricted oxygen (1.0% O 2 ) to anaerobic state. 79 hours (20.9% O 2 ) or 72 hours (1.0% O 2 ) after shifting. These data provide insights into the adaptive mechanism of the transition from respiration to fermentation growth. After processing, the gene expression data has 5664 unique non-empty genes, and each gene expression includes 28 time courses. Comparing the two information, there are 4936 proteins in PPI network and 4616 proteins have gene expression.

Experimental design
When testing method performance, ECTG is compared with different algorithms, including ClusterONE [24], DPClus [25], COACH [26] and CFinder [27]. We use these five methods to detect functional modules in the above three data sets. ClusterONE, DPClus, COACH and CFinder detecting functional modules only based on the topological structure of the PPI network, not make full use of node attribute information. Such as MCL, ClusterONE can be used for weighted PPI network data, which can be compared with the method ECTG using a weighted network. For the above methods, their respective parameter settings are shown in Table 2.  of ECTG is 0.49, which is slightly lower than that of the MCL algorithm, but its recall rate is 0.65, which is much higher than that of MCL, and its F-measure is also about 15% higher than other methods. The situation is similar on the Gavin and Krogan data sets. ECTG obtained the best F-measure values on the 3 data sets. Although ECTG has not always obtained the best Precision and Recall values, has always obtained better F-measure values than other methods, indicating that the performance of this method for detecting functional modules is better than other methods. At the same time, the algorithm results will be affected by the difference of datasets. ECTG can always maintain advanced performance on one or more indexes on three data sets. From experimental results we can conclude that the functional modules obtained by the ECTG method may more accurately represent the real modules in the standard set and have better generalization ability. Regarding the size and coverage of the detected modules, the number of modules identified by ECTG in each set of data is relatively small compared to MCL, the false positives are low, and the coverage is relatively large, so its coverage is relatively high. In order to check whether other algorithms obtain the same or better performance when using the same weighted PPI network data, we compare the results of those algorithms that can process weighted network data, including Clus-terONE and MCL. The results are shown in Table 4. As shown in the table, ECTG's accuracy rate is 0.68 on the Gavin data set, which is slightly lower than the MCL algorithm, but the Recall has increased by nearly 20%, so its F-measure value has increased by about 15% compared with the other two algorithms. When dealing with weighted networks, ClusterONE and MCL use weighted network data generated by combining topology and gene expression data, the performance has varying degrees of improvement. But ECTG is still superior to these two algorithms, and the results show that considering the topological and attribute factors, ECTG's performance is better than the algorithm that only considers the network topology. In short, ECTG performs better in detecting functional modules. It obtains better F-measure results in most data sets. The result is affected by the difference of data sets, but ECTG can always maintain advanced performance on one or more indicators.Therefore, ECTG can achieve better results when regard the task of functional module detection as the problem of considered gene expression data and topology optimization.

Parameter settings
As mentioned earlier, there are three parameters in the ECTG execution process that determine the result of the detection module: α, λ and OvMax. In order to understand how these parameters affect the experimental  It can be seen from figures that overall precision value, recall value and Fmeasure increased by about 12%, 8% and 7% respectively when α equal 0.5 than α equal 0.2. But the number of protein complexes decreased by nearly 50. Comparing with α equal 0.5 When α equal 0.8, the precision value increased by about 14%, the recall value increased by nearly 4%, the F-measure value increased by about 9%, and the number of protein complexes decreased by nearly 20. As α increases, the value of the index is also increasing, and the increment in the range of 0.1-0.5 is lower than the increment in the range of 0.5-1.0. Although the value obtained near α equal 1.0 is relatively high, many complexes that actually exist but do not meet the filter conditions are filtered out, so that the number of modules is relatively small, the Recall value is relatively increased, and the Fmeasure value is relatively increased. This will omit part of the real modules, which is not the best experimental result. Therefore, the best value of α in this experiment is 0.8. Shown in Fig. 4a-c, when α equal 0.8, the changing trends of precision and F-measure are similar when λ and OvMax change, simply setting λ and OvMax near 0 or 1, the obtained results are not optimal. For example, when λ is set to 0.2, no matter how you adjust the value of OvMax, the precision obtained by ECTG is a relatively low value. When a smaller value is used, ECTG includes more nodes with lower similarity, resulting in a larger gap between the clustered modules and the real modules. Although when λ and OvMax are set near 1, ECTG cannot identify those modules that contain more nodes so that some real modules are lost. Considering these conditions, it is necessary to set appropriate values of λ and OvMax for the experimental performance of the ECTG method. As shown in Fig. 4d, ECTG can identify more modules in the PPI network with higher λ and OvMax values, so this method can obtain more protein complexes in the standard set and achieve a higher recall value. Therefore, we expect a method to accurately detect relatively more nodes. In general, we recommend that the values of λ and OvMax are between 0.6 and 0.9 when the ECTG detects the module. When λ and OvMax is properly set in this range, ECTG may perform better. This is why we used the parameter settings shown in Table 2 in the ECTG experiment.

Functional enrichment analysis
The probability of functional homology of actual protein functional modules is very high. This part uses the three kinds of annotation information contained in the GO database [28] and GO: TermFinder to calculate the P-value of the module obtained by the algorithm to determine its biological function significance [29], and mark it's functional annotations, so the P-value [30] of inside modules protein co-occurrence probability need be calculated. The concept of P-value is described as Fig. 3 Results of precision, Recall, F-measure and the number of protein complexes identified by ECTG using α=0.5 and different settings of λ and OvMax follows: given ontology d, we use N to represent the protein quality annotated in ontology d. Given a notes, we will denote the total number of proteins covering a by M. Given a cluster b, n represents the number of protein contained therein, and x represents the number of proteins with a annotations in it. When the ontology d and the term a are randomly given, the probability that the number of proteins is greater than or equal to x in b is represented by P-value. The definition is shown in Fig. 5, and the calculation method is shown in formula 10: In order to verify the effectiveness of ECTG, we calculated the P-values of all modules on the DIP network by comparing the probability of actual exist modules obtained by various algorithms. Using Go: TermFinder, a web-based service that can search for important shared GO items in the obtained module proteins. In our experiment, the threshold range of P-value is set from 1E-15 to 1E-2. That means those GO items whose Pvalue is less than or equal to the threshold are considered to be meaningful. Not all modules with significant GO items have been discovered, that is, they can be found in such as MIPS/CYGD and CYC2008, but they can be considered as true module candidates because of their functional enrichment analysis. After obtaining the P-value of each module, we count protein complexes in the detected modules that contain at least one GO item with a P-value lower than different thresholds.
In addition to analyzing the modules obtained through ECTG, we also calculated GMFTP, MCL, ClusterONE, CFinder, DPClus and COACH. GMFTP has proven to be very effective when considering network topology and functionality. And MCL, IPCA, ClusterONE proved to be more effective for the recognition module after considering the topology. The above-mentioned methods are chosen as the comparison method of ECTG because they all show better robustness on the three data sets mentioned above. The experimental results of ECTG, GMFTP, MCL, ClusterONE, CFinder, DPClus and COACH are shown in Table 5. Obtained from the table: ECTG detects proteins with more significant GO items than other methods, especially when the P-value threshold is low, such as P-value < 1E-15. At the same time, perform specific GO biological process(BP) annotation and GO molecular function(MF) annotation analysis on the identified functional modules on the DIP data set. BP stands for a collection of molecular events that begin and end. These events are closely related to the functions of integrated   In this experiment, if P < ρ, and ρ is the threshold of Pvalue, the predicted cluster P is significant. The figure shows the percentage of important clusters for several ρ values. Figures 6 and 7 show the P-values calculated by BP and MF respectively. It can be seen from the figure that the ECTG method obtains more clusters with lower P-values when detecting modules. This data shows that ECTG can detect more modules rich in biological significance than other methods. No matter how many protein complexes are currently known, they have a higher probability of becoming a real complex that is identified through biological experiments in the future. Based on the results of the P-value experiment, it can be seen that ECTG performs better when detecting functional modules, and is a better method for detecting and predicting protein functional modules. One of the protein functional modules obtained by the ECTG method, and the module size is more than 5 and the matching degree is more than 0.4, the topological structure and biological significance are analyzed as well, and 10 functional modules are selected for listing, as shown in Table 6. They not only have a lower P-value, but also have a higher consistency with known protein functional modules, and their P-value is smaller than 0.001.

Algorithm prediction example analysis
The experimental results show that the functional modules predicted by combining the topological structure of the PPI network and gene expression data can match more modules rich in biological functional significance, and provide beneficial help for predicting protein functional modules and the proteins whose functions have not yet been revealed in the modules. As shown in Fig. 8, the ECTG detection scale is 12 modules, of which 11 belong to the molecular functional group heterocyclic compound binding, so the protein YNL189W may also have this function.

Conclusion
This paper proposes a functional module detection method ECTG that combines topological structure and gene expression data. The idea is expressing the topological structure of the PPI network with quantified topological features, and then using the gene expression data to calculate the similarity of gene expression patterns. Combining the two to re-weight the PPI pairs in the network, reconstructing the PPI network, and finally performing clustering on the PPI network through the EA algorithm. First, the reason and goal of introducing gene expression data and topology structure are explained. Second, the ECTG clustering algorithm is described in detail. Finally, perform clustering experiments on three different yeast mutual data sets to detect functional modules. The analysis of experimental results shows that compared with other methods, the clustering algorithm proposed in this paper has a greater improvement in recall rate and F-measure value, which shows that the combination of topological structure and gene expression data is effective. The advantages of ECTG are reflected in two aspects: (1) It can effectively remove the noise data reflected by calculating