FunGeneNet: a web tool to estimate enrichment of functional interactions in experimental gene sets

Tiys, Evgeny S.; Ivanisenko, Timofey V.; Demenkov, Pavel S.; Ivanisenko, Vladimir A.

doi:10.1186/s12864-018-4474-7

Volume 19 Supplement 3

Selected articles from Belyaev Conference 2017: genomics

Software
Open access
Published: 09 February 2018

BMC Genomics volume 19, Article number: 76 (2018)

Abstract

Background

Estimation of functional connectivity in gene sets derived from genome-wide or other biological experiments is one of the essential tasks of bioinformatics. A promising approach for solving this problem is to compare gene networks built using experimental gene sets with random networks. One of the resources that make such an analysis possible is CrossTalkZ, which uses the FunCoup database. However, existing methods, including CrossTalkZ, do not take into account individual types of interactions, such as protein/protein interactions, expression regulation, transport regulation, catalytic reactions, etc., but rather work with generalized types characterizing the existence of any connection between network members.

Results

We developed the online tool FunGeneNet, which utilizes the ANDSystem and STRING to reconstruct gene networks using experimental gene sets and to estimate their difference from random networks. To compare the reconstructed networks with random ones, the node permutation algorithm implemented in CrossTalkZ was taken as a basis. To study the FunGeneNet applicability, the functional connectivity analysis of networks constructed for gene sets involved in the Gene Ontology biological processes was conducted. We showed that the method sensitivity exceeds 0.8 at a specificity of 0.95. We found that the significance level of the difference between gene networks of biological processes and random networks is determined by the type of connections considered between objects. At the same time, the highest reliability is achieved for the generalized form of connections that takes into account all the individual types of connections. By taking examples of the thyroid cancer networks and the apoptosis network, it is demonstrated that key participants in these processes are involved in the interactions of those types by which these networks differ from random ones.

Conclusions

FunGeneNet is a web tool aimed at proving the functionality of networks in a wide range of sizes of experimental gene sets, both for different global networks and for different types of interactions. Using examples of thyroid cancer and apoptosis networks, we have shown that the links over-represented in the analyzed network in comparison with the random ones make possible a biological interpretation of the original gene/protein sets. The FunGeneNet web tool for assessment of the functional enrichment of networks is available at http://www-bionet.sscc.ru/fungenenet/.

Background

At present, the reconstruction of molecular genetic networks (gene networks) is one of the most widely used approaches for studying the mechanisms of complex biological processes. This approach is often required for solving problems in biology, medicine, and pharmacology, among other fields [1,2,3,4,5,6,7].

Around the world, many databases containing molecular genetic networks describing metabolic processes, diseases, phenotypic traits, etc. have been developed – for example, KEGG PATHWAY [8], BioCyc [9], BioGRID [10] and IntAct [11].

There are systems that allow the reconstruction of gene networks for a given set of genes/proteins including FunCoup [12], STRING [13], Pathway Studio [14], Ingenuity Pathway Analysis [15], PINA [16], GeneMANIA [17] and ReactomeFIViz [18]. These systems use various information sources on interactions of molecular genetic objects, including scientific publications and factual databases. FunCoup is one such system containing more than 37 million interactions that include mRNA/protein co-expression, protein–protein interaction, similarity by phylogenetic profile, binding of shared transcription factors, sub-cellular co-localization and others. STRING is another example of such systems, containing information about protein–protein associations, information obtained from curated databases, predictions (gene neighborhood, gene fusions, gene co-occurrence), text-mining, co-expression, etc.

Earlier, we developed the ANDSystem, which has a wide range of tools for the reconstruction of associative gene networks [19]. The knowledge base of the ANDSystem contains more than 14 million interactions between proteins, genes, metabolites, microRNAs, diseases, biological processes, etc. Information on interactions was extracted from PubMed abstracts using a text-mining method and from various molecular genetic databases. Interactions were subdivided into physical interactions, catalytic reactions, chemical transformations, associations, regulation of expression, activity, transport/release, stability/degradation, etc. The ANDSystem has been used to solve a wide range of tasks related to the reconstruction of gene networks, in particular the interpretation of data from proteomic experiments [20,21,22], the analysis of the tissue-specific effect of gene knockout [23], the analysis of hepatitis C virus interactions [24, 25], the identification of genes associated with susceptibility to tuberculosis [26] and the analysis of molecular mechanisms of comorbidity of diseases [27, 28].

Another well-known approach to the study of functional linkages in gene sets is the analysis of over-representation of Gene Ontology (GO) biological processes, KEGG pathways and diseases. Several computer tools facilitate this task, such as DAVID [29], BINGO [30], GO-function [31] and others. These programs are widely used to interpret the experimental sets of genes obtained in transcriptome analysis, genome-wide association studies, mass spectrometric experiments, etc. [22, 32,33,34,35]. However, such methods do not take into account the structure of the networks that describe interactions between genes. For this reason, several methods for the analysis of gene networks have been developed over the last ten years [36,37,38,39]. One such method is EnrichNet [37], which uses a random walk procedure to estimate the distance between experimentally obtained and predefined functional gene sets inside a network. Comparison of gene networks with random networks is an alternative approach for determining functional connectivity in experimental sets of genes/proteins [40,41,42]. In the work of McCormack et al. [43], a stand-alone tool, CrossTalkZ, was developed to assess the statistical significance of inter- and intra-connectivity (crosstalk enrichment) between or within gene sets. CrossTalkZ uses the FunCoup database for the reconstruction of the gene networks, while random networks are generated by permutations of all edges or nodes in a global network [12].

In this paper, we describe a web tool that allows evaluation of the functional relationship between genes using the STRING and ANDSystem databases, which differ from FunCoup by types of interactions between objects as well as information sources. Based on the analysis of the gene sets involved in GO biological processes, it is shown that the sensitivity of the method exceeds 0.8 at a specificity of 0.95 for both STRING and the ANDSystem. This study identified that the significance of the difference between gene networks of biological processes and random networks depends on the type of interactions (protein-protein interaction, co-expression, expression regulation, etc.). In particular, networks constructed for apoptosis (GO), including separate types of links, such as “activity and transport regulation”, “catalysis”, “co-expression” and “interaction”, were statistically significantly different from random networks. However, as a rule, the greatest reliability was observed for networks that included not individual types of links, but a general type of connection – that is, a type of connection in which two objects are considered to be connected if there is a link between them of any particular form. The FunGeneNet web tool allows users to upload a list of human gene/protein identifiers as an input. The output data is an associative gene network built either by the ANDSystem or STRING, as well as the evaluation of network functionality, expressed as the significance of the network enrichment with links of a given type. FunGeneNet is available at URL: http://www-bionet.sscc.ru/fungenenet/.

Implementation

FunGeneNet algorithm

In the first step, the network is automatically reconstructed for the input list of genes/proteins, using the ANDSystem or STRING knowledge base. The networks used by FunGeneNet are subnetworks obtained from the global ANDSystem or STRING networks. In the STRING networks, vertices correspond only to proteins, which are linked by a generalized type of interaction. In the ANDSystem, genes and proteins are represented by separate objects, which can be linked by various types of interactions, including protein-protein interactions, protein-DNA interactions, regulation of gene expression, activity regulation, etc. In the next step, the subnetwork is filtered by the user-specified interaction type. There are two operation modes in FunGeneNet. The first mode is applied when a user selects “all types” for the interaction. In this case, all interactions present in the FunGeneNet network are considered as a generalized type of interaction. The second mode is used when a user selects a specific type of interaction (for example, “activity and transport regulation”, “catalysis”, etc.). In this case, the system employs only interactions of the specified type, while all others are removed from the network. It should be noted that in the case of STRING, only the generalized interactions are used.
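The two operation modes can be illustrated with a minimal Python sketch. The edge representation (source, target, interaction type), the treatment of links as undirected, and the function name `filter_network` are assumptions made for illustration; they are not taken from the FunGeneNet implementation.

```python
from typing import Iterable, List, Tuple

# Hypothetical edge representation: (source, target, interaction_type).
Edge = Tuple[str, str, str]

ALL_TYPES = "all types"


def filter_network(edges: Iterable[Edge], selected_type: str) -> List[Tuple[str, ...]]:
    """Return the links kept for the selected interaction type.

    In the "all types" mode two objects are linked if any interaction
    connects them, so parallel edges of different types collapse into one
    generalized link. Otherwise only edges of the selected type are kept.
    """
    kept = set()
    for source, target, interaction_type in edges:
        if selected_type == ALL_TYPES or interaction_type == selected_type:
            # Assumption: links are undirected, so A-B and B-A are one link.
            kept.add(tuple(sorted((source, target))))
    return sorted(kept)
```

With the placeholder edges `[("A", "B", "interaction"), ("B", "A", "catalysis")]`, the “all types” mode yields one generalized link between A and B, while selecting “catalysis” keeps only the catalysis edge.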

The method for assessing the functional enrichment consists of comparing the number of links in the analyzed network with the number of links in random networks. For this purpose, the number of links is calculated for 100 random networks, the parameters of the normal distribution are estimated from this sample, and a one-sided single-sample t-test is applied (pnorm function of the R language). In the absence of connections in both the analyzed and all random networks (edgeless networks), the p-value is taken to be 0.5, since in the case of a small non-zero number of edges in the sample of random networks, the p-value for an edgeless network is close to 0.5.
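The calculation described above can be sketched in Python as the upper tail of a normal distribution fitted to the link counts of the random networks, which is what a call to R's pnorm with lower.tail = FALSE would return. The function name and the handling of a zero standard deviation when the observed count differs from the random mean are assumptions; the text specifies only the edgeless case.

```python
import math
import statistics
from typing import Sequence


def enrichment_p_value(observed_links: int, random_links: Sequence[int]) -> float:
    """One-sided p-value for the analyzed network having more links than random."""
    mean = statistics.mean(random_links)
    sd = statistics.stdev(random_links)  # sample standard deviation, as in R's sd()
    if sd == 0:
        if observed_links == mean:
            # Covers the edgeless case from the text: no links anywhere.
            return 0.5
        # Assumption (not stated in the source): degenerate random sample.
        return 0.0 if observed_links > mean else 1.0
    z = (observed_links - mean) / sd
    # Upper tail of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))
```

An analyzed network whose link count equals the random mean receives a p-value of 0.5, and the p-value falls towards 0 as the observed count exceeds the random mean by more standard deviations.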

For the reconstruction of the random networks we used the node permutation approach proposed in [44]. The main difference of our algorithm is that labels of vertices were swapped in the global network, not in the local one. Other randomization methods were not considered because they are significantly inferior in performance to the method of node permutation and do not yield a significant gain in accuracy [43]. Performance in this study was critical because FunGeneNet is a web-application.

Random networks were built according to the following rules: (1) For each protein of the analysed set, the vertex degree in the global network was counted and the set of proteins of the global network with the same vertex degree was determined; (2) One protein was randomly selected from this set, which served as the starting vertex for the reconstruction of the random network; (3) The network reconstruction for the starting vertices was performed as for the network being analyzed.

Thus, each random network contained the same number and type of vertices as the original network, and the link types were also the same, while the number of links in the random networks differed from that in the original network because of the permutations.
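Rules (1) to (3) can be sketched in Python as degree-matched sampling from the global network followed by counting the links among the sampled vertices. The adjacency-dictionary representation, the function names, and sampling without replacement are assumptions made for illustration; the global network is assumed to be undirected and free of self-loops.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Set

Adjacency = Dict[str, Set[str]]  # hypothetical layout: node -> set of neighbours


def degree_matched_sample(adjacency: Adjacency, proteins: Iterable[str],
                          rng: random.Random) -> List[str]:
    """Pick, for every input protein, a random protein of the same degree."""
    by_degree = defaultdict(list)
    for node in sorted(adjacency):
        by_degree[len(adjacency[node])].append(node)
    chosen: List[str] = []
    used: Set[str] = set()
    for protein in dict.fromkeys(proteins):  # drop duplicates, keep order
        degree = len(adjacency[protein])
        # Assumption: sampling without replacement keeps the vertex count equal.
        candidates = [n for n in by_degree[degree] if n not in used]
        pick = rng.choice(candidates)
        used.add(pick)
        chosen.append(pick)
    return chosen


def count_links(adjacency: Adjacency, nodes: Iterable[str]) -> int:
    """Count the links of the global network that join two of the given nodes."""
    node_set = set(nodes)
    # Each undirected link is seen from both ends, hence the division by two.
    return sum(len(adjacency[n] & node_set) for n in node_set) // 2
```

Calling `degree_matched_sample` 100 times and applying `count_links` to each result would give the sample of random link counts against which the analyzed network is compared.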

Restriction (1) on the degrees of selectable vertices in the global network is aimed at reducing the study bias described by Jensen et al. [45] as a tendency to study, in various aspects, primarily well-studied molecules. In this connection, we assume that vertices with relatively large degrees (hubs) accumulate more false-positive interactions than vertices with lower degrees. As can be seen from Fig. 1, the vertex degrees in the global gene network can be roughly described by a power law with the coefficient γ = 1.39. Therefore, the probability of choosing at random a vertex with a small degree is significantly higher than the probability of choosing a hub. Thus, if in the studied group of genes/proteins the hubs predominate for some reason, then such a network is likely to be more connected than the networks with randomly selected genes. The presence of well-studied genes in the analyzed sample can lead to a systematic error in random sampling, which was also noted in other works [40, 43].

FunGeneNet input data

The input is a list of protein identifiers from the UniProt or Ensembl databases. The program also accepts NCBI gene identifiers. When gene identifiers are supplied, the list of encoded proteins is first determined, and then the reconstruction is performed. The user can select either STRING or the ANDSystem for reconstructing the gene network. With STRING, the user can select one of the standard score thresholds for the presence of a connection in the global network: 150, 400, 700 or 900. With the ANDSystem, the user can select the type of interaction from the list (activity and transport regulation, catalysis, coexpression, expression regulation, interaction and all types).

FunGeneNet output data

The FunGeneNet output is a file containing an interactive network in ANDSystem/tab-delimited format and the t-test p-value, which characterizes the difference between the analyzed network and random networks. The t-test p-value assumes a normal distribution of the number of links in random networks and can deviate from the true probability values. Therefore, the ROC curve p-value is additionally calculated for the network being analyzed, as the proportion of negative sample networks having a t-test p-value less than or equal to that of the analyzed network (coords function of the pROC package of the R language).
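The ROC curve p-value defined above is a simple proportion, which a short Python sketch can make explicit. The function name and argument layout are hypothetical; the authors report using the coords function of pROC rather than code of this form.

```python
from typing import Sequence


def roc_curve_p_value(p_value: float, negative_p_values: Sequence[float]) -> float:
    """Proportion of negative-sample networks with a t-test p-value <= p_value."""
    if not negative_p_values:
        raise ValueError("the negative sample must not be empty")
    hits = sum(1 for p in negative_p_values if p <= p_value)
    return hits / len(negative_p_values)
```

For a negative sample with t-test p-values 0.01, 0.05, 0.2 and 0.9, an analyzed network with a t-test p-value of 0.05 receives a ROC curve p-value of 0.5, because two of the four negative networks score at least as well.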

Accuracy estimation of the FunGeneNet method

To analyze the accuracy of the FunGeneNet method, we applied the ROC analysis technique [46]. Networks constructed for GO biological processes were considered as a positive sample. Information on the involvement of proteins in the processes was taken from the UniProt-GOA database (Submission date: 3/16/2016) [47]. GO networks were divided into two groups according to the number of proteins. The first group included processes for which 2 to 50 proteins were annotated, and the second group included processes with more than 50 proteins according to UniProt-GOA (Additional file 1: Table S1).

As a negative sample of networks, four types of random networks were used, which were assumed to include functionally unrelated genes. Networks of the first type (simply random) were constructed by randomly selecting proteins from the whole set of human proteins that had at least one connection in the global ANDSystem network. This restriction, which excludes proteins not participating in the formation of the global network, was also applied to the other types of random networks. To build networks of the second type (well-studied), a random selection was made from proteins mentioned in at least 50 PubMed publications. Thus, this group was represented by relatively well-studied proteins. This group was created in order to take into account the possible FunGeneNet misclassification bias introduced by the level of scrutiny of proteins [45]. Networks of the third type (GO-based) were built using a random selection of proteins from the set of proteins annotated in the GOA database (Additional file 2: Table S2). These networks were constructed in such a way that no network contained proteins involved in the same biological process. Networks of the fourth type (identical degree distribution [IDD]) were constructed with a restriction on the vertex degrees, so that each set of proteins from the positive sample corresponded to a set of the negative sample. The selection procedure consisted of three steps: (1) the vertex degree in the global network is determined for each protein of a positive sample, (2) the list of all proteins with the same degree as a particular protein of the positive sample is extracted from the global network, (3) the starting protein for IDD network reconstruction is selected at random from this list. This method of reconstruction guaranteed equal vertex degree distributions in positive and negative samples.

When considering the characteristics of FunGeneNet depending on the size and completeness of the networks, the STRING score, and the t-test/permutation option, networks of the type “simply random” (Additional file 2: Table S2) were used.

To construct the ROC curves, the number of random networks in a negative sample, as well as the distribution of the number of proteins in the random networks were specified to be equal to those in the positive sample. The same positive and negative samples of proteins were used to reconstruct networks for the ANDSystem and STRING (version 9.1).

The ROC curve classifier score was taken to be equal to 1 − p-value, where the p-value is the statistical significance of the difference between the analyzed network and random networks, as reported in the program output. The area under the ROC curve (AUC) was calculated using the “roc” function of the pROC package of the R language. The “predictor” argument of “roc” was a vector of 1 − p-value scores for functional and random networks. The “response” argument was a vector with values equal to 1 for functional networks and 0 for random networks.
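The AUC computation can be illustrated without pROC by a minimal Python sketch that uses the rank (Mann-Whitney) formulation, which equals the trapezoidal area under the ROC curve when ties count as one half. The function name is hypothetical, and the fixed direction (higher score means functional) is an assumption.

```python
from typing import Sequence


def auc_from_p_values(p_values: Sequence[float], labels: Sequence[int]) -> float:
    """AUC for scores 1 - p-value; labels are 1 (functional) or 0 (random)."""
    scores = [1.0 - p for p in p_values]
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0   # functional network ranked above random one
            elif pos == neg:
                wins += 0.5   # ties contribute one half
    return wins / (len(positives) * len(negatives))
```

The quadratic loop is adequate for a sketch; a sort-based rank computation would be preferable for large samples.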

To analyze the performance of the method depending on the type of interactions, the ANDSystem types were combined into larger types: (1) “activity and transport regulation”, which included the following types of interactions: “activity downregulation”, “activity regulation”, “activity upregulation”, and “transport regulation”; (2) “catalysis”, including “catalyze”, “cleavage”, “degradation downregulation”, “degradation regulation”, and “degradation upregulation”; (3) “coexpression”, which was taken as a separate type; (4) “expression regulation”, consisting of “up-”, “down-”, and “expression regulation” itself; (5) “interaction”, which was taken as a separate type; and (6) “all types”, including all of the above types, as well as the type “expression” and the type “association”.
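The grouping above can be written down as a lookup table, sketched here in Python. The dictionary layout and the function name are assumptions, as are the exact spellings “expression upregulation” and “expression downregulation”, which are inferred by analogy with the activity types listed in the text.

```python
from typing import Dict, List

# Merged groups as listed in the text (layout is hypothetical).
MERGED_TYPES: Dict[str, List[str]] = {
    "activity and transport regulation": [
        "activity downregulation", "activity regulation",
        "activity upregulation", "transport regulation",
    ],
    "catalysis": [
        "catalyze", "cleavage", "degradation downregulation",
        "degradation regulation", "degradation upregulation",
    ],
    "coexpression": ["coexpression"],
    "expression regulation": [
        # Assumed spellings of the "up-" and "down-" variants.
        "expression upregulation", "expression downregulation",
        "expression regulation",
    ],
    "interaction": ["interaction"],
}

# Types that appear only in the generalized "all types" group.
ALL_TYPES_ONLY = ["expression", "association"]


def groups_for(interaction_type: str) -> List[str]:
    """Return the merged groups that an original interaction type belongs to."""
    groups = [name for name, members in MERGED_TYPES.items()
              if interaction_type in members]
    if groups or interaction_type in ALL_TYPES_ONLY:
        groups.append("all types")
    return groups
```

For example, “cleavage” maps to “catalysis” and “all types”, whereas “association” contributes only to “all types”.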

To estimate how the completeness of the studied process in the experimental set of genes affects the results, the following analysis was performed. In the first step, GO biological processes were divided into five groups according to the number of genes involved in each process: (1) processes involving 10 genes; (2) from 20 to 22 genes; (3) from 40 to 50 genes; (4) from 100 to 200 genes; (5) from 400 to 1000 genes. Next, for each process, 10 genes were randomly selected from its entire set of genes. Thus, the completeness for the first group was 100% (the experimental set contained all genes of the process), for the second it was 45–50%, for the third it was 20–25%, for the fourth it was 5–10%, and for the fifth it was 1–2.5%. The selected lists of proteins are given in Additional file 1: Table S1. In the next step, an ROC curve was constructed for each range of completeness.
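The completeness of each group follows from dividing the 10 sampled genes by the bounds of the group size, as the following Python sketch shows. The function name is hypothetical.

```python
from typing import Tuple


def completeness_range(min_genes: int, max_genes: int, sampled: int = 10) -> Tuple[float, float]:
    """Return (lowest, highest) completeness in percent for a size group."""
    lowest = 100.0 * sampled / max_genes   # largest processes are least complete
    highest = 100.0 * sampled / min_genes  # smallest processes are most complete
    return (lowest, highest)


# Size groups from the text: (minimum genes, maximum genes).
GROUPS = [(10, 10), (20, 22), (40, 50), (100, 200), (400, 1000)]

for min_genes, max_genes in GROUPS:
    low, high = completeness_range(min_genes, max_genes)
    print(f"{min_genes}-{max_genes} genes: {low:.1f}% to {high:.1f}%")
```

For the five size groups this gives 100%, about 45–50%, 20–25%, 5–10% and 1–2.5%, respectively.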

The significance of the difference in the AUC of the ROC curves was estimated using the two-sided unpaired DeLong’s test, through the roc.test function of R language.

The p.adjust function of the R language was used for the Benjamini–Hochberg multiple testing correction.
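For readers without R, the Benjamini–Hochberg adjustment can be sketched in a few lines of Python. This is an illustrative reimplementation of the standard step-up procedure, not the code used by the authors; the function name is hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Visit p-values from largest to smallest.
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, index in enumerate(order):
        rank = n - position  # 1-based rank in ascending order
        # Scale by n / rank and enforce monotonicity with a running minimum.
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted
```

The running minimum keeps the adjusted values monotone in the original p-values and caps them at 1.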

Results

Method assessment

We considered two variants of the method: one based on 1000 permutations and one based on the t-test, with the parameters of the normal distribution estimated from 100 permutations. To assess any decrease in accuracy when the t-test is used instead of permutations, we built ROC curves for both variants (Fig. 2). Figure 2 shows that the AUC for these variants is nearly the same for both the ANDSystem and STRING. For this reason, and because the t-test variant reduces the number of calculations by approximately 10 times, the ROC curves shown below were constructed by the method based on the t-test.

An interesting question about the FunGeneNet applicability is the dependence of the quality of the functional/non-functional network classification on the size of the gene set. Figure 3 shows that FunGeneNet performs non-random classification even in cases of small network sizes.

Interactions between the genes contained in the global network have different degrees of reliability. Therefore, the STRING system uses a special score that describes the weight of interactions. The STRING score serves as a threshold for eliminating noise. Increasing the score threshold for STRING networks can reduce the share of false interactions but also decreases the completeness of the networks. For this reason, we checked how the accuracy of FunGeneNet depends on the STRING score. Figure 4 shows the ROC curves for the standard values of the STRING score.

The use of ANDSystem networks in FunGeneNet allows analysis of different types of interactions, including all types (generalized type), activity and transport regulation, catalysis, co-expression, expression regulation, and interaction. Figure 5 shows the ROC curves for the different interaction types from the ANDSystem according to the different network sizes.

Another important issue in assessing the quality of the method is the appropriate sampling of non-functional networks. We proposed four models of non-functional networks: (1) “simply random”: random selection of a set of proteins from those having at least one connection in the global network; (2) “well-studied”: the choice is the same as in “simply random”, but from proteins found in more than 50 publications; (3) “GO based”: random selection is made from GOA, so that no two proteins in the sample share a GO biological process in the direct GOA annotation; (4) “identical degree distribution” (IDD): with this choice of negative control, the vertex degree distributions (vertex degrees are counted using the global network) in negative protein samples are exactly the same as those of the positive samples. Figure 6 illustrates the ROC curves for the ANDSystem for various models of negative control.

An experimental gene/protein set analyzed with FunGeneNet may, for various reasons, contain only a small part of the biological process under investigation. We therefore explored how much the accuracy of the method depends on the completeness of the data on the observed process. Figure 7 shows the ROC curves for different portions of GO biological processes for which the network is built. As expected, the area under the ROC curve decreases as the proportion of proteins over which the network is built decreases. For protein sets composed of 5–10% of all proteins assigned to the GO biological process, the classification is weaker, but not yet random, and for sets of 1–2.5%, it is close to random.

Thyroid cancer network

Papillary thyroid cancer is the most common form of thyroid cancer [48]. In the dbDEPC database, we identified data from three experiments on papillary thyroid cancer: EXP00039 (E39), EXP00050 (E50) and EXP00051 (E51). E39 contained a list of 30 differentially expressed proteins [49]. E50 and E51 were conducted within the same work and gave an identical list of 16 proteins for two different variants of cancer cell types [50]. At the intersection between the E39 and E50 lists, there were five proteins: ANXA1, Beta-actin, Moesin, FTL and Galectin-3.

Using FunGeneNet, we reconstructed networks for E39 (Additional file 3) and E50 (Additional file 4), as well as for intersection (Additional file 5) and union (Additional file 6) of the protein lists. The results of comparing networks E39 and E50 with random networks are listed in Table 2.

Apoptosis network

As an example, we considered a functional network formed by genes/proteins participating in the GO apoptotic process [GO:0006915]. Apoptosis is known to be necessary for the normal development and functioning of the organism and is also of key importance in the mechanisms of many diseases, such as neurodegenerative diseases and cancer [51,52,53]. A wide range of interactions is involved in this process, including protein–protein interactions and regulatory links that determine the regulation of gene expression, as well as the regulation of protein activity and transport, etc. Identifying the significance of different connection types in the gene network of the apoptotic process can help to better understand the mechanisms of functioning and the role of the participants of this complex biological process.

The protein list of apoptosis according to UniProt-GOA included 593 proteins (Additional file 7: Table S4). The network included 591 proteins, 585 genes and 12,529 interactions (Additional file 8: apoptotic process.andz).

FunGeneNet established the apoptosis network as functionally enriched by the types of “activity and transport regulation” (p-value = 3.95e-09), “catalysis” (p-value = 3.06e-06), “coexpression” (p-value = 3.09e-02), “interaction” (p-value = 3.24e-76) and “all types” (ANDSystem p-value = 1.46e-30, STRING p-value = 0). All networks for these types of links generally correspond to the power law of vertex degree distribution (Additional file 7: Table S4). This means that a small fraction of vertices aggregates most of the connections and these vertices can be of considerable interest.