Development of the GEne set topological impact analysis (GESTIA)
To assess the relationships between gene sets or pathways, the significance of the enrichment of the overlapping genes is often used as an indicator. However, the number of overlapping genes between pathways does not reflect the upstream/downstream interactions of the pathways. As observed in the example signaling pathways (REACTOME_MAPK1_ERK2_ACTIVATION and REACTOME_MAPK3_ ERK1_ACTIVATION from Molecular Signature DataBase, MSigDB, Fig. 1a) [6, 17], it is possible for two pathways to share a high proportion of genes, yet exhibit nearly equal dominant effects on each other. The “dominant effects” here means a “tail to head” interaction of the upstream pathway to the downstream pathway. On the other hand, even if two pathways share few or even no common genes, they might still be closely correlated with each other. This is illustrated in Fig. 1b, which displays REACTOME_MAPK1_ERK2_ACTIVATION and REACTOME_PI3K_AKT_ ACTIVATION signaling pathways from MSigDB, and in Fig. 1c, which displays BIOCARTA_MTOR_PATHWAY and REACTOME_PI3K_AKT_ACTIVATION from MSigDB. The two pathways in Fig. 1a and Fig. 1b show an almost symmetrical topology, which is also true in the biological sense. The yellow pathway in Fig. 1c is clearly located upstream of the red pathway. In order to quantitatively assess the upstream/downstream relationships of two pathways/modules, especially the relationships exemplified by Fig. 1b and Fig. 1c, we developed a novel algorithm named “GEne Set Topological Impact Analysis” or GESTIA (Fig. 1d).
Since genes may have different functions or interactions in different pathways/modules, interrogating the interactions between pathways in the context of a global network enables comprehensive assessment of the gene interactions beyond the individual pathways being studied (see Fig. 1b as an example). Suppose Gene A and Gene B are from two different pathways, and they do not interact with each other according to these two pathways. However, they may interact with each other in a third pathway. This interaction will only be shown if we take the gene interactions from the third pathways into consideration when we study the relationship of the first two pathways. And this the reason that two pathways sharing no common genes are still possible to interact with each other. Considering the situations like this, we constructed a global network from the KEGG pathway database before calculating GESTIA scores. The code for merging and filtering the global network can be found in Additional file 3.
To quantitatively assess the upstream/downstream relationships of pathways/modules, we designed the GESTIA score to reflect the relative influence of one pathway/module on the other, since “upstream” usually implies that one pathway/module acts on another pathway/module. Therefore, we converted the calculation of GESTIA score into the comparison (subtraction) of the two calculable influence scores, pathway/module A on pathway/module B and the reverse, B on A. To calculate the influence score of A on B, we took the sum of all the influences of the genes in A on B as the raw influence score. The influence score of B on A was calculated similarly.
However, with any given topological structure of two pathways/modules, there is a chance for random interactions between genes that belong to different pathways/modules. The influence scores of these random interactions, therefore, form the null distribution. Because of the variability of the topological structure of pathways, it is difficult to construct a universal null distribution for all the combinations of different pathways. Therefore, we permutated the interactions between the two pathways/modules being studied to calculate an empirical null distribution of the influence score, while maintaining the internal topological structure of each of the two pathways/modules.
For pathways with overlapping genes, the interactions between the shared genes and other genes can be considered as both intra-pathway/module and inter-pathway/module interactions, which impede the permutations, because if we perturb these inter-pathway/module interactions, we will have to change the pathways’/modules’ topological structure, since these interactions are also intra-pathway/module. To overcome this dilemma, we create pseudo-vertices in the network that duplicate the shared genes (Fig. 1d). In this way, the two previously overlapping pathways/modules are split into two non-overlapping pathways/modules, which enables the same permutation strategy as for non-overlapping pathways/modules. At the same time, since this splitting process does not perturb the topological structure of individual pathways/modules, the raw influence scores remain the same. This overlap-splitting strategy successfully solved the difficulty of permutating pathways/modules with overlapping genes, enabling the unbiased modeling of the null distribution of the influence scores. Meanwhile, similar to GSEA [6], we estimated the significance of the relative influence scores by calculating the chance of randomly permutated interactions getting a more significant relative influence score than the real one.
Measurement of the upstream/downstream position of two pathways by GESTIA score
We applied GESTIA first on the example pathway pairs in Fig. 1. We calculated a GESTIA score of 0.013 and a p-value of 0.49 for Fig. 1a, which is consistent with the demonstrated topological structure of the two parallel pathways (red and blue vertices && yellow and blue vertices). For Fig. 1b, we got a positive GESTIA score of 0.011, which indicated “slight” upstream activity of the pathway REACTOME_MAPK1_ERK2_ACTIVATION (yellow vertices) to the pathway REACTOME_PI3K_AKT_ ACTIVATION (red vertices). Indeed, although all the arrows between the two pathways end in the REACTOME_PI3K_ AKT_ACTIVATION pathway, almost all of the endpoint genes (IRS1, IRS2 etc.) of these interactions are downstream of this pathway, meaning that if the interactions are randomly chosen between the two pathways while maintaining the directions, there is a large chance that the resulted random raw GESTIA score is higher than the real one, because the randomly chosen interactions are likely to end in more upstream genes of REACTOME_PI3K_AKT_ACTIVATION. Therefore, the significance of these upstream/downstream relationships is low, reflected by a p-value of 1.0. For Fig. 1c, pathway BIOCARTA_MTOR_PATHWAY (red and blue vertices) clearly sits downstream of pathway REACTOME_PI3K_AKT_ACTIVATION (yellow and blue vertices); therefore, the strong negative GESTIA score (− 1.4) and a small p-value of 0.001 is as expected. In general, GESTIA score and its p-value can correctly indicate upstream/downstream relationships between two pathways.
To compare GESTIA with commonly used pathway analysis tools, we applied GESTIA on the DNA repair pathways and oncogenic pathways, most of which were selected from Chi et al. 2019 [18]. We first calculated the matrix of the Jaccard Index, which is an indicator of similarity, for the oncogenic and DNA repair pathways (Fig. 2a, Additional file 4). The oncogenic pathways and the DNA repair pathways can be clearly separated into two groups by the number of shared genes between them. Using the Jaccard Index matrix as an adjacency matrix, we constructed an undirected weighted network of these pathways, which shows a similar tendency (Fig. 2c).
The heatmap of GESTIA score matrix, however, shows a distinct structure of relationships between these pathways (Fig. 2b, Additional file 5). Since the sign of the GESTIA score reflects the upstream/downstream relationship of the two pathways, we arranged the matrix to be the GESTIA scores of row pathways over column pathways; therefore, a positive score indicates that the row pathway is upstream of the column pathway. As expected, the DNA repair pathways shows fewer interactions because of the distinct mechanisms these pathways are involved in. On the other hand, all of the growth factor receptor pathways are upstream of PID_P53_REGULATION _PATHWAY, as expected. However, there are two special cases that showed unexpected relationships with other pathways. One is PID_ERBB2_ ERBB3_PATHWAY, which is downstream of all other oncogenic pathways. Examination of this pathway and its interactions with other pathways showed that, unlike other growth factor receptor pathway collections, this pathway contains more downstream pathway genes, e.g. genes from the PI3K/AKT and MAPK signaling pathways, as exemplified by the comparison of PID_ERBB2_ERBB3_PATHWAY and REACTOME_SIGNALING_BY_ERBB2 (Additional file 1). Although the names of these two pathways may create an impression that they should be similar, or at least partially similar because they are both about ERBB2’s pathway, the significant differences in the gene members of these two pathways suggest that the annotations and the components of the pathways might differ from database to database, hence researchers need to check the components of the pathway collections in detail to accurately interpret the enriched pathways. The other unexpected pathway is REACTOME_PI3K_EVENTS_IN_ERBB2_ SIGNALING, which is upstream of other growth factor receptor pathways. This is because it contains fewer downstream genes than other pathways, as exemplified by the interactions of this gene set with REACTOME_SIGNALING_BY_ERBB2 (Additional file 2). To better visualize the upstream/downstream relationships of these pathways, we constructed a directed weighted network of these pathways to demonstrate the interactions between these pathways (Fig. 2d). Negative values and values less than 1 were excluded before the construction of the adjacency matrix. This network clearly shows the upstream/downstream relationships of the oncogenic pathways and PID_P53_REGULATION_PATHWAY.
Assembly of the enriched pathways based on GESTIA scores
Next, we demonstrate the ability of GESTIA to analyze the upstream/downstream relationships of enriched pathways in real datasets. We first analyzed the up/down-regulated pathways enriched by GSEA analysis of the RNA-seq data of TERThigh Vs. TERTlow hepatocytes from Lin et al. 2018 [19]. In this paper, GSEA revealed enriched gene sets associated with cell division and receptor tyrosine kinase activity in the TERTHigh population, and enriched gene sets associated with ribosome components, mitochondrial proteins, the electron transport chain and hepatocyte metabolic activities in the TERTLow population. However, the paper did not further explore the relationships between these enriched pathways.
After conversion of the GESTIA matrix (Additional file 6) to a directed weighted network (Fig. 3a), the assembly identified one super module consisted of two sub-graphs, with the GO_CELL_CYCLE functioning as a hub between them. We hereby define “GESTIA analysis” to refer to the whole processes of calculating GESTIA matrix, filtering and assembly of the super-module(s). Further demonstrations of the interactions between BIOCARTA_MAPK_PATHWAY and HALLMARK_ XENOBIOTIC_METABOLISM (Fig. 3b) and between GO_CELL_CYCLE and GO_MITOCHONDRIAL_MEMBRANE_PART (Fig. 3c) show that GESTIA analysis correctly detected the upstream/downstream relationships within these two pair of pathways. Notably, most of the genes in GO_MITOCHONDRIAL_MEMBRANE_PART do not form an interaction network, but GESTIA analysis still detects the relationships, which is mainly caused by the regulatory effects of Prkaca and Prkacb in GO_CELL_CYCLE on the Nduf gene family in GO_MITOCHONDRIAL_ MEMBRANE _PART (Fig. 3c). We further checked the source of the interactions between Prkaca/Prkacb and the Nduf gene family, and we found that Prkaca/Prkacb exhibits inhibitory effects on the Nduf genes in the Endogenous Cannabinoid Signaling Pathway (hsa04723). Although the mRNA level of Prkaca and Prkacb was not significantly changed, the genes Adcy6 and Adcy9, which catalyze the synthesis of cAMP, were up-regulated (adjusted p-value 0.049 and 0.095), increasing the activation of Prakaca and Prakacb, which would consequently inhibit the Nduf genes, consistent with the up-regulation of GO_CELL_CYCLE and the down-regulation of GO_MITOCHONDRIAL_MEMBRANE_ PART.
We then calculated the GESTIA matrix for the significantly enriched (adjusted p-value < 0.05) hallmark gene sets comparing remission BCR-ABL− stem cells against remission BCR-ABL+ stem cells from another published work [20] and converted the GESTIA matrix (Additional file 7) to a weighted interaction network (Fig. 3d). Of the 13 enriched gene sets, 9 showed significant upstream/downstream relationships in a super module. The network clearly shows two distinct sub-networks: one is the signaling network of PI3K, AKT, mTOR, TNF-alpha, E2F and G2M check point; the other is the adipogenesis gene set and the network of oxidative phosphorylation, fatty acid and xenobiotic metabolism. The gene sets of G2M check point and DNA repair are the terminal vertices of the network, which is consistent with the elevated cell cycle/proliferation and DNA repair activities in remission BCR-ABL− stem cells. Taken together, GESTIA analysis identified the upstream/downstream relationships of the enriched pathways and revealed the orchestrated regulation of pathways.
Assembly of the refined pathways/modules based upon GESTIA scores
As revealed by Donato et al. 2013, unrelated pathways may gain significant p-value due to the overlapping genes with genuinely affected/causal pathways. The proposed “crosstalk analysis” can identify and filter these false positive and redundant pathways, at the same time create functional modules. Assembling the filtered pathways and newly created functional modules may generate a more concise and accurate map of how these pathways/modules interact with each other.
Since the examples presented in Donato et al. 2013 were well annotated with detailed explanation of the rationality of the results, we followed the works and applied GESTIA on both the unfiltered and filtered pathway analysis results. Figure 4a and b shows the assembled pathways and modules identified in the first example in Donato et al. 2013, which profiled the transcriptome of mice’s white fat tissue during transition into resembling brown fat tissue. The vertices in red are the pathways that were known to be unrelated to this biological process, while the green ones are supported by experimental evidences, and the blank ones are uncertain. Applying GESTIA analysis on the enriched pathways without crosstalk analysis keeps all of the three known related pathways, and excluded Parkinson’s Disease, Cardiac Muscle Contraction, Lysosome, and Complement and Coagulation Cascades, first two of which are known to be unrelated pathways (Fig. 4a). In Donato et al. 2013, crosstalk analysis shortened the list of the significantly enriched pathways and produced four additional functional modules. GESTIA analysis of these filtered pathways and functional modules shows there are at least two distinct super-modules in this context, involving the three known related pathways/modules and one previously uncertain module: Cytokine-Cytokine receptor interaction (Fig. 4b). The other two uncertain modules were left out in this assembly, which indicates that there were no significant upstream/downstream relationships between them and other pathways/modules. This is probably because the current pathway collections are still incomplete, hence there are unprofiled pathways which may link these two modules with other pathways/modules.
We also applied GESTIA analysis on the second example in Donato et al. 2013, which described cross talk analysis results of an experiment on cervical ripening. Figure 4c shows the GESTIA analysis result of the significantly enriched pathways without crosstalk analysis. The crosstalk analysis produced an independent functional module, extracted from the three pathways: ECM Receptor Interaction, Amoebiasis, and Focal Adhesion, named “Integrin Mediated ECM Signal” (Fig. 4d). Interestingly, the GESTIA analysis of the filtered pathways/modules (Fig. 4d) maintains the basic topological structure in Fig. 4c, with the replacement of the false positive pathway Small Cell Lung Cancer by Leukocyte Transendothelial Migration, and an additional downstream pathway Dilated Cardiomyopathy, which is labeled in red because of the lack of evidences. However, the strong GESTIA score between Dilated Cardiomyopathy and Leukocyte Transendothelial Migration led us to investigate the possibility of real involvement of this Dilated Cardiomyopathy pathway. We plotted the gene interactions between Dilated Cardiomyopathy pathway and Leukocyte Transendothelial Migration pathway (Additional file 8), which clearly showed three sub-modules in Dilated Cardiomyopathy pathway separated by Leukocyte Transendothelial Migration pathway. The three sub-modules correspond to 1. Sarcomere, 2. ECM-receptor interaction, and 3. Sarcoplasmic Reticulum derived calcium signaling. Each of these three sub-modules contains significantly changed genes, suggesting that they may all be involved in the process. The ECM-receptor interaction module has been detected by crosstalk analysis, as shown in Fig. 4d at the upstream of the super-module. The other two sub-modules are related to the muscle contraction, which was reasonable since 10 to 15% of the uterine cervix is constituted of smooth muscle [10].