We proposed the USGSA method with implemented R package of snpGeneSets to provide a convenient and fast tool for study of pathway association from GWAS data. The USGSA applies the uniform score to unify four different gene measures of minP, 2ndP, simP and fishP, and measures pathway association by hypergeometric test. The pathway analysis by USGSA is based on test of MSigDB gene sets. The MSigDB annotates 10,722 genes sets with 32,364 genes, and the number is much smaller than the number of GWAS SNPs. Therefore, the pathway analysis will alleviate the burden of multiple-test adjusting and improve testing power. Application of USGSA successfully identified 7 significant gene sets associated with all studied diabetes traits, indicating common genetic regulations shared among different traits.
Four gene measures of USGSA are proposed to summarize gene effects with different characteristics: the minP and the 2ndP measure gene effects based on a single-SNP association; while the simP and the fishP assess gene effects based on multiple SNP associations with accounting for the number of GWAS SNPs in a gene. USGSA applies a uniform score to unify these gene measures for comparability with the same interpretability. The score ranges from 0 to 1, and it is explained as top percentage of the gene associations over genome. The USGSA calculates empirical p
e
of pathway association from gene measures based on hypergeometric distribution, which accounts for both the number of significant genes and the size of the pathway. The pathway adjusted p-value (P
adj
) can be calculated by permutation test to account for pathway dependence and multiple testing, and an independent pre-generated permutation table is directly used to facilitate the calculation.
USGSA gene measures can better facilitate replication studies of a gene effect that may have inconsistent SNP associations from different GWAS due to genetic heterogeneity. USGSA can also help to identify a significant pathway shared by different traits which however may be activated through different mechanisms. For example, expression study showed that T1D and T2D likely share a common pathway which however has different regulated genes (e.g. MYC) [17]. For this study, the gene set of FOXO4 (pid: ‘2247’) has total ~2,000 genes, and contained 131 ~ 149 and 167 ~ 214 genes with uniform scores ≤ 0.05 in stage I and stage II, respectively, which are about 9.7% ~ 11.8% of component genes among top 5% of the gene associations over genome; the gene set had only 10 significant genes over all studies. These results indicate that multiple diabetes traits may be influenced by a common pathway activated through different genes.
The USGSA does not require access to individual-level SNP genotypes and the analysis is based on GWAS p-value only. For existing pathway analysis of GWAS p-values, ALIGATOR preselects a p-value criterion to define a list of significantly associated SNPs [9]; the i-GSEA4GWAS [4], GeSBAP [5] and gamGWAS [6] all select the best p-value of SNP associations to measure gene effects; and the GSA-SNP enables selection of the kth (k = 1, 2, 3, 4, or 5) best p-values as the gene measure [7]. Compared to these analyses, the USGSA provides broader measures to summarize gene effects with different characteristics as described above. The existing pathway analysis, e.g. ALIGATOR [9], generally requires selection of a p-value cut-off to identify significant SNPs and genes for pathway test. The selection may be arbitrary and study-dependent, and the results may not be comparable between different studies. In contrast, USGSA selects a uniform-score cut-point (0 ≤ α
G
≤ 1) for all four gene measures, which is hypothesized as α
G
proportion of genes associated with the study trait. Significant genes identified from different GWAS data by different USGSA gene measure will have the same interpretation, explained as top 100*α
G
% of the gene associations over genome. Comparing with many existing pathway analyses that rely on a time-consuming permutation to adjust for complex genetic structures, the USGSA implements the adjustment by corresponding gene measures, hypergeometric test and permutation test. In addition, benefiting from the random uniform distribution of uniform score, the USGSA provides a pre-generated permutation distribution table, which facilitates a computationally efficient calculation of pathway adjusted p-value (P
adj
) for different gene measures.
The USGSA calculates both empirical (p
e
) and adjusted (P
adj
) p-values to measure pathway association. Due to multiple testing issue, the p
e
can result in high FWER (Figure 1). The P
adj
was shown to have well-controlled FWER and high specificity for all four gene measures (GPR ≤ 2.3%), especially for the simP measure (Figures 1 and 2). GSA-SNP analysis implements a computationally efficient measure of pathway association from GWAS SNP p-values by Z statistic. However, the measure assumes independence of gene sets and the test may result in inflated FWER which is evidenced in simulation study of dataset I (GRP = 11.8%). By comparison, the power is 0.1% ~ 1.9% for 4 gene measures of USGSA with P
adj
as significance indicator (Figure 1). Therefore, the permutation-based P
adj
not only adjusts for multiple testing but also alleviates potential issues due to complex genetic structure.
The power of USGSA based on P
adj
depends on selected gene measures, and an inappropriate measure can result in a contrary conclusion. For dataset IV, component genes of KEGG_T2D pathway are simulated to have multiple SNP associations each, and the gene measures of 2ndP and fishP have extremely high power of 98% and 100%, whereas the gene measures of minP and simP have low power of 24% and 9% respectively. For dataset V, 11 component genes of KEGG_T2D pathway are simulated to have an extremely strong SNP association each, and the gene measures of minP and simP presented high power of 84% and 100%, whereas 2ndP and fishP showed low power of 1% and 58% respectively. Therefore, different gene measures have their adaptive tests of pathway associations: 2ndP and fishP are more fitting for a gene containing multiple SNP effects, while minP and simP are more suitable for a gene having extreme SNP effects. The results also indicate that the fishP is tolerant to gene effects characterized with different types of SNP associations.
USGSA was successfully applied to identify common pathways associated with different diabetes traits. GWAS SNP associations of diabetes traits were identified from the dbGaP and classified using 11 FHS GWAS in stage I and 5 non-FHS GWAS in stage II. GWAS analyses of the FHS samples are dependent, leading to the possibility that SNP associations are potentially correlated among different GWAS and the identified pathways in stage I may be false-positive. Therefore, we analyzed the non-FHS GWAS, based on independent samples, to validate the candidate gene sets through stage II. Although the FHS and non-FHS GWAS are respectively based on lower- and higher-resolution genotyping from different platforms, SNP-Gene mapping of USGSA (Figure 4) applies the recent genome build and makes a consistent map of SNPs and genes to perform comparable pathway studies across heterogeneous GWAS.
USGSA with the fishP measure successfully identified 7 common gene sets associated with diabetes traits. Component genes of these gene sets have significantly higher probability of association with diabetes traits among the top 5% of genes than a random gene. These component genes have common binding motifs in their promoter regions of [-2kb, 2kb] around transcription start sites. The motifs include targets for 5 TFs of FOXO4, NFAT, TCF3, VSX1 and POU2F1, 1 microRNA of MIR-218 and one unknown binding factor. There are 25 common component genes with uniform score ≤ 0.05 over all GWAS (Table 3). These genes and their binding factors suggest potential regulatory genetic mechanisms underlying diabetes pathogenesis. For example, CNTN4 belongs to the gene sets of VSX1 (pid: ‘2239’) and TCF3 (pid: ‘2240’). Association tests and mouse experiments have indicated that CNTN4 is an obesity–insulin targeted gene [22]. However, the gene function related to diabetes pathogenesis remains unclear. TCF3 is highly expressed in islets of T2D patients and is associated with Wnt signaling in diabetes pathogenesis [21]; and VSX1, expressed in ocular tissues, is associated with eye diseases [23]. Therefore, our findings and existing published evidences can lead to the hypothesis that mutations in the CNTN4 gene modify its binding with TCF3 in the pancreas and VSX1 in the brain and activate related genetic regulations for diabetes and its complications. NPAS3 is the consistent significant gene of all GWAS with target motifs for TFs of FOXO4 (pid: ‘2247’) and POU2F1 (pid: ‘1551’). FOXO4 [19] and POU2F1 [24] have been shown to regulate insulin signaling and glucocorticoid expression, respectively. Therefore, the component gene of NPAS3 and TFs of FOXO4 and POU2F1 can form another hypothesis of potential genetic pathways related to diabetes pathogenesis.
Our literature and gene annotation review showed that nearly all of the 25 common genes are highly expressed in the brain, and most of them are known as susceptibility genes for neural development and neurological disorders (Additional file 2: Table S3). These findings suggest that the CNS may have a critical role in diabetes pathogenesis. This conclusion is supported by previous studies of NPAS3, where gene mutations are related to the aerobiology of psychiatric illness [25,26] and the gene can induce susceptibility to diabetes for psychiatric patients [27]. Therefore, although pathway analysis at this study cannot provide direct evidences of regulatory mechanisms, the findings can help to form new hypotheses and initiate follow-up studies to ascertain pathogenetic changes of diabetes progress.
SNP associations for the analysis were retrieved from public GWAS results stored in the dbGaP. However, the number of available GWAS for diabetes traits is limited and the genotyping has low resolution (≤500K SNPs), especially for the FHS GWAS, which may cause missed identifications of some important gene and pathway associations. Therefore, more GWAS of diabetes traits are required to improve the power and replicate the findings in future studies. In addition, the gene set analysis depends on the selection of uniform-score cut-point α
G
, which is 5% for this study. The MSigDB database includes 32,364 genes from 10,722 genes sets and this selection assumes ~1,620 genes of them associated with every diabetes trait. Pathway analysis of diabetes traits showed that these genes are significantly enriched in the seven identified gene sets. A smaller/larger α
G
will cause a decreased/increased number of target genes for enrichment test, which consequently affects identification of significant pathways.
The current study aims to identify common pathways significantly associated with different diabetes traits. The findings will help to discover shared genes and genetic pathways related to different diabetes symptoms and complications. A significant common gene set is observed only if it has P
adj
≤ 0.05 over all GWAS in stage II, and the probability of making type I error is far less than 0.05. The 7 significant identifications by USGSA account for 0.07% of total MSigDB gene sets. These gene sets contain 25 common genes, which are among the top 5% of gene association over genome for all GWAS and account for <1.5% of total hypothesized susceptibility genes in enrichment test. However, it may not be accurate for the hypothesis that 5% of genes in the genome (~1,620 susceptibility genes) are associated with diabetes traits, and 1% deviation of the hypothesis will result in changes of ~320 susceptibility genes for diabetes traits. Therefore, although there are strongly significant evidences that component genes from the 7 gene sets are associated with diabetes traits, these 25 common genes are only susceptibility candidate genes for diabetes pathogenesis, which requires further replications and experiment validations in the future study.