In this study, we have developed a new pairwise SNP-interaction prioritization algorithm for GWAS. We hypothesized that by first accounting for pairwise marker dependencies among case and control groups, it would be possible to observe true disease interactions above the noise of dependent markers unrelated to disease, as was proposed in earlier studies of LD contrast (see Background).
In GWAS data, it is well known that LD generates strong pairwise dependency signals, which are used to identify disease-associated SNPs by imputation. However, this type of signal dominates the marker pairs detected in analyses of gene interactions. For example, in the approach used by Wan et al., the majority of the interactions identified for all seven WTCCC datasets can be attributed to LD effects, i.e., the interacting SNPs are within 1 Mb of each other in the same genomic region. To validate our approach of correcting for pairwise dependencies unrelated to disease SNP interactions, extensive tests were performed on simulated data. For a simple model with only one interacting pair, the top-ranked iLOCi pair is correctly identified as the disease marker pair. When testing for multiple interacting pairs, iLOCi has high accuracy under conditions of high heritability and informativeness, i.e., low MAF. On the other hand, low heritability and/or informativeness leads to type I errors, as observed in the ROC plots. In general, the ρdiff scores reflect the degree of heritability and informativeness. Hence, it is not possible to use a single ρdiff cutoff for identifying disease interactions in real data, where the heritability and informativeness are unknown.
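To illustrate why contrasting pairwise dependencies between cases and controls suppresses LD signals, the following minimal Python sketch may be useful. It is not the iLOCi implementation: it assumes, purely for illustration, that the score for a SNP pair is the absolute difference between the Pearson correlation of genotypes in cases and in controls, and the function names (`pearson`, `rho_diff`, `rank_pairs`) and the toy genotypes are hypothetical. In the toy data, SNP0 and SNP1 are in complete LD in both groups, whereas SNP2 and SNP3 are correlated in cases only.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def column(rows, k):
    """Genotypes (0/1/2) of SNP k across all individuals."""
    return [row[k] for row in rows]


def rho_diff(cases, controls, i, j):
    """ASSUMED score: |correlation in cases - correlation in controls|."""
    r_case = pearson(column(cases, i), column(cases, j))
    r_ctrl = pearson(column(controls, i), column(controls, j))
    return abs(r_case - r_ctrl)


def rank_pairs(cases, controls, top_k=1000):
    """Score every SNP pair and return the top_k as (score, i, j) tuples."""
    n_snps = len(cases[0])
    scored = [(rho_diff(cases, controls, i, j), i, j)
              for i, j in combinations(range(n_snps), 2)]
    scored.sort(reverse=True)
    return scored[:top_k]


# Toy genotypes: rows are individuals, columns are SNP0..SNP3.
grid = [(a, b) for a in range(3) for b in range(3)]
cases = [[a, a, b, b] for a, b in grid]                 # SNP2 tracks SNP3 in cases
controls = [[a, a, b, (a + b) % 3] for a, b in grid]    # SNP2, SNP3 uncorrelated

top = rank_pairs(cases, controls, top_k=1)
ld_score = rho_diff(cases, controls, 0, 1)
```

Under these assumptions, the LD pair (SNP0, SNP1) is equally dependent in both groups and therefore scores zero, while the pair that is dependent in cases only is ranked first.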
From analyses of real GWAS data, it was found that the ρdiff distributions for all seven diseases could be represented by a single kernel density function with a Weibull distribution. However, the range of ρdiff values varies among the diseases and follows the known heritability pattern, i.e., HT has the lowest heritability and the lowest top ρdiff score, while T1D has the highest heritability and the highest top ρdiff score (Table 2). Although it is possible to calculate P-values for the interacting pairs and use them as cutoffs for prioritization, we consider the use of P-value cutoffs inappropriate. For example, a P-value of 1e-5 (corresponding to ρdiff values of approximately 0.2 or greater) would give approximately 16 million significant pairs for T1D and 200,000 pairs for HT. The same phenomenon of unacceptable type I error was found by others when using FastEpistasis for the analysis of real datasets. It is debatable whether Bonferroni correction is valid, since the tests are not independent, as shown by the heavy-tailed distributions of ρdiff. Current methods for correcting type I error by false discovery rate are also likely to be impractical because of the requirement for permutation testing.
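The scale of the multiple-testing problem can be made concrete with a short calculation. The Python sketch below is illustrative only: the SNP count of 500,000 is an assumed round number, not the number of SNPs analyzed in this study, and the function name `pair_budget` and the family-wise level of 0.05 are hypothetical choices. It reports the number of possible SNP pairs, the number of pairs expected to pass a P-value cutoff by chance if all tests were independent and null, the corresponding Bonferroni threshold, and the percentage of all pairs covered by a fixed list of top-ranked pairs.

```python
from math import comb


def pair_budget(n_snps, alpha=1e-5, top_k=1000, family_alpha=0.05):
    """Back-of-envelope figures for an exhaustive pairwise SNP scan."""
    n_pairs = comb(n_snps, 2)  # n_snps * (n_snps - 1) / 2 unordered pairs
    return {
        "n_pairs": n_pairs,
        # pairs expected below alpha if every test were independent and null
        "expected_null_hits": alpha * n_pairs,
        # per-test threshold under Bonferroni correction
        "bonferroni_threshold": family_alpha / n_pairs,
        # share of all pairs covered by the top_k list, in percent
        "top_k_percent": 100.0 * top_k / n_pairs,
    }


# ASSUMPTION: 500,000 SNPs is an illustrative array size, not the study's count.
budget = pair_budget(500_000)
for name, value in budget.items():
    print(f"{name}: {value:.3g}")
```

With an array of this assumed size, a list of 1000 pairs covers far less than 0.0001% of all possible pairs.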
Instead of using P-value significance thresholds, we used the top-ranked 1000 SNP pairs for prioritization, which account for a very small portion (<0.0001%) of all possible pairs. Rather than attempting to identify all gene interactions, which in practice cannot be found, we limit the prioritization to the top-ranked pairs that are most likely to contain the genetic interactions that are informative of the disease etiology, i.e., disease pathways. From the full SNP set analysis, several hub SNPs that interact with many other SNPs were identified for each disease. For some diseases, such as T1D, these hub SNPs map to well-known disease-associated genes. However, hub SNPs for BD, HT, and CD do not map to genes. These hub SNPs may mediate interactions at an unknown gene regulatory level, e.g., as non-coding RNAs, miRNAs or cis-regulatory elements. Since our knowledge of gene regulation is far from complete, we repeated the iLOCi analysis on the gene-only SNP subset. Restricting the analysis to SNP pairs in genes only made the ToppGene systems approach for gene prioritization appropriate, as used by others for GWAS data [37–39].
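Hub SNPs can be read off a ranked pair list by counting how often each SNP appears. The Python sketch below shows one simple way to do this; the function name `hub_snps`, the `min_degree` threshold and the SNP identifiers are hypothetical, and the criterion actually used to call a hub in this study may differ.

```python
from collections import Counter


def hub_snps(top_pairs, min_degree=3):
    """Return (snp, degree) for SNPs appearing in at least min_degree pairs.

    top_pairs is an iterable of (snp_a, snp_b) tuples, e.g. the top-ranked
    1000 pairs. min_degree is an ASSUMED illustrative cutoff.
    """
    degree = Counter()
    for snp_a, snp_b in top_pairs:
        degree[snp_a] += 1
        degree[snp_b] += 1
    return [(snp, d) for snp, d in degree.most_common() if d >= min_degree]


# Placeholder identifiers, not real rs numbers.
pairs = [
    ("rsA", "rsB"),
    ("rsA", "rsC"),
    ("rsA", "rsD"),
    ("rsB", "rsC"),
    ("rsE", "rsF"),
]
hubs = hub_snps(pairs)
```

The same counting applies unchanged to gene pairs once each SNP in a pair has been mapped to a gene.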
Gene-based prioritization of the interacting SNP pairs revealed significant representation of previously described disease-associated genes. Therefore, we are confident that the previously unreported genes found among the prioritized SNP pairs are novel disease-associated genes. For each disease, hub genes that pair with many other genes were found. Some of these disease hub genes are known and have been replicated as disease genes by conventional single-SNP GWAS, including the MHC gene HLA-DQB1 for T1D and TCF7L2 for T2D. However, some hub genes have not been reported previously, e.g., the CACNG1 gene for RA. The SNP in this gene shows only a modest P-value (>1e-4) for association by single-SNP analysis, which suggests that the disease association of this SNP depends on multiple interactions with other loci. For each disease, including those with low heritability such as HT, we are able to suggest novel genes and pathways for further investigation, including re-analysis of other GWAS datasets for the same diseases.