 Methodology Article
 Open Access
 Published:
Genotype distributionbased inference of collective effects in genomewide association studies: insights to agerelated macular degeneration disease mechanism
BMC Genomics volume 17, Article number: 695 (2016)
 The Erratum to this article has been published in BMC Genomics 2016 17:752
Abstract
Background
Genomewide association studies provide important insights to the genetic component of disease risks. However, an existing challenge is how to incorporate collective effects of interactions beyond the level of independent single nucleotide polymorphism (SNP) tests. While methods considering each SNP pair separately have provided insights, a large portion of expected heritability may reside in higherorder interaction effects.
Results
We describe an inference approach (discrete discriminant analysis; DDA) designed to probe collective interactions while treating both genotypes and phenotypes as random variables. The genotype distributions in case and control groups are modeled separately based on empirical allele frequency and covariance data, whose differences yield disease risk parameters. We compared pairwise tests and collective inference methods, the latter based both on DDA and logistic regression. Analyses using simulated data demonstrated that significantly higher sensitivity and specificity can be achieved with collective inference in comparison to pairwise tests, and with DDA in comparison to logistic regression. Using agerelated macular degeneration (AMD) data, we demonstrated two possible applications of DDA. In the first application, a genomewide SNP set is reduced into a small number (∼100) of variants via filtering and SNP pairs with significant interactions are identified. We found that interactions between SNPs with highest AMD association were epigenetically active in the liver, adipocytes, and mesenchymal stem cells. In the other application, multiple groups of SNPs were formed from the genomewide data and their relative strengths of association were compared using crossvalidation. This analysis allowed us to discover novel collections of loci for which interactions between SNPs play significant roles in their disease association. In particular, we considered pathwaybased groups of SNPs containing up to ∼10, 000 variants in each group. In addition to pathways related to complement activation, our collective inference pointed to pathway groups involved in phospholipid synthesis, oxidative stress, and apoptosis, consistent with the AMD pathogenesis mechanism where the dysfunction of retinal pigment epithelium cells plays central roles.
Conclusions
The simultaneous inference of collective interaction effects within a set of SNPs has the potential to reveal novel aspects of disease association.
Background
A key focus of modern genetic research is the relationship between genomic variations and phenotypes, including susceptibilities to common diseases [1]. Recent advances in genomewide association studies (GWAS) have greatly enhanced our understanding of such genotypephenotype relationships [2–9]. In many cases, however, a large portion of the expected heritability information remains to be discovered. It has recently been shown that metaanalyses involving increasingly large sample sizes can yield many additional loci of statistical significance [10, 11]. Another potential source of such ‘missing heritability’ is the contribution of rare variants not detected by populationbased genotyping platforms. Recent studies based on exome and wholegenome sequencing data combined with statistical tests including burden tests [12], Calpha test [13], and sequence kernel association test [14] are beginning to address such possibilities. It is also expected, however, that the limitation of independent single nucleotide polymorphism (SNP) analyses, where each locus is considered separately to evaluate its association with disease using trend tests or logistic regression models [15], and possible effects of epistasis also contribute to the limited degree of biological effects uncovered so far.
Many studies have addressed the issue of incorporating such intervariant interactions, or epistasis, in GWAS [16, 17]. Main approaches include machinelearning techniques [18–23], entropybased methods [24], principal component analyses [25–27], and the genomewide interaction analysis considering all distinct pairs of SNPs [28–31]. One useful strategy, in particular, is to extend parametric models to many SNPs that have been suitably selected, and inferring interaction effects under a multivariate statistical setting. Previous works within this framework include those based on lassopenalized logistic regression [32, 33]. Under the setting of inference on many interacting SNPs, the dimensionality of the underlying model is of the order of m ^{2}, where m is the number of SNPs that are considered simultaneously, with m=1 and 2 corresponding to the independentSNP and pairwise tests, respectively. To prevent model overfitting, highdimensional inference with limited sample sizes requires regularization, whose values can be determined by crossvalidation. Ayers and Cordell performed a comprehensive study of the performance of different penalizer choices on noninteracting SNP inferences [34].
This class of methods within the context of GWAS so far exclusively used logistic (or linear) regression analyses for casecontrol (or quantitative phenotype) data, which parallels their similarly widespread adoption in the general statistical learning literature. One may note, however, that the actual training data sets in GWAS are collected from case and control populations with distinct genotype distributions. The likelihood of the data to be maximized for inference is given by the joint probability of both genotypes (predictor variables) and phenotypes (response variables). In (logistic) regression, this joint probability is replaced by the probability of phenotypes conditional to genotypes, and the marginal probability of genotypes is assumed to be uniform.
In statistical learning, discriminant analysis is another widely used option for classifying continuous random variables in addition to logistic regression [35–37]. This class of inference methods offers alternative approaches that fully model the joint distribution of predictor and response variables (Section 4.4.5 in Hastie et al. [37]) at the expense of assuming specific predictor distributions (usually multivariate normal distributions). It has been estimated that, for continuous variables, the accuracy of logistic regression models can be lower by ∼30 % than that of discriminant analyses for a given sample size [35, 37].
Genotype distributions within populations from which GWAS samples are collected are also far from uniform, and it is of interest to examine the utility of discriminant analysistype approaches to disease association inference under highdimensional settings, which is the main focus of this paper. The standard discriminant analysis, however, is applicable only for continuous variable predictors. A related approach, the discriminant analysis of principal components by Jombart et al. [38], applies discriminant analysis to principal components (continuous variables) of allele frequencies for unsupervised learning of population structures. We report here, as a major innovation, an adaptation of discriminant analysis to the case of discrete genotype data (discrete discriminant analysis; DDA).
Our inference includes the causal effects of both marginal singleSNP terms and their interactions. These effects are estimated simultaneously, rather than separately as in independentSNP and pairwise analyses. We refer to such combined effects of singleSNP and interaction contributions as the collective effects of disease association. This level of description is analogous to that of the logistic regression inference performed by Wu et al. [33] in terms of the nature of SNP effects included in the modeling. Association studies have two distinct but related goals: inference and prediction. In inference (also known as feature selection), one aims to identify a subset of SNPs that are deemed to be causal, while in prediction, the goal is to apply the trained model and predict the disease status of unknown samples. IndependentSNP analyses widely performed in GWAS, either based on trend tests or logistic regression models with marginal SNP effects only, are mainly geared toward inference. In contrast, the penalized logistic regression including collective effects [33] is more suited to prediction, because the disease risk parameters are optimized directly via maximum likelihood without reference to population structures.
Our method offers a comprehensive approach achieving both inference and prediction by training models to genotype distributions of case and control groups separately under penalizers. The regularization using crossvalidation optimizes prediction capability, while for inference, we derived effective pvalues of the overall singleSNP (we use this terminology to refer to the contribution each locus makes by itself to the overall association, usually in the presence of interactions) and interaction effects using likelihood ratio tests. To our knowledge, the performance comparison of interaction effect detection between pairwise tests and (logistic regression) collective inference has not been reported yet. Our results based on simulated data indicate that collective inference provides far higher sensitivity for interactions than pairwise tests. Compared to penalized logistic regression, DDA yielded further advantages in sensitivity and specificity.
Our current collective inference implementation allows for the maximum likelihood inference of systems containing up to ∼10^{4} SNPs. However, evaluating interaction pvalues of SNP pairs by permutation resampling increases the computational cost by orders of magnitude, limiting the number of SNPs that can be considered in practice. In addition, the requirement for preselection of variants based, for example, on independentSNP pvalues, limits the possibility for discovering novel loci whose effects are significant only when interactions are taken into account. To deal more directly with genomewide data in an unbiased fashion, we describe a second mode of DDA application where ∼10^{6} SNPs are grouped into (∼10^{3} or more) subsets based on phenotypeindependent criteria (e.g., biological pathways), the collective inference is applied to each subset, and their relative importance in disease association is evaluated based on crossvalidation prediction score. This protocol significantly expands the power of SNPbased pathway analysis beyond existing enrichmentbased methodologies [39] by allowing for the incorporation of collective interaction effects within each pathway.
By applying our algorithm to the disease data of agerelated macular degeneration (AMD) [40, 41], we demonstrate that the enhanced ability to account for interaction effects can translate into novel biological findings. AMD is a progressive degenerative disease affecting individuals in old age, characterized by the accumulation of deposits (drusen) in the retina or choroidal neovascularization, which can lead to vision loss. Genomewide studies of AMD constitute one of the earliest and most successful applications of GWAS [2, 3, 40–43], where strong associations were detected and later validated at variants including those near complement pathway genes CFH, C2, and C3, in addition to the ARMS2/HTRA1 loci. However, direct molecular mechanisms tying these associated loci into disease pathogenesis remain unclear. Using AMD casecontrol data, we first analyzed detailed interaction patterns within SNPs selected based on independentSNP association strengths. These interactions were enriched in loci epigenetically active in tissues including adipocytes, mesenchymal stem cells, and the liver. We then applied DDA to pathwaybased groups formed from genomewide data and found high association with pathways involved in phospholipid synthesis, cellular stress response, apoptosis, and complement activation.
Results and discussion
Our algorithm (DDA) extends the discriminant analysis to discrete genotype data. Its overall steps are summarized in Fig. 1 and described in Methods (see Additional file 1: Text S1 for more indepth details).
Independent SNPs
When interactions between the loci are turned off, DDA can be solved analytically (see Additional file 1: Text S1), whereas logistic regression is always numerical. We first compared this special case of DDA and logistic regression without interaction and found the odds ratio and power to be identical for all conditions for binary models (Additional file 2: Figure S1), which implies that the effect of marginal genotype distributions ignored in logistic regression is negligible for a single noninteracting locus. However, since DDA can yield pvalues of each locus without numerical optimization, it leads to considerable computational speedup even when interactions are not included.
Simulation
We compared pairwise tests, logistic regression, and DDA in similarly wellcontrolled but highdimensional conditions in which collective effects can play important roles. In the following, unless otherwise specified, logistic regression refers to the collective inference including both marginal and interaction terms and a penalizer (see Methods). We used simulated data that faithfully reflected prescribed genotype distributions of given sample sizes. A genotype distribution for a binary model with m loci has m singleSNP and m(m−1)/2 interaction parameters. We specified these parameters randomly from normal distributions, generated genotype samples of size n based on these distributions, performed pairwise marginal inference, logistic regression, and DDA, and compared inferred parameters with the true values (Additional file 3: Figure S2 shows examples for the dominant and genotypic models). Our simulated data include linkage disequilibrium (LD): if one approximates the genotype distribution as a continuousvariable normal distribution, the correlation within a single group (case or control) would be proportional to the matrix inverse of interaction parameters specified, and the overall r ^{2} would correspond to the sample sizeweighted average over case and control groups.
For a given sample, we first determined the optimal penalizer (λ) value by crossvalidation. With increasing λ, the mean square error and the area under the curve (AUC) of the receiver operating characteristics generally showed a minimum and maximum, respectively, at λ ^{∗} (Additional file 4: Figure S3). The value of λ ^{∗} decreased as the sample size n increased. This trend implies that for small n, an aggressive regularization is needed (large λ ^{∗}) to minimize overfitting, while for larger n, more interaction terms are inferred with sufficient significance, leading to smaller λ ^{∗}.
Accuracy of inference We compared results of pairwise marginal tests using PLINK [44], logistic regression, and three versions of DDA [exact enumeration; EE, pseudolikelihood; PL, and mean field; MF (see Methods)] in two different simulation settings. In the first case (Fig. 2 a–b), we used m=10 SNPs with parameters chosen such that all sites had relatively large and strong singleSNP and interaction effects. We used the dominant model in these simulations in order to facilitate sampling, which requires exhaustive enumeration of all genotypes (2^{m} and 3^{m} for binary and genotypic models, respectively). Pairwise tests derive odds ratios and pvalues for each SNP pair separately, and the logarithm of the interaction odds ratio corresponds to the interaction parameter. The mean square error of pairwise inference decreased slightly from sample size n=10^{2} to 10^{3} but showed little improvement for larger sample sizes. Outcomes from logistic regression and DDA exhibited AUC values (maximized with respect to λ for each sample) increasing with n for n≤10^{3}. The AUCs from logistic regression were slightly lower for n≤10^{3} than DDA and comparable in larger sample sizes. The mean square error of logistic regression and DDA steadily decreased (approximately linearly in loglog scale) over all n ranges examined. Error levels of DDA from three methods were similar to one another. When compared to logistic regression, the accuracy of DDA was comparable at larger n and better at smaller n. However, the logistic regression results showed much larger variances (with respect to different realizations of samples) for small n than DDA.
In the second setting (Fig. 2 c–d), we enlarged the system to m=20 SNPs (EE omitted due to computational costs), and set the parameters such that only 4 SNPs contributed to disease association. The AUC values were smaller in comparison to the first setting for smaller n, which reflects a weaker overall strength of disease association, but otherwise showed similar trends. The accuracy of pairwise tests, logistic regression, and DDA also exhibited trends similar to simulations in Fig. 2 a: both logistic regression and DDA significantly outperformed pairwise tests, while DDA consistently showed slightly better accuracy than logistic regression. The variances in mean square error were smaller than in the first setting, which suggests that these variances correlate with the number of causal SNPs. For n=10^{2}, logistic regression results had a variance much larger than DDA for small n.
These simulations demonstrate that when both marginal singleSNP and interaction effects are included, the accuracy of collective inference approach is much higher than that of pairwise tests. The DDA generally provides a further edge for smaller sample sizes in comparison to logistic regression. The comparison of two different simulation settings in Fig. 2 a–b and c–d demonstrates that this trend is not significantly altered with increases in the number of SNPs and the fraction of causal SNPs among them. The accuracy (inferred model parameters versus true values) remained at similar levels when the underlying model was changed from dominant to genotypic models (Additional file 3: Figure S2).
Statistical tests We then examined the performance of collective inference methods in evaluating the statistical significance of individual interactions. In GWAS, the significance of SNPs and their interactions are tested either by contingency table or likelihood ratio tests [15]. The presence of the penalizer λ complicates this approach in collective inference. In their study of lassopenalized logistic regression collective inference, Wu et al. [45] adopted the approach of first selecting significant SNPs of a certain size using regularization, and then calculating pvalues of interactions with the penalizer turned off. A disadvantage of this approach is that the information of the relative importance of each interactions reflected in the penalized model is lost when λ is set to zero.
The (analytic) likelihood ratio tests rely on the asymptotic distribution of the likelihood ratio statistic q (q _{ i } and q _{ ij } for a site i and pair i,j): as n→∞, the distribution of q under the null hypothesis approaches the χ ^{2}distribution with degrees of freedom (d.f.) given by the change in the number of parameters between the null and alternative hypotheses [46]. In practice, however, with a finite n, the deviation from this asymptotic limit can be significant. We found the null distribution to show increasingly large deviations from the asymptotic limit as λ increased. We therefore based our statistical tests in the presence of a nonvanishing penalizer on empirical null distributions of q _{ ij } constructed by permutation resampling (Additional file 5: Figure S4).
We then sought to evaluate the sensitivity of causal interaction identification within different inference methods using simulations. We created simulated data of m=10 SNPs, this time with random parameters with mean values that were identical for both case and control groups, except a single SNP pair for which the case group had stronger interactions than the control (Fig. 2 e–f). For collective inference (logistic regression and DDA), we first performed crossvalidation for each sample to determine optimal λ ^{∗}, and then constructed the empirical null distribution under this λ ^{∗} (Additional file 5: Figure S4) to calculate pvalues of the causal and noncausal interaction pairs (Fig. 2 e and f, respectively). We selected the simulation parameters such that the SNPs were fairly strongly coupled via LD in each of case and control groups, but these interaction effects were expected to cancel out except for the single causal pair. The pairwise test results remained insensitive to this causal interaction for all sample sizes. The logistic regression, DDA PL, and EE methods detected this interaction fairly robustly for n≥10^{3}. In all cases, DDA had higher sensitivity than logistic regression. The pvalues for the noncausal interactions mostly followed the expected null distribution qualitatively. However, the distributions from logistic regression were significantly broader (higher false positive rates; lower specificity) than DDA for all sample sizes.
In applications to actual disease data, where one aims to identify statistically significant pairs of interactions based on pvalues, the enhanced sensitivity and specificity of detection shown in Fig. 2 e are of more interest than the parameter prediction accuracy in Fig. 2 a, c. Our results suggest that the sensitivity of detecting diseaseassociated interactions among mostly noncausal SNP pairs from noisy data is significantly higher with collective inference than with pairwise tests. The DDA inference furthermore allowed for consistently higher sensitivity and specificity than logistic regression.
Agerelated macular degeneration
IndependentSNP We first analyzed AMD data under the independentSNP assumption and compared the logistic regression and DDA results. Analytic expressions are available for the odds ratio and pvalues for DDA [Eqs. (S24), (S27) and (S28) in Additional file 1: Text S1]. Genomewide pvalues derived from DDA (Additional file 6: Figure S5) were consistent with published results [41]. The pvalues from independentSNP logistic regression using PLINK and those from DDA for three main associated genomic regions (CFH, C2/CFB, and ARMS2 gene groups in chromosomes 1, 6, and 10, respectively) were the same for most loci except those with strongest associations, for which pvalues from DDA were slightly smaller (Additional file 7: Figure S6). Differences in pvalues were larger with the genotypic model than with the dominant model (Additional file 1: Table S1). Thus, when interactions are not included, DDA gives nearly the same results as the logistic regression inference. This feature allows one to directly interrogate how collective interactions modify association.
Collective inference for 20 SNPs We then examined the performance of DDA on AMD data under the first mode of application, where detailed interaction patterns within a relatively small set of preselected SNPs are inferred. We selected m=20 AMD SNPs using the variable selection program GWASelect [47] (see Additional file 1: Table S1), which covered most regions previously identified as strongly associated with AMD risks (Additional file 6: Figure S5 and Additional file 7: Figure S6). The independentSNP pvalues of this set are shown in Fig. 3 c. For the majority of loci (18 out of 20), the risk allele was the major allele, and odds ratios were smaller than 1. As stated above, under this condition of no interaction, the pvalues from logistic regression (from PLINK) and those from DDA (analytic) were nearly the same.
We applied collective inference (including interactions) to this 20SNP set using logistic regression and DDA. We first performed crossvalidation to determine the optimal penalizer λ ^{∗} (Fig. 3 a–b). It should be noted that because the preselection of SNPs in this case used phenotype information of the entire sample, the crossvalidation prediction score is not an unbiased estimate of the true AUC and is generally higher in value [37]. In our application, this procedure merely allows for the identification of optimal regularization levels for collective inference. We denote the prediction score derived after such preselection using sample phenotypes as pseudoAUC (pAUC) in order to distinguish it from the true estimate of AUC. Unbiased estimates of AUC, if desired, can be obtained, for example, by performing independentSNP pvaluebased filtering based only on training sets of each crossvalidation subdivision [37] (see below) or by using selection criteria unrelated to sample phenotypes (e.g., pathways).
As observed with the simulated data, when DDA was used, the pAUC values with varying regularization levels showed a maximum (Fig. 3 a–b), which corresponds to the optimal degree of interaction effects to be included in genotype distributions. For PL (Fig. 3 a), the maxima were located at λ ^{∗}=0.05 (pAUC=0.751) and λ ^{∗}=0.02 (pAUC=0.765) for the dominant and genotypic models, respectively. The slightly higher pAUC suggests that for this data set, the genotypic model provides a better fit. For DDA, we verified that in the large λ limit, the inference outcome approaches the independentSNP result. The difference between this limit and the maximum pAUC is a measure of the relative importance of interactions in disease association.
We also applied logistic regressionbased collective inference to the same data set. Crossvalidation yielded similar differences between the dominant and genotypic models (Fig. 3 a). However, pAUC did not exhibit pronounced maximum, instead approaching a large λ limit nearly monotonically. This limit was slightly higher than the corresponding DDA maximum, which is consistent with the expectation that logistic regression can yield better prediction performance because it maximizes the prediction score [Eq. (8) in Methods]. On the other hand, the absence of pronounced maximum in pAUC as a function of λ indicates a loss in sensitivity in logistic regression for the detection of interaction effects. This conclusion can be rationalized in terms of the algorithmic difference between logistic regression and DDA: in DDA, case and control group genotype distributions are fit separately in terms of their respective singleSNP and interaction parameters, whereas logistic regression optimizes the prediction score with respect to the net differences in those parameters. With more flexibility to account for differential population structures, DDA has higher sensitivity to detect interaction effects.
Figure 3 b shows the analogous model optimization under the DDA MF method, where regularization parameter values ε=0 and ε=1 each correspond to independentSNP and full interaction limits, respectively. The maximal pAUC values from MF were similar but slightly lower in comparison to PL. On the other hand, MF is more computationally efficient that PL and allows for larger SNP sets.
We used the optimal penalizer value to determine the parameters and pvalues for this 20SNP data set under the genotypic model using DDA PL. The pvalues, representing the statistical significance of the individual terms included in the model, consist of two classes: singleSNP and interactions. The singleSNP pvalues represent the significance of marginal singlesite effects (associated with parameters \(h_{i}^{(y)}\) or β _{ i }). They are analogous to the independentSNP pvalues of each SNP, but having been inferred in the presence of interactions, they also indirectly reflect interaction effects. Strictly speaking, the presence of penalizer λ also affects the distribution of the likelihood ratio statistics q _{ i } and it is desirable to estimate their pvalues using permutation resampling. However, since we did not impose penalty to singleSNP terms directly [Eq. (4) in Methods], we expect this effect to be moderate. In practice, these pvalues tend to be much smaller than 1 for SNPs selected based on independentSNP properties, and they are difficult to estimate using resampling. We chose to use the asymptotic χ ^{2}distribution to estimate these singleSNP pvalues under collective inference. These are therefore expected to be upperlimits (i.e., actual pvalues are expected to be smaller) based on the observation that the penalizer tends to suppress null distributions to lower qregion.
Figure 3 c shows the collective inference singleSNP pvalues of the m=20 AMD data from DDA. They largely retained the relative strengths of significance in independentSNP results, while in absolute magnitudes the − log10p values were mostly reduced. This feature indicates that in comparison to the independentSNP model where singleSNP parameters also contain average effects of interactions, when collectively inferred, these terms make reduced contributions to association because they represent singlesite effects only. We also performed analogous calculations using logistic regression, adopting the lowest value of penalizer λ=0.1 at which pAUC reached the limiting value in Fig. 3 a. The singlesite pvalues showed larger deviations from the independentSNP results (Fig. 3 c), with values for many sites becoming insignificant.
We then performed resamplings of this data set to obtain interaction pvalues (Fig. 3 d), which indicated strong interactions within the CFH gene group in chromosome 1, C2 in chromosome 6, and ARMS2/HTRA1 group in chromosome 10. In contrast, pairwise test pvalues detected strong interactions only within the last group of loci, between rs6585827/rs2280141 and rs2014307/rs2248799 (p∼10^{−9}). These shortrange interactions suggested by DDA tended to be correlated with LD: because the net disease association is related to the difference in SNP correlation patterns between case and control groups (Fig. 1), we interpret these shortrange interactions as the consequence of differential LDpatterns in case and control individuals. The absence of such signals in pairwise test pvalues suggests that such differences get averaged out when represented only by marginal SNPpair distributions.
We sought to further test whether such increased sensitivity toward interactions was achieved with adequate control for false discovery rates. The selection of m=20 SNPs in Fig. 3 c–d comprises SNPs with highest association. For comparison, we made a random selection of m=20 SNPs from the genomewide data and performed DDA as well as pairwise test. The quantilequantile plot (Fig. 4 a) showed that pvalues for interactions between these random SNPs were distributed close to the null distribution. In particular, DDA and pairwise outcomes were similar, except for one pair for which DDA predicted p∼10^{−3}. In contrast, the distribution of interaction pvalues for the highly associated m=20 SNPs (Fig. 3) from DDA deviated significantly from the null (Fig. 4 b), whereas the pairwise test outcome remained similar to random SNPs except for ∼10 SNP pairs. These results suggest that DDA achieves increased sensitivity for interactions while adequately controlling for false positive rates.
Largescale collective inference The analysis described above used a fixed number of preselected SNPs to perform crossvalidation and inferences. We next enlarged the size of SNP sets by controlling it using an independentSNP pvalue cutoff p _{ c }; with the cutoff specified, in each crossvalidation run, the training set was used to obtain independentSNP pvalues, filter SNPs, and perform inferences, and the test set was used for prediction. The prediction score derived under this protocol is an unbiased estimate of the true AUC. The AUC values (Fig. 5 a) showed qualitative trends similar to Fig. 3 a; the AUC maximum relative to the noninteracting limit was more pronounced, while its height showed a moderate decrease with increasing SNP numbers: inclusion of lesssignificant SNPs diluted the overall effects. We chose a SNPset size of m=96 (p _{ c }=10^{−5} without crossvalidation) for interaction pattern analysis. The interaction pvalue computation for m SNPs entails a multiplication of the singleinference computing time by m(m−1)/2 (the number of pairs) times the necessary random resampling size (∼10^{3} or more) for pvalue estimation, thus limiting model sizes that can be considered to m∼100.
The resulting singlesite and interaction pvalues are shown in Fig. 6, where the bottom panel compares the independentSNP/collective singlesite pvalues. As in Fig. 3 d for the smaller SNP set, the collective singlesite significance of strongly associated SNPs was generally reduced in strength compared to the noninteracting case. However, rs932275 in chromosome 10 had a comparable pvalue (strongest within the collective inference results) and many SNPs originally with weaker associations in the noninteracting case became stronger under collective setting. The interaction landscape shown on the top panel of Fig. 6 retained the qualitative trend of the results from the smaller data set in Fig. 3 d, but with much more detail; we confirmed the presence of local interactions within the CFH, C2, and ARMS2 gene groups. In addition, we observed numerous ‘longrange’ interactions that were absent in the m=20 results: rs2284664 in CFH interactions with rs511294 and rs544167 in C2, and there were additional distributed interactions between the CFH loci and the ARMS2 gene group. The distribution of interaction pvalues was similar to that for m=20 in the quantilequantile plot (Fig. 4 b). The pairwise test pvalue landscape for the same data (Additional file 8: Figure S7) was also qualitatively similar to the m=20 case (Fig. 3 d, bottom).
The overall picture of diseaseassociated epistatic interactions from our small and largerscale collective inferences in Figs. 3 d and 6 provides a plausible explanation of the recent observation by Hemani et al. [31], who detected many epistatic SNP pairs leading to differential gene expressions within the human genome by exhaustive searches using pairwise tests. Wood et al. [48] then observed that many of these effects could be explained by single untyped third SNPs in LD with the interacting pairs. Here, we observed both from simulations and AMD SNP analyses that pairwise tests (Fig. 3 d, bottom and Additional file 8: Figure S7) detect only a subset of statistically significant interactions, and a portion of the interaction patterns uncovered from collective inference parallels that of the underlying LD (Fig. 6 and Additional file 9: Figure S8): SNPs with strong overall correlations often also have differential LD between case and control groups. It is thus understandable that interacting pairs of SNPs identified in marginal pairwise tests turn out to be in LD with other hidden variants. Our results in Fig. 6, however, demonstrate a strong presence of interactions beyond both the population LD (Additional file 9: Figure S8) and the reach of pairwise tests (Additional file 8: Figure S7).
Diseaseassociated epigenomes Most of the diseaseassociated loci from GWAS reside in noncoding regions, presumably exerting their effects through modifications of gene regulatory action [49]. The overlap of epigenetic signatures with diseaseassociated SNPs and their interactions can provide important biological insights to the underlying disease mechanism. We sought to identify tissue and cell typespecific interaction patterns associated with AMD phenotypes using the SNP interaction map we derived above (Fig. 6). We used the recently published NIH Roadmap Epigenomics Consortium data [50] to first calculate the enrichment pvalues of the transcribed/enhancer states among the selected 96 AMD SNPs within each of the 111 reference epigenomes (Fig. 7). We combined the actively transcribed and enhancer states of the 15 chromatin state annotations of the reference epigenomes to define the ‘active’ state. For each AMDassociated SNP, we identified the group of all known SNPs that were strongly correlated (high LD), obtained the distribution of epigenetic states over these SNPs within a given epigenome, and tested the overrepresentation of the active state against the background distribution. This enlarged search over the set of all known SNPs in LD with the locus included in inference is crucial to address the issue of the incomplete coverage of genotype data.
The most prominent feature in Fig. 7 is the strong enrichment of active epigenetic states among AMD SNPs within the liver tissue (E066), followed by ovary (E097). Two additional epigenomes, embryonic stem cellderived neuronal progenitor cultured cells (E009) and bone marrowderived mesenchymal stem cells (MSCs; E026), were also notable, but their enrichment pvalues on the singleSNP level were comparable to other tissues.
We then hypothesized that the statistically significant interactions between SNPs identified in Fig. 6 would provide additional information of the celltype specificity of epigenetically active states and their interactions. We selected the SNP pairs with interaction p<10^{−3} from Fig. 6 and, assuming that each groups of LDcorrelated SNPs came from specific cell types (111×112/2 possible pairs, including selfinteractions), tested the enrichment of active state pairs. The pvalues derived then reflect the statistical significance of the epigenetic modifications enabling the interactions occurring between two cell types that are diseaseassociated.
The resulting landscape shown in Fig. 8 exhibited strong interactions involving the liver tissue, consistent with the singleSNP result in Fig. 7. However, clear patterns not seen on the singleSNP level also emerged: bonemarrow derived MSCs (E026) and adipose nuclei (E063) also featured prominently in the interaction landscape; the bulk of interactions involving the liver tissue was accounted for by those with MSC, adipocytes, and muscle tissues. Embryonic stem cell H1derived MSCs (E006) showed interactions that were weaker but similarly distributed in comparison to bonemarrow derived counterparts. The ovary followed patterns similar to the liver but was less pronounced than in Fig. 7. In addition, lung (E096) and placenta (E091) showed some interactions with adipocytes and MSCs. All of these tissues strongly interacted with themselves: interacting SNPs in these tissues are highly likely to be active epigenetically.
SNP selection from genomewide data Collective inference without interaction pvalue computation can be applied to SNP sets of sizes up to m∼10^{4}. The prediction AUC as the main outcome for each SNP selection then allows for the comparison of the relative strengths of disease association of different SNP groups. In such applications, the performance of DDA could depend on (phenotypeindependent) processing applied to genomewide data in selecting SNP sets for analysis. We evaluated this second mode of DDA application and assessed how its performance varied depending on the degree of LD within SNP sets. We generated different subsets of genomewide SNPs by pruning, removing variants that had LD with neighboring SNPs higher than a threshold (Fig. 5 b). The AUC obtained with SNPs selected from the full genomewide data peaked around the mean number of SNPs \(\bar {m}~\sim 50\), as suggested also by Fig. 5 a. With LDbased pruning, the position of maximum shifted to levels up to \(\bar {m}\sim 10\), which suggests that about 10 SNPs in linkage equilibrium account for the bulk of the association. The height of the AUC first increased with the data pruned with r ^{2}<0.5 compared to the full set and then decreased with r ^{2}<0.3, indicating that there is an optimal level of pruning beyond which key causal SNPs begin to be removed. Overall, the relatively small model size ranges where collective inference performance is maximized in Fig. 5 b suggests that AMD is only weakly polygenic with dominant contributions from a few loci. Analyses of the type demonstrated in Fig. 5 b thus allows one to assess the polygenicity of the phenotype under consideration and choose suitable strategies for SNP selection.
Pathwaybased SNP selection An obvious criterion for grouping genomewide SNPs into subsets for collective inferencebased evaluation is the proximity to gene sets belonging to known biological pathways. From the full AMD genomewide data, we generated 1,732 SNP sets corresponding to 1,732 Reactome pathways [51], each containing from 20 to ∼10^{4} SNPs. We then applied DDA MF inference and derived optimized AUC values for each pathway (Fig. 9 a). The majority of the pathways had association levels [median AUC: 0.514±0.021 (95 % C.I.)] close to the null value of 0.5. The differences in collective inference AUC relative to independentSNP results ranged from 0 to ∼0.06. Reflecting the dominance of the complementrelated genetic loci, Regulation of complement cascade (AUC: 0.688±0.018, m=448) and Complement cascade (AUC: 0.684±0.018, m=869) showed top association levels clearly separated from the rest. These AUC values were similar to the levels observed in p _{ c }based sampling in Fig. 5 b adjusted to their corresponding SNP numbers. We used a selection of pathways with low AUC values to sample their null distributions and connect AUC (as the statistic for each pathway) and pvalues corresponding to the overall association of each SNP set (Fig. 9 b). The − log10p values monotonically increased from 0 as AUC increased from 0.5, and became highly linear for AUC>0.52 (r ^{2}=0.94). We used this linear regression formula to translate AUC into pvalues. The Bonferroni correction with 1,732 pathways to the nominal false discovery rate indicated a threshold of AUC>0.567, which led to 13 pathways above the threshold shown in Table 1.
AMD disease mechanism We sought to gain insights to molecularlevel disease mechanisms of AMD by examining the pathways in Table 1 along with additional pathways near the threshold and grouping them into hierarchical classes (Fig. 9 c–e). There are two primary types of AMD, the ‘dry’ and ‘wet’ forms [52]. The dry AMD more commonly occurs in earlier stages, where retinal pigment epithelium (RPE) cells supporting photoreceptors in the retina undergo degeneration, often accompanied by the accumulation of drusen in the area between RPE and the Bruch’s membrane separating the retina from the choroid. The wet AMD is characterized by invasive choroidal neovascularization of the retina. In both forms, cellular stress factors exacerbated by aging are the primary causes leading to RPE dysfunction. The normal functioning of photoreceptors, bombarded by light and highly susceptible to oxidative damage, relies on continual recycling of their spent outer segments via phagocytosis by RPE cells [53]. Peroxidation products of phospholipids, the key ingredients of photoreceptors, often end up as major components of drusen, and serve as damageassociated molecular patterns activating innate immune receptors, including tolllike receptors (TLRs) as well as complement factor H (CFH) [54]. The latter has been shown to bind malondialdehyde (MDA) derived from docosahexaenoic acid [55]. In addition, phosphatidylserine is the main ‘eatme’ signal toward phagocytes when displayed on the extracellular membrane of dying cells [56]. Consistent with these aspects of AMD pathogenesis, we found associations with Phospholipid metabolism pathways (Fig. 9 e), and in particular, Synthesis of phosphatidic acid, which suggests that disease risk is affected by genetic variants modifying the ability to supply these phospholipids.
Phospholipid synthesis requires the supply of fatty acids, synthesized in the liver. The causal link to this process of lipogenesis is suggested in Fig. 9 e by the pathway Carbohydrate response elementbinding protein (ChREBP) activates metabolic gene expression. ChREBP is a key transcription factor in hepatocytes, responding to glucose uptake and activating genes involved in lipogenesis [57, 58]. Fatty acids thus synthesized are transported into the bloodstream in the form of very low density lipoproteins and stored as triacylglycerols in adipocytes [57]. The suggested AMD risk association of the fatty acid supply from the liver and adipocytes for phospholipid synthesis provides an explanation of our earlier finding in Fig. 8 that SNP interactions associated with AMD are epigenetically active in the liver and adipocytes.
Oxidative stress is often accompanied by disruptions to protein folding, which can lead to protein aggregation and autophagy when refolding by chaperones proves inadequate [59]. We found association in Protein folding pathways (Fig. 9 e), and in particular Chaperoninmediated protein folding, which primarily targets actins and tubulins, the major ingredients of cytoskeletal networks [60]. This observation suggests that RPE stress from protein misfolding affects AMD risk via its effect on phagocytic function, which relies heavily on actin filament and microtubule remodeling dynamics [56]. Also closely related is the Regulation of heat shock factor (HSF) 1mediated heat shock response in Fig. 9 e, which describes the transcriptional activation of heat shock protein (chaperone) expression under stress. The latter pathway belongs to the Cellular responses to stress group, in which we also found association with Senescenceassociated secretory phenotype (SASP). Senescence is one of the possible fates of cells under stress, where normal cellular growth is arrested in preparation for clearance by phagocytes [61]. SASP refers to a complex suite of inflammatory cytokines, chemokines, and growth factors facilitating the process, and we infer that senescence in RPE cells under oxidative stress plays a part in AMD.
Apoptosis, or controlled cell death [62], is another major stressedcell response, and was also represented in our results (Fig. 9 e). A large body of direct evidence points to apoptosis as one of the main routes of RPE degeneration in AMD [63]. Induction of apoptosis upon stress is dictated by the action of master regulator p53, and it was recently shown that aging increases the activity of p53 in RPE cells and the likelihood for apoptotic cell death [64]. Consistent with this evidence, we found association with pathways in Transcriptional regulation by TP53 group (Fig. 9 d). In particular, Regulation of TP53 activity through methylation was among the top pathway in our association analysis (Table 1), suggesting that p53 modification by methylation and the closely related histone modifications [Protein lysine methyltransferases (PKMTs) methylate histone lysine in Fig. 9 e] play important roles in RPE apoptosis regulation. In the intrinsic apoptotic pathway induced by oxidative stress, cytochrome c is released from mitochondria into the cytosol, binding and activating caspases, the main proteases central to apoptotic action. We found association in pathways involving ‘inhibitor of apoptosis’ (IAP) and its negative regulator ‘second mitochondrial activator of caspases’ (SMAC) [65], which suggests that disruption to regulatory mechanisms preventing apoptosis in RPE cells may play roles in AMD.
RPE degeneration and drusen formation can lead to inflammation, the main innate immune response involving a wide range of patternrecognition receptors (PRRs) and complement activation [66]. Most of known PRRs were represented in our results (Fig. 9 c), including TLRs, advanced glycosylation endproduct receptors, RIGIlike receptors, and cytosolic DNA sensors [66]. Complement factors constitute the soluble counterparts of PRRs, and Regulation of complement cascade showed the highest association due to the contribution of CFH, as well as Activation of C3 and C5. CFH normally protects self tissues from complementinduced destruction by binding to a range of surface signals including glycoproteins and Creactive protein. In addition to the binding of CFH to MDA noted above, it was also reported that CFH inhibits lipoprotein binding toward Bruch’s membrane [67]. The breach of Bruch’s membrane and the intrusion of blood vessels into the retina are the hallmarks of wet AMD [52], which are consistent with our finding of high association in Degradation of extracellular matrix and Common pathway of fibrin clot formation in Fig. 9 e.
Conclusions
In this paper, we first described and tested discriminant analysisbased algorithms inferring collective disease association effects applied to intermediatesized SNP sets. Using simulated and actual disease data, we provided evidence suggesting that collective inference methods outperform pairwise tests and logistic regression in incorporating interaction effects in disease association.
We demonstrated two different modes of applying DDA in the analysis of actual disease data: one in which detailed interaction patterns within a relatively small set of SNPs are inferred, and the other where genomewide SNP data are grouped into different subsets of SNPs and collective inference is used to compute the degrees of disease association of each subset. Our results applied to AMD in Fig. 9 based on pathwaybased SNP selection, in particular, show that the latter protocol allows us to identify pathways encompassing a large fraction of disease mechanisms previously established by nongenetic means. Based on current study, we propose the following approach to deal with novel GWAS casecontrol data using DDA: first, characterize the degree of polygenicity of the data set with independentSNP and collective inferences employing p _{ c }based SNP selection and optimize the density of SNPs using LDbased pruning (Fig. 5). Second, classify SNPs into pathwaybased groups, score them using collective inference, and seek insights to the underlying disease mechanisms by analyzing the results within the pathway hierarchy.
Methods
Genotype distribution of casecontrol groups
Our algorithm is best understood in comparison to the classical continuous variable discriminant analysis. Table 2 outlines the similarities and differences of classical (continuous variable) and discrete (our adaptation) versions of discriminant analyses. In the continuous variable case, one aims to classify data into two known groups, case and control (denoted by y=1 and y=0, respectively), based on predictor a, a vector of dimension m. Classification (and inference) are performed by maximizing the likelihood of model parameters given the training data of known class identities, i.e., the joint probability
where p _{ y } is the marginal probability of group membership. One then finds the classspecific mean vectors μ _{ y } and covariance matrices Σ _{ y }. These quantities define the predictor distribution within each class, which are assumed to follow a multivariate normal distribution: a∼N(μ _{ y },Σ _{ y }), or
where the superscript t denotes transpose. In this formulation, the maximum likelihood condition for Eq. (1) is equivalent to maximizing the discriminant function δ _{ y }(a) given by the exponent of Eq. (2) plus lnp _{ y }, which is used to classify an arbitrary data a into case if δ _{1}(a)>δ _{0}(a) and control otherwise [37]. It is also useful to note that if we assume that a is a scalar, this framework reduces to ttests for the null hypothesis μ _{0}=μ _{1}.
For our application, the predictor a is the collection of genotypes, which is discrete. The description here applies to the binary model (dominant or recessive), such that a _{ i }=0,1 represent aa and Aa/AA for SNP i for the dominant model, and aa/Aa and AA for the recessive model (see Additional file 1: Text S1 for the genotypic model). Figure 1 illustrates the general spirit of the DDA algorithm. Training data of known phenotypes can be used to obtain allele frequency vectors and covariance matrices with elements \({\hat f}_{i}^{(y)}\) and \({\hat f}_{ij}^{(y)}\), respectively, where i,j=1,⋯,m are SNP indices. Note that these quantities are the exact counterparts of the continuous variable mean μ _{ y } and covariance Σ _{ y }. We model the genotype distribution within each class in a form analogous to Eq. (2) [68]:
where \(h_{i}^{(y)}\) and \(J_{ij}^{(y)}\) are model parameters of the distribution that we refer to as singleSNP and interaction parameters, respectively. Comparing Eqs. (2) and (3), one can observe that these parameters \(\psi _{y}\equiv \{h_{i}^{(y)},J_{ij}^{(y)}\}\), each multiplying predictor a in linear and quadratic fashion, respectively, are expected to be related to frequencies \({\hat f}_{i}^{(y)}\) and \({\hat f}_{ij}^{(y)}\). In contrast to the continuous case, however, the exact form of this relationship is unknown due to the discrete nature of a, except for the special case of independent SNPs (see Section S1.5 in Additional file 1: Text S1; we refer to the special case of no interaction as the independentSNP case).
The inference of this relationship is the major computational component of DDA, and is based on maximizing the loglikelihood (L _{ y }) per individual,
where the first summation is over genotype data of n _{ y } individuals in group y, and λ denotes a regularization parameter (penalizer) that controls the contribution of the SNP interactions in comparison to the independentSNP case. The independentSNP limit is reached with λ→∞, where optimal values of \(J_{ij}^{(y)}\) all become zero due to high penalty. We opted for an l _{2}penalizer rather than l _{1}; the latter generally exerts stronger effects [69] but l _{2} is analytic and facilitates nonlinear optimization. In Additional file 1: Text S1, we show implementations of three possible ways to perform this inference of varying computational cost and reliability: exact enumeration (EE), mean field (MF) [68], and pseudolikelihood (PL) [70, 71] methods. The EE is essentially exact, but requires enumerations of all possible genotypes, and can only be used for m∼25 or less. We used this option to assess the reliability of the other two methods. Both MF and PL are approximate and can be used for m∼10^{3} or larger. The MF option involves matrix inversion and requires a different regularization: instead of λ, we used ε∈[ 0,1], where ε=0 corresponds to the independentSNP limit with no interaction. The PL method uses λ>0 and has the advantage that it can be easily parallelized. We implemented parallel computations of PL using the message passing interface protocol.
Disease risk
Once genotype distributions of case, control, and pooled (whole sample) groups have been inferred, Bayes’ theorem allows one to obtain disease risk:
One can show that (Additional file 1: Text S1)
In other words, the singleSNP and interaction disease risk parameters θ={β _{ i },γ _{ ij }} are given by differences in genotype distribution parameters between case and control groups. In addition, the overall likelihood ratio statistic is given by the sum of L _{ y } subtracted by the pooled value (see Fig. 1). The parameter α is related to disease prevalence p _{1}=1−p _{0} (see Additional file 1: Text S1).
We used crossvalidation to determine the penalizer λ in Eq. (4). We first formed five training and test sets (of 4:1 size ratios) from the data and used the training set to select variants with independent SNP pvalues below a cutoff. We calculated the AUC for different λ values and found an optimal choice. Even when the actual AUC values obtained are not high enough for a reasonable risk prediction, this procedure still allows us to identify optimal ranges of the model size (the role of interactions).
We used disease prevalence p _{1}=n _{1}/n for crossvalidation because the training and test sets have the same sampling biases. In predicting risks with a prospective sample, however, this probability would have to be adjusted to known population phenotype frequencies. We implemented a software feature such that the disease prevalence, which affects the disease risk parameter α, can be respecified when a parameter set inferred from casecontrol data is applied to an independent test set.
Logistic regression
In contrast to DDA outlined above, the logistic regression uses
instead of Eq. (1), assuming that the marginal genotype distribution is uniform. The parameters α,β _{ i }, and γ _{ ij } are then directly determined by maximizing the likelihood of Pr(ya) only:
where n=n _{0}+n _{1}, with respect to α, β _{ i }, and γ _{ ij }. In general, these disease risk parameter values from logistic regression are different from those obtained via genotype distribution parameters ψ _{ y } in DDA with Eq. (6); the latter contains the effects of the nonuniform marginal genotype distribution Pr(a) of the sample, while logistic regression does not. For comparison, we also implemented this multivariate logistic regression with an l _{2}penalizer. The logistic regression can yield higher prediction AUC than DDA because by maximizing Eq. (8), one optimizes prediction directly. However, the quantity maximized in DDA given by Eq. (4) (or in fact the sum L _{0}+L _{1}; see Additional file 1: Text S1), rather than the prediction score, is the true likelihood.
Significance tests
We performed likelihood ratio tests to assess the statistical significance of the overall collective inference and individual loci/interactions. The pvalues derived are conditional to the number of loci m and penalizer value λ (determined from crossvalidation). The statistic was obtained by adding the loglikelihood values of case and control groups and subtracting that of the pooled group (see Text S1). We tested the significance of the singlelocus contribution to disease association from site i by calculating the likelihoods \(L_{y}[\!h_{i}^{(y)}=h_{i}]\), where the singleSNP parameters of site i were prescribed as their pooled values (restricted model), and evaluating the differences against the likelihood of the full model (all parameters optimized without restriction). Analogous tests were performed for SNP pair i,j with \(L_{y}[\!J_{ij}^{(y)}=J_{ij}]\). The restricted model corresponds to the null hypothesis that the parameters belonging to a particular locus or interaction in case and control groups are the same as for those in the pooled group. For interaction statistics, we used the approach of constructing the empirical null distribution of the statistics under a given λ by permutation resampling: the phenotype data of a given sample with a certain penalizer λ value was randomly reshuffled to obtain realizations of the likelihood ratio statistics. This sampling was repeated multiple times (up to ∼10^{4}) to construct empirical cumulative distribution functions of the statistic for each site, or SNPpair, from which the pvalues were estimated. For the singlelocus contribution statistics, we calculated pvalues using the asymptotic χ ^{2}distribution.
Simulation
In testing the inference algorithms using simulated data, samples of casecontrol genotypes containing m=10 or 20 loci and n=2n _{0} individuals were generated under randomly assigned parameters {ψ _{0},ψ _{1}}. The model parameters were chosen with \(h_{i}^{(y)}\sim N({\bar h}_{y},{\sigma _{h}^{2}})\), and \(J_{ij}^{(y)}\sim N({\bar J}_{y},{\sigma _{J}^{2}})\). To generate simulated data from these distributions, we evaluated summations over all (2^{m} for binary models) possible genotypes to calculate their probability distribution using Eq. (3). For a given sample, crossvalidation was first performed with λ values ranging from 10^{−4} to 10^{2} to determine the optimal value λ ^{∗} that maximizes AUC. The parameters θ were then derived using the full sample and λ ^{∗}. For DDA, the singleSNP and interaction parameters for case and control groups (ψ _{0} and ψ _{1}, respectively) were obtained and used to derive θ. In contrast, in logistic regression, θ was obtained directly. The mean square error was calculated for inferred θ relative to true values of all distinct singleSNP and interaction parameters. We performed different inferences using a common set of data for each realization of parameters. The mean square error was then averaged over 100 realizations of parameters. We also compared pairwise test results using PLINK [44] epistasis module. We used PLINK version 1.9 with logistic regression and either dominant or genotypic coding options.
Agerelated macular degeneration data
We obtained AMD casecontrol genotype data from the National Eye Institute Study of AMD Database (dbGaP accession number phs000684.v1.p1), which consisted of 2, 159 case and 1, 150 control individuals. Autosomal SNPs were filtered with the criteria of MAF >0.01, HardyWeinberg equilibrium pvalue >10^{−6}, and genotyping rate >0.05 [41] to yield 324, 713 SNPs in total. IndependentSNP DDA analyses were performed using Eq. (S24) and (S27) in Additional file 1: Text S1 and compared with logistic regression results from PLINK [44] in addition to our numerical logistic regression implementation. For each SNP, the minor allele was identified from the allele frequencies over the pooled sample.
Except otherwise stated, inferences on AMD data used the genotypic model. In all cases, λ (or ε for MF) was first determined from crossvalidation (optimal value λ ^{∗} with the maximum AUC) and later used consistently for parameter estimation and pvalue calculation. Interaction pvalues were obtained for a given SNP selection and λ ^{∗} by resampling.
To generate SNP sets with reduced LD for p _{ c }based selection (Fig. 5 b), we used the pairwise LDbased pruning option of PLINK with window size of 50 kb and 5 SNPs for shifts along with r ^{2}thresholds of 0.5 and 0.3. The two threshold choices yielded SNP sets with m=180,354 and 117,976, respectively.
In performing the epigenetic state enrichment analysis, for each SNP considered, we used the 1000 Genomes reference haplotypes [72] of European individuals to build the list of correlated SNPs (LD r ^{2}>0.5). We then used the NIH Roadmap reference epigenome chromatin state annotations [50] to construct the distribution of epigenetic states within each group of LDcorrelated SNPs. We used the hidden Markov modelbased 15 stateannotations of 111 reference epigenomes, selecting 8 states [active transcription start site (TSS), flanking active TSS, transcription at gene 5’ and 3’, strong transcription, weak transcription, zinc fingerassociated, genic enhancers, and enhancers] to define the ‘active’ state, which contained the transcribed, promoter, and enhancer regions. For each SNP location, we calculated the fraction of LDcorrelated locations in active states within each cell type. This fraction was summed over the list of associated SNPs (m=96 in Fig. 7) to give the effective number of active states observed, and compared with the background active state frequency estimated over the whole genome for each epigenome. The overrepresentation pvalues were calculated by the binomial test.
Analogous calculations were performed for the SNP interaction enrichment analysis. We first selected statistically significant SNP pairs with p<10^{−3} from Fig. 6 (310 in total). We then considered each unique combination of two epigenomes (including selfinteractions) and, for each SNP pair selected, obtained the fraction of active stateactive state pairs with the two groups of LDcorrelated SNPs assumed to belong to the two epigenomes. This fraction was summed over the list of SNP pairs and compared with the background expected pair number (product of background active state frequencies from two tissues). Overrepresentation pvalues were calculated using the binomial test.
Pathwaybased SNP sets were generated for human pathways in Reactome database [51]. We compiled the list of all genes, assigned SNPs in the AMD genomewide set within 50 kb of the coding region to each gene, and collected SNPs corresponding to the gene set belonging to each pathway. Only those pathways with 20 or more SNPs were considered (1,732 in total). For most pathways with m<6×10^{3}, DDA independentSNP and collective inference (εoptimized MF) inferences were applied to each SNP set without further filtering to derive 5fold crossvalidation AUC. For larger pathways, p _{ c }based filtering was incorporated into crossvalidation to reduce the model sizes.
Abbreviations
 AMD:

agerelated macular degeneration
 AUC:

area under the curve
 CFH:

complement factor H
 ChREBP:

carbohydrate response elementbinding protein
 DDA:

discrete discriminant analysis
 EE:

exact enumeration
 GeDI:

genotype distributionbased inference
 GWAS:

genomewide association study
 LD:

linkage disequilibrium
 MDA:

malondialdehyde
 MF:

mean field
 MSC:

mesenchymal stem cell
 pAUC:

pseudoAUC
 PL:

pseudolikelihood
 PRR:

patternrecognition receptor
 RPE:

retina pigment epithelium
 SASP:

senescenceassociated secretory phenotype
 SNP:

singlenucleotide polymorphism
 TLR:

tolllike receptor
 TSS:

transcription start site
References
 1
Kim YA, Wuchty S, Przytycka TM. Identifying causal genes and dysregulated pathways in complex diseases. PLoS Comput Biol. 2011; 7:e1001095.
 2
Haines JL, Hauser MA, Schmidt S, Scott WK, Olson LM, Gallins P, et al.Complement factor H variant increases the risk of agerelated macular degeneration. Science. 2005; 308:419–21.
 3
Edwards AO, Ritter R, Abel KJ, Manning A, Panhuysen C, Farrer LA. Complement factor H polymorphism and agerelated macular degeneration. Science. 2005; 308:421–4.
 4
Wang WY, Barratt BJ, Clayton DG, Todd JA. Genomewide association studies: theoretical and practical concerns. Nat Rev Genet. 2005; 6:109–18.
 5
The Wellcome Trust Case Control Consortium. Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007; 447:661–78.
 6
Gold B, Kirchhoff T, Stefanov S, Lautenberger J, Viale A, Garber J, et al.Genomewide association study provides evidence for a breast cancer risk locus at 6q22.33. Proc Natl Acad Sci USA. 2008; 105:4340–5.
 7
BergeronSawitzke J, Gold B, Olsh A, Schlotterbeck S, Lemon K, Visvanathan K, et al. Multilocus analysis of agerelated macular degeneration. Eur J Hum Genet. 2009; 17:1190–9.
 8
Manolio TA. Bringing genomewide association findings into clinical use. Nat Rev Genet. 2013; 14:549–58.
 9
Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, et al.The NHGRI GWAS Catalog, a curated resource of SNPtrait associations. Nucleic Acids Res. 2014; 42:D1001–1006.
 10
Ripke S, Neale BM, Corvin A, Walters JT, Farh KH, Holmans PA, et al. Biological insights from 108 schizophreniaassociated genetic loci. Nature. 2014; 511:421–7.
 11
Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015; 518:197–206.
 12
Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010; 34:188–93.
 13
Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, OrhoMelander M, et al. Testing for an unusual distribution of rare variants. PLoS Genet. 2011; 7:e1001322.
 14
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rarevariant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011; 89:82–93.
 15
Clarke GM, Anderson CA, Pettersson FH, Cardon LR, Morris AP, Zondervan KT. Basic statistical analysis in genetic casecontrol studies. Nat Protoc. 2011; 6:121–133.
 16
Cordell HJ. Detecting genegene interactions that underlie human diseases. Nat Rev Genet. 2009; 10:392–404.
 17
Wei WH, Hemani G, Haley CS. Detecting epistasis in human complex traits. Nat Rev Genet. 2014; 15:722–33.
 18
Yu K, Xu J, Rao DC, Province M. Using treebased recursive partitioning methods to group haplotypes for increased power in association studies. Ann Hum Genet. 2005; 69:577–89.
 19
Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, et al. Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer. Am J Hum Genet. 2001; 69:138–47.
 20
Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, et al. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006; 241:252–61.
 21
Roshan U, Chikkagoudar S, Wei Z, Wang K, Hakonarson H. Ranking causal variants and associated regions in genomewide association studies by the support vector machine and random forest. Nucleic Acids Res. 2011; 39:e62.
 22
Pan Q, Hu T, Moore JH. Epistasis, complexity, and multifactor dimensionality reduction. Methods Mol Biol. 2013; 1019:465–77.
 23
Zhang Q, Long Q, Ott J. AprioriGWAS, a new pattern mining strategy for detecting genetic variants associated with disease through interaction effects. PLoS Comput Biol. 2014; 10:e1003627.
 24
Fan R, Zhong M, Wang S, Zhang Y, Andrew A, Karagas M, et al.Entropybased information gain approaches to detect and to characterize genegene and geneenvironment interactions/correlations of complex diseases. Genet Epidemiol. 2011; 35:706–21.
 25
Gauderman WJ, Murcray C, Gilliland F, Conti DV. Testing association between disease and multiple SNPs in a candidate gene. Genet Epidemiol. 2007; 31:383–95.
 26
Gao Q, He Y, Yuan Z, Zhao J, Zhang B, Xue F. Gene or regionbased association study via kernel principal component analysis. BMC Genet. 2011; 12:75.
 27
Cai M, Dai H, Qiu Y, Zhao Y, Zhang R, Chu M, et al. SNP set association analysis for genomewide association studies. PLoS ONE. 2013; 8:e62495.
 28
Herold C, Steffens M, Brockschmidt FF, Baur MP, Becker T. INTERSNP: genomewide interaction analysis guided by a priori information. Bioinformatics. 2009; 25:3275–81.
 29
Wu X, Dong H, Luo L, Zhu Y, Peng G, Reveille JD, et al. A novel statistic for genomewide interaction analysis. PLoS Genet. 2010; 6:e1001131.
 30
Ueki M, Cordell HJ. Improved statistics for genomewide interaction analysis. PLoS Genet. 2012; 8:e1002625.
 31
Hemani G, Shakhbazov K, Westra HJ, Esko T, Henders AK, McRae AF, et al. Detection and replication of epistasis influencing transcription in humans. Nature. 2014; 508:249–53.
 32
Park MY, Hastie T. Penalized logistic regression for detecting gene interactions. Biostatistics. 2008; 9:30–50.
 33
Wu TT, Chen YF, Hastie T, Sobel E, Lange K. Genomewide association analysis by lasso penalized logistic regression. Bioinformatics. 2009; 25:714–21.
 34
Ayers KL, Cordell HJ. SNP selection in genomewide and candidate gene studies via penalized logistic regression. Genet Epidemiol. 2010; 34:879–91.
 35
Efron B. The efficiency of logistic regression compared to normal discriminant analysis. J Am Stat Assoc. 1975; 70:892–8.
 36
Press SJ, Wilson S. Choosing between logistic regression and discriminant analysis. J Am Statist Assoc. 1978; 73:699–705.
 37
Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction, 2nd ed. New York: Springer; 2011. http://statweb.stanford.edu/~tibs/ElemStatLearn.
 38
Jombart T, Devillard S, Balloux F. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet. 2010; 11:94.
 39
Wang K, Li M, Hakonarson H. Analysing biological pathways in genomewide association studies. Nat Rev Genet. 2010; 11:843–54.
 40
Swaroop A, Chew EY, Rickman CB, Abecasis GR. Unraveling a multifactorial lateonset disease: from genetic susceptibility to disease mechanisms for agerelated macular degeneration. Annu Rev Genomics Hum Genet. 2009; 10:19–43.
 41
Chen W, Stambolian D, Edwards AO, Branham KE, Othman M, Jakobsdottir J, et al. Genetic variants near TIMP3 and highdensity lipoproteinassociated loci influence susceptibility to agerelated macular degeneration. Proc Natl Acad Sci USA. 2010; 107:7401–6.
 42
Gold B, Merriam JE, Zernant J, Hancox LS, Taiber AJ, Gehrs K, et al. Variation in factor B (BF) and complement component 2 (C2) genes is associated with agerelated macular degeneration. Nat Genet. 2006; 38:458–62.
 43
Fritsche LG, Chen W, Schu M, Yaspan BL, Yu Y, Thorleifsson G, et al. Seven new loci associated with agerelated macular degeneration. Nat Genet. 2013; 45:433–9.
 44
Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for wholegenome association and populationbased linkage analyses. Am J Hum Genet. 2007; 81:559–75.
 45
Wu TT, Chen YF, Hastie T, Sobel E, Lange K. Genomewide association analysis by lasso penalized logistic regression. Bioinformatics. 2009; 25:714–21.
 46
Wilks SS. The largesample distribution of the likelihood ratio for testing composite hypotheses. Ann Math Stat. 1937; 9:60–2.
 47
He Q, Lin DY. A variable selection method for genomewide association studies. Bioinformatics. 2011; 27:1–8.
 48
Wood AR, Tuke MA, Nalls MA, Hernandez DG, Bandinelli S, Singleton AB, et al.Another explanation for apparent epistasis. Nature. 2014; 514:3–5.
 49
Ward LD, Kellis M. Interpreting noncoding genetic variation in complex traits and human disease. Nat Biotechnol. 2012; 30:1095–106.
 50
Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, HeraviMoussavi A, et al.Integrative analysis of 111 reference human epigenomes. Nature. 2015; 518:317–30.
 51
Fabregat A, Sidiropoulos K, Garapati P, Gillespie M, Hausmann K, Haw R, et al.The Reactome pathway Knowledgebase. Nucleic Acids Res. 2016; 44:D481–487. http://www.reactome.org.
 52
Ambati J, Fowler BJ. Mechanisms of agerelated macular degeneration. Neuron. 2012; 75:26–39.
 53
Sun M, Finnemann SC, Febbraio M, Shan L, Annangudi SP, Podrez EA, et al.Lightinduced oxidation of photoreceptor outer segment phospholipids generates ligands for CD36mediated phagocytosis by retinal pigment epithelium: a potential mechanism for modulating outer segment phagocytosis under oxidant stress conditions. J Biol Chem. 2006; 281:4222–30.
 54
Weismann D, Binder CJ. The innate immune response to products of phospholipid peroxidation. Biochim Biophys Acta. 2012; 1818:65–2475.
 55
Weismann D, Hartvigsen K, Lauer N, Bennett KL, Scholl HP, Charbel Issa P, et al.Complement factor H binds malondialdehyde epitopes and protects from oxidative stress. Nature. 2011; 478:76–81.
 56
Flannagan RS, Jaumouille V, Grinstein S. The cell biology of phagocytosis. Annu Rev Pathol. 2012; 7:61–98.
 57
Postic C, Dentin R, Denechaud PD, Girard J. ChREBP, a transcriptional regulator of glucose and lipid metabolism. Annu Rev Nutr. 2007; 27:179–92.
 58
Wang Y, Viscarra J, Kim SJ, Sul HS. Transcriptional regulation of hepatic lipogenesis. Nat Rev Mol Cell Biol. 2015; 16:678–89.
 59
Ferrington DA, Sinha D, Kaarniranta K. Defects in retinal pigment epithelial cell proteolysis and the pathology associated with agerelated macular degeneration. Prog Retin Eye Res. 2016; 51:69–89.
 60
Leroux MR, Hartl FU. Protein folding: versatility of the cytosolic chaperonin TRiC/CCT. Curr Biol. 2000; 10:R260–264.
 61
MunozEspin D, Serrano M. Cellular senescence: from physiology to pathology. Nat Rev Mol Cell Biol. 2014; 15:482–96.
 62
Taylor RC, Cullen SP, Martin SJ. Apoptosis: controlled demolition at the cellular level. Nat Rev Mol Cell Biol. 2008; 9:231–41.
 63
Dunaief JL, Dentchev T, Ying GS, Milam AH. The role of apoptosis in agerelated macular degeneration. Arch Ophthalmol. 2002; 120:1435–42.
 64
Bhattacharya S, Chaum E, Johnson DA, Johnson LR. Agerelated susceptibility to apoptosis in human retinal pigment epithelial cells is triggered by disruption of p53Mdm2 association. Invest Ophthalmol Vis Sci. 2012; 53:8350–66.
 65
Salvesen GS, Duckett CS. IAP proteins: blocking the road to death’s door. Nat Rev Mol Cell Biol. 2002; 3:401–10.
 66
Kauppinen A, Paterno JJ, Blasiak J, Salminen A, Kaarniranta K. Inflammation and its role in agerelated macular degeneration. Cell Mol Life Sci. 2016; 73:1765–86.
 67
Toomey CB, Kelly U, Saban DR, Bowes Rickman C. Regulation of agerelated macular degenerationlike pathology by complement factor H. Proc Natl Acad Sci USA. 2015; 112:E3040–3049.
 68
Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, et al.Directcoupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci USA. 2011; 108:E1293–1301.
 69
Fan J, Lv J. A selective overview of variable selection in high dimensional feature space. Stat Sin. 2010; 20:101–48.
 70
Besag J. Statistical analysis of nonlattice data. Statistician. 1975; 24:179–95.
 71
Aurell E, Ekeberg M. Inverse Ising inference using all the data. Phys Rev Lett. 2012; 108:090201.
 72
Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012; 491:56–65.
Acknowledgements
The GWAS data used were obtained from the National Eye Institute Study of the AgeRelated Macular Degeneration (NEIAMD) Database found at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000684.v1.p1. We thank NEIAMD participants and the NEIAMD Research Group for their valuable contribution to this research. HJW thanks Marianne Spevak and Joy Hoffman. Parts of the computation were performed on the highperformance computing resources at the U.S. Air Force Research Laboratory, U.S. Army Research Laboratory, and U.S. Army Engineer Research and Development Center.
This work was supported by the U.S. Army Medical Research and Materiel Command (Ft. Detrick, Maryland). The opinions and assertions contained herein are the private views of the authors and are not to be construed as official or as reflecting the views of the U.S. Army or of the U.S. Department of Defense. This paper has been approved for public release with unlimited distribution.
Availability of data and materials
The source code and documentation of the software (GeDI; genotype distributionbased inference) implementing the algorithm are freely available: the archived version at http://dx.doi.org/10.5281/zenodo.32630 and most recent version at http://github.com/BHSAI/GeDI; programming language: C++; license: GNU GPL.
Authors’ contributions
HJW and JR conceived the study. HJW derived formulas/implemented the algorithm, wrote the software, collected data, and performed the analyses. CY, KK, BG, and JR participated in the data collection and analyses. HJW and JR wrote the manuscript. CY and KK tested the software. All authors read and approved the manuscript.
Competing interests
The authors declare that they have no competing interests.
Author information
Affiliations
Corresponding author
Additional information
An erratum to this article can be found at http://dx.doi.org/10.1186/s1286401630952.
Additional files
Additional file 1
Supplementary material. Text S1. Mathematical formulation of inference algorithms. Table S1. IndependentSNP inference comparison of logistic regression (from PLINK) and DDA (GeDI). (PDF 283 kb)
Additional file 2
Figure S1. Inference properties of a single independent SNP. Log odds ratio (OR) and power (level of significance 0.05) inferred from casecontrol data of size n=2n _{0}=2n _{1} using DDA (analytic) and logistic regression (numerical) are shown for the dominant model. Equation (S29) of Text S1 was used. The minor allele frequency ϕ _{ y } for control and case groups were set such that \(f^{(y)}=2\phi _{y}(1\phi _{y})+{\phi _{y}^{2}}=\phi _{y}(2\phi _{y})\). We used ϕ _{ y }=(0.1,0.25) such that f ^{(y)}=(0.19,0.4375) and h ^{(y)}=(−1.45,−0.25) for control and case groups, respectively, and β=1.1987. (PDF 6.99 kb)
Additional file 3
Figure S2. Examples of true versus inferred parameters. a Dominant model with m=10 SNPs and inference on a sample of size n=10^{3}. b Genotypic model with m=10 SNPs and inference on a sample of size n=10^{5}. In all cases, the penalizer value was determined by crossvalidation. (PDF 18.1 kb)
Additional file 4
Figure S3. Determination of penalizer λ via crossvalidation. The data set is one realization of simulations shown in Fig. 2 b and the inference is with the exact enumeration (EE) method. The minima in mean square error (a) and the maxima in AUC (b) shift to lower λ as sample size n increases. (PDF 6.17 kb)
Additional file 5
Figure S4. Distributions of interaction likelihood ratio statistics under the null hypothesis. Empirical cumulative distribution functions (CDF) in terms of the interaction statistics q were obtained by resampling. Simulation conditions were as described in Fig. 2 e and inferences used EE. (PDF 7.39 kb)
Additional file 6
Figure S5. Wholegenome pvalue profile of AMD data. IndependentSNP DDA with genotypic model was used. (PDF 25.7 kb)
Additional file 7
Figure S6. Regional views of AMD data. IndependentSNP DDA results (light blue) are compared to logistic regression from PLINK (yellow). Genotypic model was used. (PDF 10.9 kb)
Additional file 8
Figure S7. Marginal pairwise interaction pvalues. PLINK epistatic module was used to m=96 AMD SNPs. SNP pairs with strongest pvalues near HTRA1 have p∼10^{−9}. Genotypic model was used. The bottom panel shows the independentSNP pvalues. (PDF 54.6 kb)
Additional file 9
Figure S8. Linkage disequilibrium r ^{2} within m=96 AMD SNPs from PLINK. (PDF 48.3 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Woo, H., Yu, C., Kumar, K. et al. Genotype distributionbased inference of collective effects in genomewide association studies: insights to agerelated macular degeneration disease mechanism. BMC Genomics 17, 695 (2016). https://doi.org/10.1186/s1286401628713
Received:
Accepted:
Published:
Keywords
 Genomewide association
 Machine learning
 Epistasis
 Singlenucleotide polymorphism
 Agerelated macular degeneration