The genomic signature of trait-associated variants

Background: Genome-wide association studies have identified thousands of SNP variants associated with hundreds of phenotypes. For most associations the causal variants and the molecular mechanisms underlying pathogenesis remain unknown. Exploration of the underlying functional annotations of trait-associated loci has shed some light on their potential roles in pathogenesis. However, there are some shortcomings of the methods used to date, which may undermine efforts to prioritize variants for further analyses. Here, we introduce and apply novel methods to rigorously identify annotation classes showing enrichment or depletion of trait-associated variants, taking into account the underlying associations due to co-location of different functional annotations and linkage disequilibrium.

Results: We assessed enrichment and depletion of variants in publicly available annotation classes such as genic regions, regulatory features, measures of conservation, and patterns of histone modifications. We used logistic regression to build a multivariate model that identified the most influential functional annotations for trait-association status of genome-wide significant variants. SNPs associated with all of the enriched annotations were 8 times more likely to be trait-associated variants than SNPs annotated with none of them. Annotations associated with chromatin state together with prior knowledge of the existence of a local expression QTL (eQTL) were the most important factors in the final logistic regression model. Surprisingly, despite the widespread use of evolutionary conservation to prioritize variants for study, we find only modest enrichment of trait-associated SNPs in conserved regions.

Conclusion: We established odds ratios of functional annotations that are more likely to contain significantly trait-associated SNPs, for the purpose of prioritizing GWAS hits for further studies. Additionally, we estimated the relative and combined influence of the different genomic annotations, which may facilitate future prioritization methods by adding substantial information.


Background
Genome-wide association studies (GWAS) have been successful in discovering associated variants for a wide range of common diseases and traits [1,2]. More than 8,000 trait associations have been recorded to date (as of 11 Jan 2013). They can be divided into significant associations passing the generally accepted genome-wide threshold (P-value = 5 × 10⁻⁸) [3], or suggestive associations with a decreased significance threshold (P-value = 5 × 10⁻⁵ to 5 × 10⁻⁸). Larger association studies and increasingly informative genotyping arrays together with high-throughput sequencing are expected to confirm some of the associations currently considered as suggestive and identify many new associations [4]. For confirmed associations, experiments identifying the causal underlying biology are expensive in both time and money. This causes a bottleneck in elucidating the molecular processes and pathways underlying these associations [5][6][7][8] and hence in gaining new biological knowledge. There has therefore been much interest in computational prioritization of candidate variants, both to accelerate the search for causal variants, and to provide insights into the biology underlying disease states.
Although confirmed trait-associated SNPs will most often not be the causal variants, the surrounding genomic regions in linkage disequilibrium (LD) with associated SNPs are expected to contain causal variants with biological function. While it is clear that trait-associated SNPs are enriched in genic regions, the majority of trait-associated variants are not within genes [7,9]. Two studies have previously investigated trait-associated SNP enrichment in a range of genomic features [8,10]. Hindorff et al. (2009) investigated 20 genomic features for enrichment or depletion of trait-associated SNPs [8]. They found nonsynonymous sites and 5 Kb regions upstream of transcription start sites significantly enriched for trait-associated SNPs, while intergenic regions were significantly depleted [8]. Knight et al. (2011) [10] replicated the significant enrichment results of Hindorff et al. [8] and additionally found that cis expression quantitative trait loci (cis-eQTLs) were enriched for associated variants. Studies focusing on the epigenomic landscape surrounding the variants, such as DNA methylation [11] and histone modification patterns [12], have shown SNPs and DNA modifications jointly influencing transcription of nearby genes. Recently, combinatorial patterns of histone modifications were found to indicate regions with particular functions, ranging from active promoters or enhancers to transcriptionally silenced loci [13]. Ernst et al. [13] also showed that GWAS variants associated with diseases showing lineage-specific phenotypes were enriched in enhancer regions predicted using chromatin data from similar cells, e.g. acute lymphoblastic leukemia variants were enriched within strong enhancers found in T cells.
Alternative approaches to functional enrichment analyses have the potential to provide additional insight into the influence of genomic features on trait associations. To study enrichment or depletion, both Knight et al. [10] and Hindorff et al. [8] compared the annotations of associated variants to background sets of SNPs present on the original GWAS genotyping platforms. Hindorff et al. [8] generated 100 randomly sampled background SNP sets, weighted to approximate the composition of the genotyping platforms originally used to uncover the associations. Knight et al. [10] calculated enrichments based upon backgrounds composed of all SNPs from two popular genotyping platforms (the Affymetrix 500 K platform, the Illumina HumanHap 550 K platform, and the union of these two platforms). These approaches have important caveats. Firstly, the platform, or combination of platforms, used to detect an association is not always recorded, as shown by the GWAS catalogue [5]. Secondly, the underlying distributions of functional genomics features and SNPs are ignored, although it is known that their distributions in the human genome are often non-uniform and clustered [14]. Sampling randomly selected SNPs implicitly assumes that SNPs occur uniformly across the genome, which may result in misleading conclusions. It is also unclear what level of sampling is sufficient to produce an appropriate null distribution for a given set of variants. If we aim to assess the significance of the co-occurrence of associated SNPs and genomic features, an appropriate background SNP set should reflect the degree to which SNPs and genomic features are clustered and occur in the genome. Finally, previous studies have failed to take account of the often strong inter-dependencies between different genomic features, e.g. the associations between chromatin structure, gene density and evolutionary divergence rates [15]. These inter-dependencies make it difficult to disentangle the relative importance of individual genomic features when analyzed separately.
Here, we investigated the genomic signature of 1,909 significantly trait-associated SNPs (P-value < 5 × 10⁻⁸) by analyzing the overlaps between regions annotated for 58 genomic features with the associated variants and their LD SNP partners. We used a novel circular permutation approach to assess the significance of the observed results and to calculate enrichment or depletion scores for each genomic feature. Our permutation approach preserves the observed distribution of annotations and SNPs around the genome, and establishes a robust null distribution from which the significance of the observed enrichments and depletions can be calculated. We compared the permutation results with results obtained by a sampling strategy based on Hindorff et al. [8], which randomly samples genotyping platforms and SNPs from the HapMap II project present in CEU (Caucasian population). In addition to examining the annotations investigated by Hindorff et al. [8] and Knight et al. [10], we included 15 different annotations relating to chromatin states associated with regulatory regions [13], eQTLs [16], higher order chromatin structure [17] and regions with identified evolutionary signatures [18]. Most of the annotations we examined are correlated with at least some of the others, prompting us to investigate their combined effects. We applied stepwise logistic regression in order to derive a minimum set of enriched or depleted annotations that jointly influence trait-association status. The logistic regression approach accounts for any redundant information carried by individual variables, for example due to co-location of different functional annotations. Additional annotations are only included if they add information that is not explained by other annotations that are already in the model, resulting in a final model that incorporates the most important variables only.
All analyses performed took the underlying LD structure into account, as all analyzed SNPs, trait-associated and non-associated, were investigated with their LD partners at the chosen LD cut-off. The enrichment/depletion and logistic regression analyses were repeated with another SNP set consisting of 2,410 suggestively trait-associated SNPs (P-value between 5 × 10⁻⁵ and 5 × 10⁻⁸). The results shed new light on the genomic architecture of trait-associated SNPs and may be useful to aid prioritization of associated variants for further study and as prior weighting for association studies.
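The idea behind a circular permutation test can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes SNPs are represented as two equal-length 0/1 vectors in genomic order (trait-associated or not, annotated or not), rotates one vector against the other by a random offset, and ignores LD partners. The function name, the toy data and the add-one P-value correction are assumptions made for illustration only.

```python
import random


def circular_permutation_test(is_associated, is_annotated, n_perm=2000, seed=0):
    """Empirical enrichment test by rotating one SNP-ordered vector against another.

    is_associated, is_annotated: equal-length 0/1 sequences over SNPs in genomic order.
    Returns (observed_overlap, mean_null_overlap, two_sided_p).
    """
    n = len(is_associated)
    if n != len(is_annotated):
        raise ValueError("vectors must have the same length")
    rng = random.Random(seed)
    associated_idx = [i for i, flag in enumerate(is_associated) if flag]

    def overlap(shift):
        # Count associated SNPs that land on an annotated SNP after rotation.
        return sum(is_annotated[(i + shift) % n] for i in associated_idx)

    observed = overlap(0)
    # Rotation keeps the clustering of both vectors intact; only their
    # relative alignment changes between permutations.
    null = [overlap(rng.randrange(1, n)) for _ in range(n_perm)]
    mean_null = sum(null) / n_perm
    # Two-sided: null draws at least as far from the null mean as the observation.
    extreme = sum(abs(x - mean_null) >= abs(observed - mean_null) for x in null)
    p_value = (extreme + 1) / (n_perm + 1)  # add-one correction avoids P = 0
    return observed, mean_null, p_value


# Toy data (hypothetical): 1,000 SNPs, one annotated block of 100 SNPs,
# and 10 associated SNPs clustered inside that block.
annotated = [1 if 100 <= i < 200 else 0 for i in range(1000)]
associated = [1 if 100 <= i < 150 and i % 5 == 0 else 0 for i in range(1000)]
observed, mean_null, p_value = circular_permutation_test(associated, annotated)
print(observed, round(mean_null, 2), round(p_value, 4))
```

Because each permutation is a rotation rather than a reshuffle, neighbouring SNPs stay neighbours, which is the property that lets the null distribution reflect clustered annotations and SNPs.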

Confirmation of functional enrichments by two independent methods
Figures 1, 2, and 3 display the results obtained by circular permutations and the sampling method for 54 annotations, summarized according to annotation class. Four of the annotations were excluded from further analyses, as their coverage by all analyzed SNPs was very low and thus not informative. Summary statistics for the analyzed annotations, including the number of observed sites, the percentage of total nucleotides covered in the genome, the percentage of SNPs covered in the genome, and the average length of the annotated sites in base pairs, were calculated for all annotations (Additional file 1). Odds ratios, which indicate enrichment/depletion of trait-associated SNPs, were calculated for each annotation. An odds ratio equal to unity indicated that trait-associated SNPs were as likely to coincide with the analyzed genomic feature as non-associated SNPs. An odds ratio above unity indicated that the genomic feature was enriched for trait-associated SNPs, while odds ratios below unity were evidence for depletion. Fold enrichment and odds ratios were approximately equivalent (see Additional file 2 and Additional file 3). In general, the odds ratios for a particular annotation were very similar between the sampling and permutation approaches, with odds ratios correlating strongly (r² = 0.98, P-value = 1.01 × 10⁻⁵¹). However, two of the 54 annotations (vega PseudoGenes and inactive/poised promoters) differed in significance between the two methods. The odds ratios associated with these two annotations are almost identical, which means that the differences are due to different confidence intervals obtained by the two methods. Table 1 shows a comparison of odds ratios and P-values obtained using the permutation and sampling methods on the significantly and suggestively trait-associated SNPs. Table 2 shows the average of the CI widths for each of the three annotation classes per method.
Note that P-values from permutation were truncated at <5.00 × 10⁻⁵ due to the number of permutations performed (i.e. 20,000). A more extreme threshold would not materially change the conclusions and each order of magnitude decrease in the threshold requires an order of magnitude increase in the number of permutations and hence computation. The effect of the annotation was declared significant if the observed P-value passed the significance threshold set by the Bonferroni correction for the number of annotations studied (P-value ≤ 8.62 × 10⁻⁴).
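The two thresholds quoted above follow from simple arithmetic, sketched here in Python. The family-wise α of 0.05 is an assumption; it is not stated explicitly in the text, but it reproduces the quoted Bonferroni value for 58 annotations. The variable and function names are illustrative.

```python
ALPHA = 0.05             # assumed family-wise error rate
N_ANNOTATIONS = 58       # annotations studied, as stated in the text
N_PERMUTATIONS = 20_000  # permutations performed, as stated in the text

# Bonferroni: divide the family-wise error rate by the number of tests.
bonferroni_threshold = ALPHA / N_ANNOTATIONS

# With N permutations, the smallest P-value that can be resolved is 1/N;
# anything more extreme can only be reported as "below 1/N".
smallest_resolvable_p = 1 / N_PERMUTATIONS


def is_significant(p_value, threshold=bonferroni_threshold):
    """True if a P-value passes the Bonferroni-corrected threshold."""
    return p_value <= threshold


print(f"{bonferroni_threshold:.2e}", f"{smallest_resolvable_p:.2e}")
```

Lowering the resolvable P-value tenfold would require 200,000 permutations, which is the trade-off between resolution and computation described above.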
The striking similarity of enrichment patterns seen overall between two independent methods provides strong evidence for the co-occurrence of trait-associated SNPs and genomic regions annotated with specific functional annotations. Of the 21 analyzed classes of annotation associated with genic features, only intergenic regions showed depletion. All other genomic features associated with genes were enriched to various degrees for trait-associated SNPs (Figure 1). The odds ratio for the eQTLs (4.42 [3.52-5.54]) was the highest significant odds ratio obtained for the random sampling approach. There is growing evidence in the scientific community that eQTLs influence complex traits by measurably changing expression levels of genes [16]. The differences in significance of the odds ratios in two of the genomic features between the two methods are most likely caused by the differences between theoretical and empirical confidence intervals (see Discussion, which highlights the value of the empirical method, circular permutations).
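To make the distinction between theoretical and empirical confidence intervals concrete, the following Python sketch computes an odds ratio from a 2 × 2 table together with a Woolf (log-normal) interval. The Woolf interval is one widely used asymptotic approximation and is assumed here for illustration; the approximation actually used is described in the paper's Methods. The counts are hypothetical.

```python
import math


def odds_ratio_woolf(a, b, c, d, z=1.96):
    """Odds ratio and Woolf confidence interval for a 2 x 2 table.

    a: trait-associated SNPs inside the annotation
    b: trait-associated SNPs outside the annotation
    c: background SNPs inside the annotation
    d: background SNPs outside the annotation
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the asymptotic normal approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the paper.
odds_ratio, lower, upper = odds_ratio_woolf(a=40, b=60, c=10, d=90)
print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

An interval that excludes 1 indicates enrichment (above 1) or depletion (below 1). For sparse annotations the small cell counts inflate the standard error, which is why such features yield wide intervals under either method.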
Chromatin states are a stronger predictor of trait association than sequence conservation
All of the 13 annotation classes associated with conservation and other evolutionary signatures were only modestly enriched using either method, showing odds ratios of less than two (Figure 2). Three of the annotations (evofold, vista enhancers and exapted repeats) failed to reach [19] significance. The negative set (Figure 2) is intended as an approximation to a negative control for the evolutionary signatures annotation class, and was composed of intergenic SNP data lacking any other genic or conserved/evolutionary annotations irrespective of the chromatin states present. However, almost half of trait-associated SNPs or their LD partners were found to be within this negative set (see Discussion), which could explain the modest depletion seen in that class. The relatively weak performance of evolutionary measures is a surprising result given the ubiquitous use of evolutionary conservation in computational variant prioritization approaches [20][21][22][23][24].
The results for 19 genomic annotations corresponding to distinct chromatin states are shown in Figure 3. Regions associated with a variety of states implicated in gene activation as identified by histone modifications were enriched for trait-associated SNPs. Strong enhancers of proximal genes (OR = 3.93 [2.88-5.  quarters of trait-associated SNPs or their LD partners were located in regions annotated as exhibiting a relatively 'open', de-condensed higher order chromatin structure (Additional file 1). Strong enhancers regulating distal genes (Figure 3) were also enriched for trait-associated SNPs, albeit less so than strong enhancers which regulate proximal genes (Figure 3). Conserved distal regulatory enhancers are frequently found at loci containing developmental genes [25,26]. The results presented here may therefore reflect the depletion of variants in such enhancers due to their detrimental effects upon developmental processes. Both the relatively repressive, 'closed' higher order chromatin domains and heterochromatin features show depletions. Repetitive/CNV regions obtained odds ratios close to one and therefore were not significantly enriched or depleted.
Analyses were repeated using a more liberal LD cut-off point of r² > 0.7 to determine LD partners. The results obtained from these data were similar to the ones obtained using r² > 0.9, with only a few annotations becoming significant (Additional file 4, Additional file 5 and Additional file 6).

Similar enrichment trends for significantly and suggestively associated SNPs
There has been substantial interest in the roles of GWAS variants showing 'suggestive' levels of significance (i.e. SNPs with P-values = 5 × 10⁻⁵ to 5 × 10⁻⁸), as they are believed to contain many true positives with modest effect sizes [27]. If that is correct, we might expect similarities in the functional enrichment patterns of these two classes of variants. Figure 4 highlights the 14 significant enrichment/depletion results of the suggestively associated SNPs in the annotations, nine of which are from the genic annotation category. The trends are similar to those observed for genome-wide significant SNPs, but with lower odds ratios (see Additional file 7). Additional file 8 shows the results of analyses of significantly and suggestively trait-associated SNPs for all annotations.
Table 3 presents the β-coefficient (the ratio of the estimated effect to its standard error) from the logistic regression models for the significantly and suggestively trait-associated SNPs in the final models. The significant β-coefficients are plotted in Figure 5. The annotations were ordered in terms of significance and effect size in the logistic regression for the significant trait-associated SNPs. Negative values implied that trait-associated SNPs were depleted in those regions, once the effects of the previously added annotations were taken into account. The final model for the significant SNPs included 25 annotations, 17 of which were significant. The OMIM morbid regions and OMIM genes were excluded from the analysis, as they were trait-associated by definition.
The results from logistic regression demonstrate the value of a comprehensive modeling approach that helps identify annotations providing independent information on the trait-association status of SNPs. Some of the individual annotations have highly significant effects in both the logistic regression and the enrichment analyses.
Nonetheless, the overall explanatory power of the final model, as evidenced by the pseudo-r² values (Table 4), is relatively limited. One might imagine that the power of predictive modeling might be enhanced by the inclusion of quantitative variables rather than the essentially binary variables used here. This was borne out by an examination of models including a quantitative estimate of the upstream proximity of GWAS hits to TSSs. Additional file 9 shows two histograms of A) the upstream   (Table 4). Additional file 12 shows the effect of including distance to TSS for six representative genomic annotations for the significantly trait-associated SNPs. The importance of a number of other annotations declined correspondingly, with the odds ratios for strong enhancers (proximal), eQTLs and active promoters reduced the most. The reduction of the odds ratio of eQTLs confirms previous results that eQTLs are usually found close to the transcription start site of genes [28].

Discussion
We analyzed the enrichment and depletion of significantly and suggestively trait-associated SNPs in genomic regions annotated for 58 different functional genomics features. For the significantly trait-associated SNPs we observed significant enrichment in genic annotations and several features associated with particular chromatin states. The enrichment in genic annotations has been well documented in previous studies [8,10], while there has been modest evidence for enrichment of trait-associated SNPs in regions with distinct chromatin structures [13]. However, the greatest insight is provided by logistic regression analysis, which evaluates the genomic features in terms of their influence on trait-association status in the context of the complete model for prediction of trait association status for SNPs.
[Figure caption, truncated] Table 2 for numeric values of all annotations. Solid symbols indicate odds ratios significantly different from 1 (p ≤ 0.05). Odds ratios below or above one show depletions or enrichments respectively.
Annotation classes associated with genes (Figure 1) showed enrichments and depletions comparable to previous studies. The highest significant enrichment was observed in gained stop codons obtained by the permutation method, but not the sampling method. This, and other sparsely annotated genomic features (i.e. with a low percentage of the genome covered; Additional file 1), resulted in large confidence intervals on the estimated odds ratio by either method. The observed differences in significance arise because the sampling method relied on theoretically determined P-values from a widely used asymptotic approximation (see Methods), whereas the circular permutations method determined them empirically (see Methods). Theoretical values were used for the computationally intensive sampling method because it was necessarily based on a limited number of random samples; this limitation does not apply to the permutation approach.
The confidence intervals derived by permutation are generally slightly more conservative (i.e. larger) than those from the sampling approach. This is consistent with the permutation approach taking appropriate account of non-random distributions of annotations and SNP locations. As expected, the genomic features enriched for significantly trait-associated SNPs included the OMIM morbid regions, identified as regions associated with traits in GWAS and linkage studies [29]; these regions may approximate a positive control.
Conserved elements and regions with other evolutionary signatures (Figure 2) usually exhibited significant though modest enrichment with odds ratios ranging from 1.64 to 1.97. Odds ratios for evofold, VISTA enhancers and exapted repeats were found to be not significant, but other conserved regions and evolutionary signatures showed odds ratios comparable to each other. The PREMOD [30] annotation is the only annotation obtained from a predictive algorithm that shows significant enrichment (OR = 1.64 [1.38-1.97]). Unlike other algorithms, it is not restricted to modules located proximal to genes, and mostly contains distal cis-regulatory module predictions [30]. The modest enrichment in conserved regions has implications for follow-up studies, as trait-associated SNPs in conserved regions tend to be prioritized for further studies [31] either manually or via algorithms [20,21,23,32]. Conserved regions have also been shown to add little information to the prediction of nucleotides acting as eQTLs, with no significant odds ratios of enrichment observed for conserved phastCons elements [33]. It is likely that many of the most conserved sites ('MCS', obtained through phastCons) are shared among the sets of sites identified using different alignments with a varying number of investigated species (Figure 2).
We detected enrichment for trait-associated SNPs in various chromatin states (Figure 3) associated with a variety of regulatory functions. The proximal and distal sets of the chromatin states influence expression of proximal and rather more distant distal genes, respectively [13]. The significant enrichment signal in the enhancer annotation is consistent with the results of Ernst et al. [13], who investigated GWAS results for immune and blood related phenotypes in chromatin data from the GM12878 lymphoblastic cell line.
The authors reported a two-fold enrichment for a combination of the proximal and distal strong enhancers for SNPs associated with leukemia, rheumatoid arthritis, and systemic lupus erythematosus [13]. We are able to confirm their observed enrichment of trait-associated SNPs and also observe enrichment signals with larger odds ratios in the strong enhancer set of proximal genes than in the strong enhancer set of distal genes. A third of the significantly trait-associated SNPs used here are associated with immunity-related traits, so a strong signal in the enhancer regions of a lymphoblastic cell line is intuitively reasonable. A clear difference is seen between open and closed domains of higher order chromatin. This was expected as closed chromatin is known to contain a somewhat higher density of SNPs [15], but is also likely to contain fewer trait-associated SNPs due to low gene density [17]. Open chromatin, however, is present at gene and regulatory feature dense areas [17] and is therefore more likely to harbor trait-associated variants. The suggestively trait-associated SNPs showed similar results to the significantly associated SNPs, but with more moderate odds ratios. This result is consistent with suggestively associated SNPs containing both false positives (which we would expect to have no bias towards particular annotations) and true associations, whose effects were not of sufficient magnitude to show genome-wide significance. These true positives would be expected to have the same bias towards particular genomic features as trait-associated SNPs of genome-wide significance [10].
Figure 5 Logistic regression identifies most influential annotations. Significant ratio between estimated effect and standard error of annotations in the logistic regression models for the significantly and suggestively trait-associated SNPs (○, Δ), sorted by decreasing significance in the logistic regression model for significantly trait-associated SNPs. The final models for the significant and suggestive trait-associated SNPs included 27 and 12 annotations, of which 17 and 6 were significant in the models, respectively. Corresponding values can be found in Table 4.
While significantly trait-associated SNPs are consistently documented, suggestive associations often remain unreported, since they are assumed to contribute less to our understanding of the underlying biology. Additionally, the NHGRI GWAS catalog only incorporates SNPs with association levels starting at 5 × 10⁻⁵, whereas the more commonly accepted level for suggestively associated SNPs starts at 5 × 10⁻⁴ [5,7]. This means that the significantly associated SNP set is likely to be a more comprehensive and complete SNP set, despite containing a smaller number of SNPs. A focus on exonic variants may also introduce bias towards certain genomic features, such as genic regions, which may affect the results for the suggestively associated SNPs more than the significantly associated SNPs, as there are more of the latter. However, the similarity of enrichment trends between the two sets suggests that this is not a major problem. The results are therefore encouraging and may be of use to aid further research into areas of the human genome surrounding suggestively trait-associated variants.
A combination of the two full logistic regression models identified six annotations that significantly influenced trait-association status for both significantly and suggestively associated SNPs. These annotations were open chromatin, eQTLs, exons, strong enhancers of proximal genes, vegaGenes, and gained stop codons. These results are biologically reasonable, as the disruption of coding regions of genes gives rise to different phenotypes. Open chromatin is, as mentioned before, densely populated by genes and regulatory features, while recent literature indicates that eQTLs are highly influential in causing phenotypic variation by regulating gene expression [16,28,34-37]. In the significantly trait-associated model, the conserved regions included were the most conserved elements identified in the primate lineage, followed by all conserved sites identified between 44 vertebrates. This suggests that these two levels of conservation are sufficiently different from each other to be separately included in the model. The previously observed trend of more moderate effects in the suggestively trait-associated SNP dataset was confirmed in all genome annotations, with the exception of the gained stop codons, which had a stronger effect on suggestively associated SNPs.
The majority of significantly trait-associated SNPs and their LD partners (55%) overlap with regions identified as containing genes listed in the vega database [29,38]. This percentage can be increased to 70% by adding the remaining 7 genic annotations found to influence significant trait-association status: eQTLs, exons, TSS 5 Kb upstream, gained stops, synonymous SNPs, non-synonymous SNPs and 5′ UTR. One or more of the conserved region annotations overlap with 72% of significantly trait-associated SNPs, which is reduced to 48% if the genomic features overlapping with the highest number of SNPs, positively selected genes and regions showing constraint in the accumulation of indels, are excluded from that analysis. These latter two genomic features contained the highest number of trait-associated SNPs in regions with evolutionary signatures. Most widely used prediction algorithms [20,32] already make use of conserved sites to predict trait-associated SNPs, but could possibly be improved if conserved indels were included in their predictive methods. The negative set overlapped with 47% of the trait-associated SNPs. It is, for example, possible that a SNP overlapping with conserved sites has LD partners that overlap with the negative sequence annotation.
The variables identified as informative in the logistic regression do indeed harbor many trait-associated SNPs on closer inspection. A small fraction (4%) of significantly trait-associated SNPs or their LD partners overlapped with none of the genomic features with a positive influence on trait-association status. In contrast, 23% of background SNPs did not overlap any of those genomic features (Additional file 13). The odds ratio, which was calculated for the observed distribution of significantly associated SNPs and their LD partners overlapping with the 12 identified annotations with a positive β-coefficient, was 7.70 [6.09-9.73], with a P-value of 6.19 × 10⁻¹²⁶. The 4.2% of the trait-associated SNPs that are not explained by those 12 annotations overlap mainly with heterochromatin and intergenic regions. The chromatin states defined by Ernst et al. [13] cover the entire genome, so that all trait-associated SNPs co-occur with at least one of the states.
Table 4 shows the McFadden and the McKelvey and Zavoina pseudo-r² values for the empty and full models with and without distance to TSS for suggestively and significantly trait-associated SNPs. The logistic regression model without the distance to TSS for the significantly associated SNPs explained 11-25% of the observed variance, which was an increase of 4-11% when compared to the empty model, which only included the effects of the genotyping arrays. An ANOVA test, using a chi-squared test, showed the difference between the two models to be significant (Deviance = 1501.00, P-value = 3.13 × 10⁻³⁰⁹). The difference between the two models for the suggestively associated SNPs was also significant, albeit less so (Deviance = 113.06, P-value = 1.49 × 10⁻¹⁸). The pseudo-r² values for the model including the distance to TSS range from 12-42% depending on the method used to calculate the value.
Although the full model without the distance to TSS variable is a substantial improvement on the empty model, the pseudo-r² suggests much variance remains unexplained. This is hardly surprising when it is considered that the data contains millions of SNPs which are functionally annotated, either directly or through LD partners, but which are not known to be trait-associated. This indicates that there is quite some way to go before one can use annotation information to predict trait-association status with any confidence. Improvements in the accuracy and precision of annotation will undoubtedly help; for example, the resolution of conserved regions or chromatin states can be expected to improve over time. Such improvement, combined with better information on SNP trait association status such as effect size or size and power of individual studies, might further speed progress towards models that are better able to predict functionally relevant SNPs, either for focused functional studies, or for inclusion in health prediction algorithms. Additionally, the investigation using distance to TSS highlights the importance of quantitative variables, which may be a future avenue to take, once these annotations become available.
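The model comparison discussed above can be sketched in Python for readers unfamiliar with pseudo-r² measures. The sketch assumes the standard McFadden definition, 1 − lnL(full)/lnL(null), and the usual deviance difference between nested models, 2 × (lnL(full) − lnL(reduced)). The outcomes and predicted probabilities are hypothetical, and the null model here is intercept-only, whereas the paper's empty model also included genotyping array effects.

```python
import math


def log_likelihood(outcomes, probabilities):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(
        math.log(p) if y else math.log(1 - p)
        for y, p in zip(outcomes, probabilities)
    )


def mcfadden_r2(ll_full, ll_null):
    """McFadden pseudo-r2: 1 - lnL(full) / lnL(null)."""
    return 1.0 - ll_full / ll_null


def deviance_difference(ll_full, ll_reduced):
    """Likelihood-ratio statistic comparing two nested models."""
    return 2.0 * (ll_full - ll_reduced)


# Hypothetical data: 1 = trait-associated SNP, 0 = background SNP.
y = [1, 0, 0, 1, 0, 0, 0, 1]
p_null = [sum(y) / len(y)] * len(y)                # intercept-only prediction
p_full = [0.8, 0.2, 0.1, 0.7, 0.3, 0.2, 0.1, 0.6]  # made-up model output

ll_null = log_likelihood(y, p_null)
ll_full = log_likelihood(y, p_full)
print(round(mcfadden_r2(ll_full, ll_null), 3),
      round(deviance_difference(ll_full, ll_null), 3))
```

Under the null hypothesis the deviance difference is compared against a chi-squared distribution with degrees of freedom equal to the number of added parameters, which is the test reported for the empty and full models.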

Conclusion
We have identified genomic features which are significantly enriched or depleted for both significantly and suggestively trait-associated SNPs. Additionally, we were successful in assigning weights to 17 genomic features, which indicate their relative influence on trait-association status of GWAS hits significant at the genome-wide level. These weights could be used to further prioritize GWAS hits as candidates for potential follow-up studies. The most informative and influential genomic features for significant trait-association status were regions associated with particular chromatin states, as identified using logistic regression. Conserved elements and regions with other evolutionary signatures were shown to have relatively weaker influence than either chromatin states or genic region annotations, once all other included genomic features were taken into account. Distance to transcription start site (TSS) was identified as an influential factor, where SNPs further away from the TSS were less likely to be significantly trait-associated. We have also identified four genomic features (synonymous SNPs, transcriptional elongation, 5′ UTRs and active promoters) that are enriched for significantly trait-associated SNPs in both the circular permutations and the sampling method, but show relative depletion in the logistic regression model, which looks at relative influences across the analyzed genomic features. This stresses the value of studying combined influences of the genomic features relative to each other, rather than separately. With the data in place, we can now investigate different trait-subsets and other co-occurrences within the genome.

Trait-associated SNPs
The significantly and suggestively trait-associated SNP sets were derived from the NHGRI GWAS catalogue, accessed 25 Aug 2011 [5]. This dataset reported 4,520 SNPs with at least one associated trait (5,800 reported associations in total) from 764 studies. A unique SNP is defined by the "rs" number of a SNP that is associated with at least one trait. The common genome-wide significance threshold (P < 5 × 10⁻⁸) was used to define 1,909 significantly trait-associated SNPs from 586 studies. The suggestively trait-associated SNP set was defined as SNPs with association P-values between 5 × 10⁻⁸ and 5 × 10⁻⁵. SNPs that were located on either the Y chromosome or unassigned chromosomes were removed from all analyses. SNPs in the suggestively associated set found to be in LD (r² > 0.9) with significant SNPs were removed from the dataset, resulting in 2,410 unique rs numbers from 412 studies.
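The filtering steps above could be sketched as follows. This is an illustration only: the record layout, the function name `build_snp_sets` and the handling of P-values lying exactly on a threshold are assumptions, and the LD map is assumed to be symmetric.

```python
GENOME_WIDE = 5e-8   # genome-wide significance threshold
SUGGESTIVE = 5e-5    # upper bound of the suggestive range

# Autosomes and the X chromosome; Y and unassigned chromosomes are dropped.
ALLOWED_CHROMOSOMES = {str(i) for i in range(1, 23)} | {"X"}


def build_snp_sets(records, ld_partners):
    """Split catalogue records into significant and suggestive rs numbers.

    records: iterable of (rsid, chromosome, p_value) tuples (assumed layout).
    ld_partners: dict mapping rsid -> set of rsids with r2 > 0.9 (symmetric).
    """
    significant, suggestive = set(), set()
    for rsid, chrom, p in records:
        if chrom not in ALLOWED_CHROMOSOMES:
            continue
        if p < GENOME_WIDE:
            significant.add(rsid)
        elif p <= SUGGESTIVE:  # boundary handling is an assumption
            suggestive.add(rsid)
    # A SNP reported at both levels is kept only in the significant set.
    suggestive -= significant
    # Drop suggestive SNPs that are in LD with any significant SNP.
    suggestive = {
        s for s in suggestive
        if not (ld_partners.get(s, set()) & significant)
    }
    return significant, suggestive
```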

Total number of analyzed SNPs
A list of 3,840,944 SNPs incorporated all SNPs that were included on different genotyping arrays and also those that are part of the HapMap CEU II data. The latter were included to account for SNPs that were identified as trait-associated through imputation in meta-analyses. The list included information on linkage disequilibrium (LD) partners of all SNPs (see below), the location of SNPs in the genome and the observed co-occurrences with annotations for the selected genomic features (see below). Autosomes and the X chromosome were analyzed in this study.

LD partners
The HapMap CEU II data were used to define LD partners of all SNPs. LD partners were defined as SNPs from the total set in LD (r² > 0.9) with the analyzed SNPs [39,40]. LD partners could lie up to 250 kb on either side of the SNP, so the maximum distance between two LD partners of any particular SNP was 500 kb. This r² threshold was chosen because SNPs in LD at that level are considered equivalent in trait-association studies [8].
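As a rough sketch of the LD partner definition above, assuming pairwise r² values are available as (SNP, SNP, r²) tuples and positions as a lookup table; both input layouts and the function name are hypothetical, not the format of the HapMap files.

```python
def find_ld_partners(pairs, positions, r2_min=0.9, max_dist=250_000):
    """Build a symmetric map from each SNP to its LD partners.

    pairs: iterable of (snp_a, snp_b, r2) tuples (assumed layout).
    positions: dict mapping snp -> (chromosome, base-pair position).
    A pair qualifies when r2 > r2_min and both SNPs lie on the same
    chromosome no more than max_dist base pairs apart.
    """
    partners = {}
    for a, b, r2 in pairs:
        if r2 <= r2_min:
            continue
        chrom_a, pos_a = positions[a]
        chrom_b, pos_b = positions[b]
        if chrom_a != chrom_b or abs(pos_a - pos_b) > max_dist:
            continue
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return partners
```

SNPs without any qualifying partner are simply absent from the returned map, which matches analyzing them on their own.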

Genomic features
This study analyzed three main categories of annotations: genic and regulatory features, regions with conserved or evolutionary signatures, and chromatin states. Additional file 14 provides further details of the annotations and their sources. All annotations were downloaded in hg18, where available. If they were not available in hg18, the UCSC liftOver tool was used to transfer the annotated regions into hg18 [41]. A SNP was annotated as overlapping a particular genomic feature if it or any of its LD partners was located within the annotation. Trait-associated SNPs without LD partners were analyzed on their own. We also included a derived annotation as an approximation for a negative control: from the intergenic dataset we excluded sites overlapping with any form of evolutionary, regulatory or genic annotation examined here, so that it is negative for sequence annotation irrespective of epigenetic state.
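The overlap rule described above (a SNP counts as overlapping if it or any LD partner falls inside an annotated region) might be sketched as below. The data layout is an assumption: each annotation is stored per chromosome as sorted, merged (non-overlapping) half-open intervals, and the function names are hypothetical.

```python
from bisect import bisect_right


def in_intervals(pos, starts, ends):
    """True if pos lies in one of the half-open intervals [start, end).

    starts and ends are parallel lists, sorted and non-overlapping.
    """
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < ends[i]


def snp_overlaps(snp, partners, positions, annotation):
    """True if the SNP or any of its LD partners lies in the annotation.

    partners: dict snp -> set of LD partner ids (may lack the SNP).
    positions: dict snp -> (chromosome, base-pair position).
    annotation: dict chromosome -> (starts, ends) interval lists.
    """
    for s in {snp} | partners.get(snp, set()):
        chrom, pos = positions[s]
        if chrom in annotation and in_intervals(pos, *annotation[chrom]):
            return True
    return False
```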

Odds ratios
For a particular genomic feature, one overlap was counted if a trait-associated SNP or any of its LD partners co-occurred in a region annotated for that genomic feature. Non-overlaps were defined as the lack of co-occurrence between a trait-associated SNP or its LD partners and an annotation. Odds ratios were calculated to enable comparisons with previous studies [8], where an odds ratio was defined as the product of the overlaps of the real data and the non-overlaps of the sample data divided by the product of the non-overlaps of the real data and the overlaps of the sample data, i.e. $\mathrm{OR} = (\text{overlaps}_{\text{real}} \times \text{non-overlaps}_{\text{sample}}) / (\text{non-overlaps}_{\text{real}} \times \text{overlaps}_{\text{sample}})$. Odds ratios of enrichment/depletion were calculated by comparing overlaps between genomic features and real trait-associated data with overlaps of SNPs determined by chance alone. The 'epitools' package [42] of the statistical program R version 2.12.1 [43] was used for the calculations. Odds ratio P-values were considered significant when below the Bonferroni-corrected significance threshold, which in our case was calculated for 58 variables treated as independent (P ≤ 8.62 × 10⁻⁴). These annotations were not in fact independent of each other, which means that the Bonferroni-corrected threshold is conservative. Fold enrichment, the ratio of hits in the associated data to hits in the permuted data, was calculated for the significant SNP set and compared with the calculated odds ratio to aid the interpretation of results.
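The odds ratio definition above, together with the Bonferroni threshold and fold enrichment, can be illustrated with a short Python sketch. It uses the standard Wald (normal-approximation) confidence interval on the log odds ratio; it is not a reimplementation of the 'epitools' package, and the function names and example counts are hypothetical.

```python
import math


def odds_ratio_wald(real_overlap, real_non, sample_overlap, sample_non, z=1.96):
    """Odds ratio with a Wald confidence interval (z=1.96 gives ~95%).

    All four counts must be positive.
    """
    odds_ratio = (real_overlap * sample_non) / (real_non * sample_overlap)
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(
        1 / real_overlap + 1 / real_non + 1 / sample_overlap + 1 / sample_non
    )
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


def fold_enrichment(real_overlap, sample_overlap):
    """Ratio of hits in the associated data to hits in the comparison data."""
    return real_overlap / sample_overlap


# Bonferroni correction for 58 tests at a nominal level of 0.05.
BONFERRONI_THRESHOLD = 0.05 / 58

# Made-up counts for illustration only.
or_example, ci_low, ci_high = odds_ratio_wald(40, 60, 20, 80)
```

Dividing 0.05 by 58 reproduces the threshold of about 8.62 × 10⁻⁴ quoted in the text.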

Sampling genotyping SNP platforms
The sampling method was based on Hindorff et al. [8] and aimed to obtain sample sets of SNPs of equal size to the set of trait-associated SNPs represented on genotyping platforms. We used weighted groups based on the manufacturer(s) of the SNP platform(s) to draw the samples, rather than individual genotyping arrays, as that information was often unavailable. The numbers of SNPs drawn from each manufacturer group were proportional to the numbers of SNPs observed in the real data. The HapMap CEU II SNPs were included to account for the trait-associated SNPs obtained from GWAS using imputed genotypes. The LD partners were ascertained as for the trait-associated SNPs. Odds ratios, confidence intervals and P-values indicating the significance of the observed results were calculated using the oddsratio.wald() function from the 'epitools' R package. This function estimated the odds ratios by unconditional maximum likelihood, comparing the observed value with the mean number of hits across the 100 samples.
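A minimal sketch of the stratified sampling step, under the assumption that "proportional" means each manufacturer group contributes exactly as many SNPs as it has trait-associated SNPs in the real data. The group labels, input layout and function names are hypothetical.

```python
import random


def draw_matched_sample(pool_by_group, real_counts, rng):
    """Draw one sample matching the real data's per-group SNP counts.

    pool_by_group: dict group -> list of candidate SNP ids in that group.
    real_counts: dict group -> number of trait-associated SNPs observed.
    Sampling is without replacement within each group.
    """
    sample = []
    for group, n in real_counts.items():
        sample.extend(rng.sample(pool_by_group[group], n))
    return sample


def draw_samples(pool_by_group, real_counts, n_samples=100, seed=0):
    """Draw n_samples independent matched samples with a fixed seed."""
    rng = random.Random(seed)
    return [
        draw_matched_sample(pool_by_group, real_counts, rng)
        for _ in range(n_samples)
    ]
```

The mean number of annotation hits across the returned samples would then serve as the comparison value for the odds ratio.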

Chromosome-bound circular permutations
A novel permutation approach was applied, which preserved the internal structure of the datasets in terms of relative distance between SNPs, the observed clustering of annotations, and the LD structure around SNPs. GWAS hit status (whether or not a SNP appears in the NHGRI GWAS catalogue with the required P-value) was established for a list of 3,840,944 known HapMap CEU II SNPs on the autosomes and the X chromosome, with information on LD (r² > 0.9) partners for each SNP on the list. For each permutation a randomly generated number, drawn from a uniform distribution between one and the number of SNPs on the analyzed chromosome, was used to shift the trait-association status of all SNPs within that chromosome. Permutations were circular within chromosomes: where a shift of status ran past the last SNP on the chromosome, it resumed at the beginning of the chromosome. This produced a population of 20,000 permuted genomes containing the same number of trait-associated SNPs and showing the same degree of genomic clustering as observed in the original SNP datasets. Overlaps between the permuted trait-associated variants and the annotations were counted. The odds ratios were calculated by comparing the mean number of overlaps of permuted SNPs with the observed results for the associated SNPs for each annotation. The 95% confidence intervals were obtained by calculating odds ratios for the 5th and 95th largest values of the permuted hits. The P-value of the odds ratios obtained by the permutations was calculated from the proportion of permuted datasets that were more extreme than the observations in the real trait-associated SNP set. Hence the lower bound of the P-value was 5 × 10⁻⁵ when results from the real data were more extreme than any of the 20,000 permutations.
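The circular shift and the empirical P-value described above might look like the following sketch for a single chromosome. The list-based layout and function names are assumptions. The P-value shown is one-sided for enrichment and treats ties as extreme, which is one possible reading of "more extreme".

```python
import random


def circular_shift(status, rng):
    """Rotate per-SNP trait-association flags by a random offset.

    status: list of booleans ordered by position along one chromosome.
    The offset is uniform on 1..len(status); flags pushed past the end
    wrap around to the start, preserving their relative spacing.
    """
    k = rng.randint(1, len(status))
    return status[-k:] + status[:-k]


def count_overlaps(status, annotated):
    """Number of SNPs that are both trait-associated and annotated."""
    return sum(1 for s, a in zip(status, annotated) if s and a)


def empirical_p(observed, permuted_counts):
    """Proportion of permutations at least as extreme as the observation.

    The result is floored at 1 / number of permutations, so 20,000
    permutations give a lower bound of 5e-5.
    """
    n = len(permuted_counts)
    extreme = sum(1 for c in permuted_counts if c >= observed)
    return max(extreme, 1) / n
```

Because only the status flags are rotated while annotations and LD structure stay fixed, every permuted chromosome keeps the same number and clustering of trait-associated SNPs as the real data.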