Exome localization of complex disease association signals

Background Genome-wide association studies (GWAS) of common diseases have had a tremendous impact on genetic research over the last five years; the field is now moving from microarray-based technology towards next-generation sequencing. To evaluate the potential of association studies for complex diseases based on exome sequencing we analysed the distribution of association signal with respect to protein-coding genes based on GWAS data for seven diseases from the Wellcome Trust Case Control Consortium. Results We find significant concentration of association signal in exons and genes for Crohn's Disease, Type 1 Diabetes and Bipolar Disorder, but also observe enrichment from up to 40 kilobases upstream to 40 kilobases downstream of protein-coding genes for Crohn's Disease and Type 1 Diabetes; the exact extent of the distribution is disease dependent. Conclusions Our work suggests that exome sequencing may be a feasible approach to find genetic variation associated with complex disease. Extending the exome sequencing to include flanking regions therefore promises further improvement of covering disease-relevant variants.


Background
While development of next-generation sequencing technologies has opened exciting new opportunities to identify disease relevant mutations, whole genome sequencing of large patient cohorts is still prohibitively expensive. Therefore, more modest sequencing-based association studies are being undertaken. One way to overcome the cost limitation is the targeted sequencing of selected genomic regions such as genes or exons using enrichment methods as reviewed by Summerer [1]. Recently, several groups have successfully identified causal mutations for monogenic disorders by sequencing all exons (the "exome") in a small number of patients [2][3][4]. A similar approach has been proposed for complex diseases, but how important is the exome in complex disease? Cellular processes are ultimately driven by proteins, encoded by the exons, but genetic variation leading to a disorder is not necessarily located in protein coding regions. Regulatory elements that affect gene expression can play a role and, although they tend to be clustered around genes [5,6], some of them can be many kilobases (kb) upstream or downstream of the transcribed region. Further complexity is for example added by microRNAs (miRNA) with regulatory functions; some of which have been linked to disease phenotypes [7]. Whole genome sequencing may be available in the future as sequencing costs are dropping rapidly. However, for large sample numbers it is still too costly, therefore we want to identify regions that are most likely to contain disease-related variation in the human genome. To address this question we assess how disease-associated genetic variants are distributed with respect to protein-coding genes. Our analysis is based on genome-wide association study (GWAS) data for seven common diseases genotyped by the Wellcome Trust Case Control Consortium (WTCCC) (see Table 1) [8]. We first assign every genotyped SNP to its closest gene. SNPs are then binned into 10 kb windows upstream and downstream of genes. A central gene-window contains all SNPs that fall within the transcribed region of a gene and an exon-window all SNPs that fall into a coding exon. To assess the amount of association signal within every window we determine the proportion of SNPs with a p-value smaller than a threshold α.

Results
The WTCCC study was carried out using the Affymetrix 500K GeneChip which covers~500,000 SNPs spread out over the human genome. To explore how much of the association signal is linked to the exome, we assign every SNP to its closest gene based on genomic localisation ( Table 2); all SNPs are subsequently classified into groups according to their distance from the closest gene, i.e. "within a gene", "less than 10 kb upstream", "between 10 kb and 20 kb upstream", etc. Table 3 shows the total number of SNPs represented on the Affymetrix 500k GeneChip for every window. For each SNP we perform association tests comparing the frequency of the sequence variants between diseased and healthy individuals as described in the Methods section. In the original publication by the WTCCC careful multiple testing correction procedures were employed to ensure the identification of association signals with genome-wide significance, typically with a p-value p < 5 x10 -8 [9]. Due to the high number of statistical tests, many SNPs are statistically significant at p < 0.05, but do not pass this genome-wide multiple testing correction. Amongst the latter might be true associations. We assume a uniform distribution of false positive signal over all SNPs, whereas true positives might be enriched in specific locations relative to the coding sequences. Here, we (arbitrarily) define a SNP where p < α with α {0.1, 0.01, 0.001} as "suggestive" and analyse the distribution of "suggestive" p-values with respect to exons and genes by examining the proportion of SNPs in each sequence window achieving p-values below α. To establish the significance of the association signal for each sequence window we perform 100,000 permutations of the disease status to derive a 95% confidence interval as described in the Methods section.
The distribution of association signal varies between diseases. Figure 1 shows the enrichment of association signal for each disease at α = 0.01 (Additional File 1: Figures S1 to S7 and Additional File 2: Tables S1 to S9 for α {0.1, 0.01, 0.001}).
Within genes we observe significant enrichment of association signal for Type 1 Diabetes T1D (α = {0.1, 0.01, 0.001}), Rheumatoid Arthritis RA (α = {0.01, 0.001}), Bipolar Disorder BD (α = 0.01), Crohn's Disease CD (α = 0.001) and Type 2 Diabetes T2D (α = 0.001). Coding exons typically only constitute a small fraction of the total length of a gene (~3%), but changes to the amino acid sequence are likely to alter or disrupt the function of a gene. When we therefore analyse the enrichment of association signal in exons only (purple triangles), we observe a significant enrichment of association signal for T1D (α = {0.1, 0.01, 0.001}), RA (α = {0.01, 0.001}), Hypertension HT (α = 0.1), BD (α = 0.01) and CD (α = 0.001). Furthermore, for almost all diseases and levels of α the proportion of "suggestive" SNPs in coding exons is higher than in the whole gene (except BD and T2D at α = 0.001) indicating that the majority of the association signal is to be found in the coding region rather than the introns. Outside genes, T1D, RA and CD show consistent enrichment of association around the central gene window; this enrichment becomes stronger with increasingly stringent levels of α. For the other diseases we observe sporadic enrichment.
For BD, Moskvina et al. compared the distribution of SNPs within and outside of genes independently of distance to the gene [10]. They reported a significant enrichment of SNPs with p < 0.01 within genes. Our analysis confirms their findings: we, too, do not find enrichment within genes for any other α for BD (Additional File 1: Figure S4), nor do we observe consistent enrichment in the vicinity of genes for BD.
One factor that might be influencing our analysis is linkage disequilibrium (LD). In an ideal situation all SNPs would be inherited independently. However, SNPs located closer to each other on a chromosome are more  likely to be inherited together, because the likelihood for separation due to crossing-over events is lower. When deriving confidence intervals by permuting disease labels we maintain LD structure, i.e. the confidence intervals take the LD structure into account. Yet, a true association signal could extend over large regions of the genome if it falls into a LD block. Most of the SNPs in such a region could appear to be associated with the phenotype. If such a region is rich in genes we would observe an enrichment that is inflated, because in genedense regions the windows close to the central gene window are more populated since every SNP is assigned to closest gene. The Major Histocompatibility Complex (MHC) region on chromosome 6 is such a gene-rich region and known to have a high level of LD. RA and T1D have their most significant signal within the MHC region. We therefore repeated our analysis but excluded the MHC region on chromosome 6 (position 25,930,839 to position 33,297,046, NCBI assembly GRCh37). For all diseases except RA and T1D the distribution changes only marginally after removal of the MHC region (Additional File 1: Figures S1 to S7); but for RA-MHC and T1D-MHC enrichment in the vicinity of genes is substantially reduced (Figure 2, Additional File 1: Figures  S2 and S3). T1D-MHC continues to display a moderate but significant enrichment in and around genes and we find a significant enrichment within exons. In contrast we observe no significant enrichment for RA-MHC.
To test if we can increase the power of our analysis we combine the data for all seven diseases (Figure 3, Additional File 1: Figures S8 and S9). When the MHC region is excluded from the analysis (Figure 3 and Additional File 1: Figure S9) the combined distribution of all seven diseases shows moderate enrichment. The enrichment is still significant for α = 0.001 and we observe a bell shaped distribution around the gene for all three thresholds α (Additional File 1: Figure S9), suggesting a deviation from uniform distribution. For all thresholds α the proportion of SNPs with p < α is higher in exons than in the whole gene (purple triangle in Figure 3).
In this work we analyse the distribution of SNPs with "suggestive" p-values in respect to genes. To assess whether the observed enrichment is driven by a few genes only, we count the number of genes with SNPs that have "suggestive" p-values (p < α) ( Table 4). We considered SNPs that are located within less than 100 kb of a gene and SNPs that are located within an exon separately (Table 4). Gene counts for the observed data are compared to gene counts from 100,000 permutations of the disease status. We observe significantly  more genes with "suggestive" SNPs for the diseases and levels of α where we observe an increase of association signal. This suggests that the observed enrichment is not driven by a few genes only. For T2D (α = 0.001) and BD (α = 0.01) the number of genes with "suggestive" SNPs does not reach statistical significance, but is notably above the centre of the 95% confidence interval. Diseases that show an enrichment of association signal around the gene (CD, T1D, RA) also have substantially larger number of genes with "suggestive" SNPs than the diseases that do not show enrichment (T2D, CAD, HT, BD) ( Table 4).

Discussion
In this work we consider SNPs that show a "suggestive" association (p-value < α with α {0.1, 0.01, 0.001}) with a disease. Using the genome annotation (NCBI Gene Build downloaded in November 2009, NCBI assembly GRCh37) we find these "suggestive" SNPs to be enriched in genes and their vicinity. We observe a significant enrichment of association signal in protein-coding exons (T1D, CD, HT, BD), in genes (T1D, CD, BD, T2D) and in regions up-and downstream of genes (T1D, CD). The distribution of association signal varies between the different diseases, possibly due to different genetic architectures of the analysed diseases. Yet, for all seven diseases we found a consistently stronger association signal in coding than in non-coding regions. A major issue in our analysis and various related studies [11][12][13][14][15][16][17][18][19][20] is the presence of linkage disequilibrium (LD) which makes it very difficult to allocate association signal correctly. The causal variant might be located several kilobases from a variant in LD, which substantially complicates the identification of the causal gene. In our analysis LD might in particular inflate the enrichment for the immune related diseases RA and T1D, for which most of the association signal is located in the MHC region. We address this problem by taking into account LD structure when deriving confidence intervals and by removing the MHC region. While removing the MHC region did not influence the enrichment results for most of the diseases examined here we found profound effects on the results for T1D and RA. The WTCCC study has shown a substantial fraction of the variation associated with RA and T1D are located within the MHC region; thus ignoring this region in the enrichment analysis is likely to result in an underestimate of the true enrichment signal, while including the MHC region probably leads to an inflated signal.
Because of the way we assign SNPs to genes, the total number of SNPs per window is substantially higher within genes and their vicinity (Table 3). This is reflected by the 95% confidence interval which becomes wider with increasing distance from the central gene window. As a result the statistical power to detect a significant enrichment of association signal is higher closer to the central gene window. However, if the observed enrichment was only due to an increase in statistical power around the gene window we would expect a similar percentage of SNPs with p < α for every window, which would become significant around the gene due to the smaller confidence interval. In contrast, for all diseases that show enrichment we observe an increase of association signal around the gene window rather than just a decrease of the 95% confidence interval.
Our approach takes into account distance between variants and genes, which ultimately allows us to detect enrichment; however, effects of genomic features that have a variable distance with respect to genes (e.g. enhancers) will not be detected by our approach. In fact the WTCCC could not associate a number of replicated GWAS hits with any gene because they fall into so-called gene deserts, with the closest gene being over 100 kb away [8]. Detection and interpretation of these variants is likely to improve as the annotation of sequence elements such as miRNAs, enhancers and other regulatory features is increasingly available and some are already included in the exome enrichment kits offered by commercial suppliers. As long as annotation of non-coding elements is still sparse, sequence conservation might be a reasonable proxy for an analysis similar to the one presented here.
Our analysis of genetic association is limited by the SNPs represented on the Affymetrix 500K GeneChip, which provides exonic SNPs for~14.5 % of all genes. Other genotyping arrays might capture association signals that are not detectable using this platform and as a consequence the distribution of association signal might differ.
In addition to common variants, sequencing allows for the detection of rare variants. Rare variants that show association with a disease might be distributed differently from the common variants analysed here. We repeated our analysis for rare SNPs (minor allele frequency of less than 0.05), but did not observe a bias towards genes or exons. This is mainly due to the lack of statistical power, because only 10% of the SNPs that pass Quality Control are rare. Little is known about the role of rare variants in genetic traits although candidate gene studies for common genetic traits have found rare alleles with strong effect sizes in coding regions [21,22] and the majority of known causal variants in mendelian disorders are nonsynonymous mutations or mutations in splice sites [2].

Conclusions
We found a consistently stronger association signal in coding than in non-coding regions for all seven diseases analysed by the WTCCC. We also observed an enrichment of association signal in the vicinity of genes which varies between diseases. We therefore recommend that sequencing efforts focus on the exome and, depending on the disease, to extend the targeted sequence to include other annotated elements, as well as regions upand downstream of genes. The latter can be very costly. Whereas the entire exome only accounts for 2.3% of the genome, including 40 kb flanking windows would mean sequencing approximately 34% of the genome. Including 10 kb flanking window reduces the amount of sequence to 14%; focusing on annotated regulatory elements reduces it further. Until whole genome sequencing for large cohorts becomes affordable sequencing "extended" exomes seems to provide a sensible way to reduce costs while maximizing the chances of detecting disease-associated variants.

GWAS data quality control and association testing
GWAS of seven diseases have been reported by the WTCCC [8]. Approximately 2,000 cases and 3,000 shared controls were genotyped for every disease on the Affymetrix GeneChip 500K Mapping Array Set (Table 1). Our analysis includes moderate associations which are more susceptible to study biases. Moreover, we wanted to make our results comparable to a related study by Moskvina et al. [10]. We therefore re-analyzed the WTCCC I data performing very conservative Quality Control using PLINK v1.06 [23]. In addition to SNPs and individuals in the exclusion lists provided with the genotyping data, we applied more stringent quality control criteria. Based on the pooled case/control dataset we excluded SNPs with Hardy-Weinberg equilibrium p < 0.001, a minor allele frequency of less than 0.01 or call-rates of less than 0.97. Association testing was performed using an Armitage trend test (1df). We manually checked the most strongly associated SNPs for every disease to ensure consistency with the original WTCCC I results. To take into account inflated test statistics caused by population stratification we corrected χ 2 values using the genomic control metric λ median as described by Devlin and Roeder [24]. The estimated λ median (for simplicity denominated as λ) range from 1.04 to 1.12 (Table 1) and are in good agreement with the original values reported by the WTCCC. For every disease, 100,000 permutations of the disease status were performed using the PLINK max(T) permutation method and association p-values were calculated.

Gene to SNP assignment
A tab-delimited text-file (seq_gene.md) containing genomic coordinates for all genes was downloaded from the NCBI ftp-server in November 2009 [25]. Only entries for the human reference sequence (NCBI assembly GRCh37) and protein-coding genes were retained. Genes mapping to sex-chromosomes, the mitochondrial chromosome, unassembled contigs or alternative Numbers are derived for all seven diseases and for T1D and RA after removal of the MHC region (T1D-MHC and RA-MHC). We consider SNPs that are located within less then 100 kb of a gene and SNPs that are only located within an exon. We derived 95% confidence intervals for the number of genes with p < α based on 100,000 permutations of the disease status.
haplotypes were discarded. SNPs on the GeneChip 500K Mapping Array Set were assigned to the remaining genes. Because this genotyping platform is based on the previous assembly of the human genome (NCBI 36) all SNP positions were converted to the latest assembly using the "Lift-Over" tool on the GALAXY website [26]. SNPs were assigned to a gene if they are located within the primary transcript of that gene. None of the SNPs on the Affymetrix 500K GeneChip fell into a region where the primary transcripts of two genes overlapped. All other SNPs were assigned to their closest gene and were then binned into 10 kb windows upstream and downstream of the gene. Positions for coding exons were obtained using GALAXY [26] and SNPs within coding exons were labelled as such. In total we assigned SNPs to approximately 17,000 genes. Table 2 summarises the SNP to gene assignment. We performed our analysis with and without the MHC region. Removal of the MHC region (chromosome 6, position 25,930,839 to position 33,297,046, NCBI assembly GRCh37) excluded 1,473 SNPs and 185 genes.

Enrichment Plots
We determined the extent of association signal in every 10 kb window upstream and downstream of a gene and within genes. For each window and for coding exons we calculated the ratio of the number of SNPs with an association p-value below threshold α and the total number of polymorphic SNPs within that window. The same procedure was applied to the results of each of the 100,000 permuted data sets. Thus, we derived a 95% confidence interval for the proportion of SNPs significant at a p-value α when no association is present. To highlight the overall trend, values were smoothed for the observed data by averaging values over a 50 kb sliding window. Where results are shown for the combined datasets, the numbers of SNPs per window (observed and permuted) are averaged over all seven diseases.

Additional material
Additional File 1: Enrichment Plots for all Diseases and Thresholds. This file presents the enrichment plots for all seven diseases and thresholds α with and without the MHC region ( Figure S1 to S7). Figure  S8 and S9 show the combined enrichment of all seven diseases for all thresholds α with and without the MHC region. Additional