Accounting for multiple comparisons in a genome-wide association study (GWAS)
© Johnson et al. 2010
Received: 19 August 2010
Accepted: 22 December 2010
Published: 22 December 2010
Skip to main content
© Johnson et al. 2010
Received: 19 August 2010
Accepted: 22 December 2010
Published: 22 December 2010
As we enter an era when testing millions of SNPs in a single gene association study will become the standard, consideration of multiple comparisons is an essential part of determining statistical significance. Bonferroni adjustments can be made but are conservative due to the preponderance of linkage disequilibrium (LD) between genetic markers, and permutation testing is not always a viable option. Three major classes of corrections have been proposed to correct the dependent nature of genetic data in Bonferroni adjustments: permutation testing and related alternatives, principal components analysis (PCA), and analysis of blocks of LD across the genome. We consider seven implementations of these commonly used methods using data from 1514 European American participants genotyped for 700,078 SNPs in a GWAS for AIDS.
A Bonferroni correction using the number of LD blocks found by the three algorithms implemented by Haploview resulted in an insufficiently conservative threshold, corresponding to a genome-wide significance level of α = 0.15 - 0.20. We observed a moderate increase in power when using PRESTO, SLIDE, and simpleℳ when compared with traditional Bonferroni methods for population data genotyped on the Affymetrix 6.0 platform in European Americans (α = 0.05 thresholds between 1 × 10-7 and 7 × 10-8).
Correcting for the number of LD blocks resulted in an anti-conservative Bonferroni adjustment. SLIDE and simpleℳ are particularly useful when using a statistical test not handled in optimized permutation testing packages, and genome-wide corrected p-values using SLIDE, are much easier to interpret for consumers of GWAS studies.
Since the first successful genome-wide association studies (GWAS) in 2005, over 600 GWAS have been reported . Due in large part to rapid advances in genotyping technology and standardized guidelines for reporting statistical evidence, the multitude of comparisons made in a GWAS will result in both false positive (Type 1 errors) and, if the correction for multiple comparisons is overly conservative or power is inadequate, false negative (Type 2 errors) results.
is a function of n, the number of independent comparisons made, as well as α. The direct application to a GWAS is that, with a significance level typical to small studies and candidate gene studies (e.g. α = 0.05, α = 0.01, α = 0.001), the probability of not committing a GWAS-wide Type I error is very small.
The standard for evidence of significance in GWAS to securely identify a genotypephenotype association in European Americans is generally considered to be p < 5 × 10-8 or p < 1 × 10-8, for α = 0.05 and 0.01, respectively [2–5]. This standard is based on a Bonferroni correction for an assumed million independent variants in the human genome. As a consequence, the avoidance of Type 1 errors may inflate Type 2 errors. This is especially true for analyses with low power, such as rare diseases where patient numbers are limited, low frequency alleles, or genetic factors with small effect sizes. This conundrum can be resolved with extremely large study sizes, but in practice this is not always cost efficient or practical. These issues should be major considerations both for designing GWAS and interpreting GWAS results.
Several methods are commonly used to control the GWAS-wide Type I error rate: p-value adjustments for multiple comparisons have long been used when making multiple comparisons ; the use of q-values, a measure of the false discovery rate, has been proposed as a way to indirectly measure and control the Type I error rate ; a two-stage analysis of the data can be used not only to decrease the Type I error rate , but also to decrease the genotyping costs incurred ; genotype imputation can result in a net increase in statistical power [10, 11].
A Bonferroni adjustment fits our problem particularly well because many comparisons are made and a GWAS is considered agnostic, with no prior hypotheses . Several studies have estimated the number of statistical comparisons made in a GWAS [2–5], but the universal application of a one-size-fits-all significance level to GWAS studies is inappropriate. Power to detect associations is determined, in large part, by allele frequencies and their effect sizes; since these variables are constants, only sample size can be adjusted. As the sample size increases, the power to detect low frequency and/or small effect size genetic variants also increases. Newer SNP arrays, designed to more fully capture the range of SNPs in diverse human populations and to include rare SNPs hypothesized to be more likely to have larger effect sizes, will increase the number of independent statistical comparisons . Additionally, the dependent nature of genetic data, where SNPs in linkage disequilibrium (LD) are correlated to some degree, may lead to over-correction when using Bonferroni adjustments. One of the key assumptions of a Bonferroni adjustment is that all comparisons are independent. Neighboring SNPs on a chromosome tend to be inherited together in blocks and are not independent , making a strict Bonferroni adjustment overly conservative.
What is not clear, however is which SNPs fall into the informative set, so all SNPs are tested. The assumption is then made that the test statistics are distributed similarly to the test statistics from an analysis including only the informative SNPs. Based on the simulations done by Gao et. al. this seems to be a reasonable assumption .
Another relevant question is how to adjust the p-values directly, rather than relying on a significance threshold . These corrected p-values, measuring significance on the genome-wide scale, have the added benefit of easier interpretation. For example, comparing two uncorrected p-values, 6.8 × 10-8 and 4.1 × 10-10, becomes much more tractable after a genome-wide correction, resulting in corrected p-values of 0.0291 and 0.0004, respectively.
There have been a number of studies attempting to provide an accurate picture of how SNPs, and/or statistical tests of SNPs, are correlated in genome-wide studies. These fall into three general categories: variations and alternatives to permutation testing [14, 15], principal components analysis [13, 16–18], and analysis of the underlying LD structure in the genome [19–21].
We have recently genotyped 1514 European Americans for 700,078 SNPs using the Affymetrix 6.0 platform in a GWAS to search for AIDS restriction genes. Here we compare traditional Bonferroni significance thresholds with methods from each of these statistical correction strategies to identify an appropriate measure of significance in our GWAS: 1) PRESTO, an optimized permutation algorithm  verified by PERMORY ; 2) the Sliding-window method for Locally Inter-correlated markers with asymptotic Distribution Errors corrected (SLIDE) program, an alternative to permutation testing, developed to correct p-values in a GWAS using a multivariate normal distribution-based correction [14, 23]; 3) the simpleℳ method, specifically developed to calculate the number of informative SNPs being tested in a GWAS using a principal components analysis ; 4) the number of LD blocks found by the Gabriel, Solid Spine of LD, and 4-Gamete algorithms, as implemented in Haploview .
Our aim is to identify the most appropriate method for obtaining accurate GWAS-wide significance thresholds and/or corrected p-values among 700,000 linked SNPs, the best method being one that results in an accurate estimate of the number of comparisons and has reasonable computational requirements.
After filtering for a 90% sample call rate, 1,514 European Americans were successfully genotyped on the Affymetrix 6.0 platform. These subjects consisted of 1,255 HIV- infected and 259 HIV-negative individuals at risk of HIV infection; clinical categories were distributed randomly across plates and batch effects were monitored. We chose 700,078 SNPs, after filtering each SNP for >95% call rate, Hardy-Weinberg equilibrium, Mendel errors, and a minor allele frequency below 1%. After re-clustering and filtering bad SNPs, all sample call rates were >95% with an average call rate of 98.9%. Individuals were unrelated, with the exception of 8 CEPH trios used to check for Mendel errors in the genetic data. A principal components analysis of the genetic data using Eigensoft was used to identify population structure. No significant outliers were identified, however, since there is some stratification in European American populations, SNPs that contributed significantly to population structure were tagged in subsequent analyses . Association statistics were not used for the purposes of this paper, except where indicated in the multiple comparisons methods below.
To address the concern that an excess number of cases to controls would lead to less generalizable results, we analyzed a random sample of 259 cases with all 259 controls. Other than the changes in case/control ratio and sample size, all other variables were left unchanged.
PRESTO : The software package, PRESTO, was used to permute case/control status 10,000 times, and the minimum Mantel trend test p-value for all SNPs in the genome, comparing cases with controls, was recorded for each permuted data set . These minimum p-values were then used to estimate the uncorrected distribution of p-values under the null hypothesis of no true associations in the study. Each p-value was then corrected by finding the corresponding percentile of the distribution of uncorrected p-values, and a significance threshold for a study-wide significance level of α was be obtained by finding the αth percentile of the uncorrected distribution. This distribution was used as the standard by which each method's accuracy is gauged, and corresponding significance levels for all other methods were estimated using this distribution. Results from PRESTO were compared with the results from PERMORY, another optimized permutation testing software package that was recently released .
SLIDE : The SLIDE software package was used to implement a multivariate normal distribution-based approximation to a permutation test, using the quantitative trait option, with 10,000 iterations [14, 23]. For comparisons with the other methods considered, SLIDE corrected p-values were used to estimate the GWAS-wide significance threshold by finding a corrected p-value equal to the desired study-wide significance. level, α.
Simpleℳ : The simpleℳ method , based on a principal components analysis of the data, was implemented in R, version 2.9.0 , following the example code provided by Gao et al. https://dsgweb.wustl.edu/rgao/simpleM_Ex.zip. This measure of the number of informative SNPs was then used in a Bonferroni adjustment to estimate the GWAS-wide significance threshold. Each chromosome was broken into regions of approximately 5,000 SNPs due to computational constraints. To choose appropriate regions, with as little LD between adjacent regions as possible, we chose cut points between LD blocks identified by Haploview. A second analysis using the largest regions possible, given the memory available, was also explored to see if results were dependent on the region size.
LD blocks were inferred in our GWAS data using the three methods available in Haploview . The number of LD blocks across the human genome, including interblock SNPs (i.e. singleton SNPs), was used in a Bonferroni adjustment to estimate GWAS-wide significance thresholds . Entire chromosomes could not be analyzed, due to memory constraints, so smaller regions were analyzed. All SNPs from the last full LD block of the previous region were included in the analysis of the next region to ensure complete LD blocks.
The Gabriel protocol, the default method for Haploview, was used with an upper D' confidence interval bound of 0.98, a lower D' confidence interval bound of 0.70, and with 5% of informative markers required to be in strong LD . The Solid Spine of LD algorithm  was used with minimum D' value of 0.8, as suggested by Duggal et al. . The 4-Gamete test was run setting the cutoff for frequency of the fourth pairwise haplotype at 1% [30, 31].
Summary of Analysis Results
Corresponding α level
0.71 × 10-7
0.76 × 10-7
0.82 × 10-7
1.09 × 10-7
2.72 × 10-7
3.06 × 10-7
3.71 × 10-7
Difference in Significance Threshold in a Subset of the Data
Δ Significance Threshold
-8 × 10-11
-8 × 10-10
-5 × 10-9
-6 × 10-8
7 × 10-7
-8 × 10-7
Bonferroni : The standard Bonferroni correction, simply using the total number of SNPs tested in the genome-wide significance level calculation, was 7.1 × 10-8, which corresponded to a genome-wide significance level of α ≈ 0.05 when compared with PRESTO (see Table 1). While a permutation test may not result in a large improvement in the corresponding genome-wide significance level when compared with a standard Bonferroni correction in this SNP set, other, denser SNP sets will result in greater disparities in significance levels.
SLIDE : The significance threshold identified by SLIDE was 1.1 × 10-7, which corresponded to a genome-wide significance level of α = 0.07 when compared with PRESTO (see Table 1). The significance threshold found in the analysis of the smaller sample was remarkably similar, differing only by 5 × 10-9 (see Table 2). Over all, these results indicate that SLIDE is an excellent alternative to permutation testing. Additionally, the corrected p-values provide increased ease in interpretation of GWAS results.
The simpleℳ method is currently the fastest way to calculate the effective number of independent tests in a GWAS , but due to the O (n 2) nature of this algorithm the genome needs to be broken up into small regions to maintain this computational speed. This adds complexity to the analysis and requires a significant amount of pre-analysis. Considering the many examples of long range LD across the genome, simpleℳ could also lead to a slightly more conservative estimate in some studies .
Comparison of α levels when restricting the definition of a haplotype
Corresponding α level
D'U > 0.98
2.72 × 10-7
D'L > 0.70
D'U > 0.98
2.11 × 10-7
D'L > 0.85
Cutoff = 1%
3.06 × 10-7
Cutoff = 0.5%
2.50 × 10-7
D' = 0.80
3.71 × 10-7
D' = 0.95
2.79 × 10-7
Nicodemus et al.  noted that estimates may be more or less conservative under varying levels of LD. An alternate LD algorithm or parameter constraints could be found that would result in a more accurate estimate , but this would vary significantly depending on the sample size, the set of SNPs, and the underlying level of LD structure in the population. This is further illustrated in the large differences found using the Gabriel and Solid Spine of LD algorithms on a subset of the individuals in this study (see Table 2). While LD blocks do provide key information on patterns of LD and how SNPs are correlated, providing invaluable information for interpreting GWAS results and for the planning of follow-up studies, we find the use of significance thresholds derived from LD blocks to be too variable for general application to GWAS data.
A one-size-fits-all Bonferroni correction, although conservative, may not result in a large Type II error rate with a sample size in the tens of thousands, but as the sample size drops, so does statistical power. In studies where gathering large numbers of cases is prohibitive (e.g. when disease prevalence is low), a Bonferroni correction becomes overly conservative by detrimentally inflating the Type II error rate. The methods considered here can ameliorate this loss of power and make interpretation of study results less enigmatic.
The results from the PRESTO, SLIDE and simpleℳ methods appear to be equally good in population data genotyped on the Affymetrix 6.0 platform in European Americans (α = 0.05 thresholds between 1 × 10-7 and 8 × 10-8), and each presents a modest gain in power over the strict Bonferroni thresholds advocated by some [2–5]. The SLIDE and simpleℳ methods may be less dependent on the number of individuals in the study, and will be particularly useful when using a statistical test that is not supported by optimized permutation packages (e.g. when modeling survival or longitudinal data) and when the SNPs being tested are sufficiently dense. SLIDE not only has much nicer computational properties when compared to simpleℳ, but the corrected p-values measuring significance on the genome-wide scale are easier to interpret. While the idea of an even standard across studies is appealing, the traditional standard of presenting p-values in the context of the study more accurately represents the data.
This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD http://biowulf.nih.gov. We thank the individuals who participated in the HGDS, MACS, MHCS, and SFCC cohort studies, as well as the physicians and researchers responsible for recruitment and sample collection. We also thank Michelle Hall, Michael Malasky, Lisa Maslan, Mary McNally, and Jami Troxler who performed the genotyping, Leslie Chinn, Sher Hendrickson, Carl McIntosh, and Joan Pontius who helped with quality control and annotation of the GWAS data, and Julie Johnson for her comments and discussion.
This project has been funded in whole or in part with federal funds from the NationalCancer Institute, National Institutes of Health, under contract HHSN261200800001E. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government. This research was supported [in part] by the Intramural Research Program of NIH, National Cancer Institute, Center for Cancer Research.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.