Detecting positive selection from genome scans of linkage disequilibrium
© Huff et al. 2010
Received: 24 August 2009
Accepted: 5 January 2010
Published: 5 January 2010
Though a variety of linkage disequilibrium tests have recently been introduced to measure the signal of recent positive selection, the statistical properties of the various methods have not been directly compared. While most applications of these tests have suggested that positive selection has played an important role in recent human history, the results of these tests have varied dramatically.
Here, we evaluate the performance of three statistics designed to detect incomplete selective sweeps: LRH, iHS, and ALnLH. To analyze the properties of these tests, we introduce a new computational method that simulates gene trees influenced by recent positive selection under complex population histories with migration and changing population sizes. We demonstrate that iHS performs substantially better than the other two statistics, with power of up to 0.74 at the 0.01 level for the variant best suited to full genome scans and power of over 0.8 at the 0.01 level for the variant best suited to candidate gene tests. The performance of the iHS statistic was robust to complex demographic histories and variable recombination rates. Genome scans involving the other two statistics suffer from low power and high false positive rates, with false discovery rates of up to 0.96 for ALnLH. The difference in performance between iHS and ALnLH did not result from the properties of the statistics themselves, but from the different methods used to mitigate the multiple comparison problem inherent in full genome scans.
We introduce a new method for simulating genealogies influenced by positive selection with complex demographic scenarios. In a power analysis based on this method, iHS outperformed LRH and ALnLH in detecting incomplete selective sweeps. We also show that the single-site iHS statistic is more powerful in a candidate gene test than the multi-site statistic, but that the multi-site statistic maintains a low false discovery rate with only a minor loss of power when applied to a scan of the entire genome. Our results highlight the need for careful consideration of multiple comparison problems when evaluating and interpreting the results of full genome scans for positive selection.
Until a few years ago, studies of positive selection were limited to sequence data from a single gene covering only a few thousand nucleotides. Now that detailed genetic maps of human variability are available in many populations, it is possible to measure the signature of positive selection on a genomic scale [1, 2]. Traditional tools for detecting selection are not applicable to these large SNP datasets, as most traditional tests require sequence data with no ascertainment bias. However, with dense SNP coverage across the genome, it is now possible to accurately measure the decay of linkage disequilibrium (LD) over long genomic distances, opening the door for new tests that can detect the fingerprint of selection across hundreds of thousands of nucleotide positions. Most of the tests that measure this signal of selection have been constructed using one of two basic statistics, Extended Haplotype Homozygosity (EHH) and Fraction of Recombinant Chromosomes (FRC) [3, 4]. Variants of both statistics have been used in multiple whole genome scans to provide a global view of recent positive selection in humans.
Most of the discussion surrounding these genome scans has focused on the similarities of their results, since all indicate that positive selection has been a surprisingly important force in recent human evolution [4–7]. However, beneath the broad picture lie curious differences in the results of the two approaches. In 2006, Wang et al. identified 1799 candidate genes as potential targets of recent positive selection in a scan of the human genome based on the FRC statistic. Later that year, Voight et al. identified 431 candidate genes in a similar genome scan using the iHS statistic, which is based on EHH. Both groups calculated a summary statistic at each site that measured the LD associated with that site, and then aggregated those statistics to identify candidate loci from the outliers of the empirical distribution, with Voight et al. including 1% of the distribution and Wang et al. including 1.6%. So although the Wang et al. study was only slightly less restrictive, it identified over four times the number of candidate loci. One possible explanation is that FRC is a more powerful statistic for detecting recent positive selection. However, Voight et al. estimated that the power of iHS to detect recent positive selection was approximately 33% for the range of allele frequencies considered in Wang et al. If their estimate is accurate, the discrepancy between the two tests cannot be fully explained even if the power of the FRC test is 100%. Additional genome scans have demonstrated that the differences in these results are not artifacts, and instead represent stable differences between the two statistics [6, 8].
While several studies have estimated the power of EHH statistics to infer positive selection, the statistical power of FRC has not yet been explored. To address this gap, we use simulated data to compare the properties of FRC and EHH statistics. We first examine the power of the single-site statistics of each method under explicit null models of neutrality and alternative models of selection. We then estimate the false positive rate, power, and false discovery rate of each test when applied to an empirical distribution of its respective statistic based on a combination of neutral and selected loci.
The available computational methods for simulating genealogies cannot easily model complex demographic scenarios combined with the presence of positive selection. Most methods require a single population of constant size. This is problematic when evaluating the statistical power of LD-based tests in the presence of positive selection, as population bottlenecks and subdivision can create LD that mimics that generated by selection. Here, we introduce a new approach for simulating positive selection in complex population histories with subdivision, migration, bottlenecks, and expansions in a coalescent framework. With this approach, we first generate a set of potential allele trajectories for the favored allele using forward-in-time simulations. Then for each backwards-in-time simulation, we select an allele trajectory at random and condition the coalescent simulation on the population sizes and migration history of the favored allele as specified by the allele trajectory (see Methods).
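The forward-in-time stage can be sketched as a rejection sampler. This is a minimal single-population illustration under our own assumptions, not the authors' implementation: the function name, the genic selection model, and the acceptance window around the target frequency are all hypothetical.

```python
import random

def simulate_trajectory(N, s, t, target, tol, max_tries=1000):
    """Rejection-sample a favored-allele trajectory by forward
    Wright-Fisher simulation: start from a single copy, apply genic
    selection with coefficient s each generation, and accept only
    trajectories whose final frequency lands within `tol` of `target`.
    The accepted trajectory then conditions the backward coalescent."""
    for _ in range(max_tries):
        freq = 1.0 / (2 * N)              # new mutation: one chromosome
        traj = [freq]
        for _ in range(t):
            if freq in (0.0, 1.0):        # allele lost or fixed early
                break
            # expected frequency after selection, before drift
            p = freq * (1 + s) / (1 + freq * s)
            # binomial drift: resample 2N chromosomes
            freq = sum(random.random() < p for _ in range(2 * N)) / (2 * N)
            traj.append(freq)
        if abs(traj[-1] - target) <= tol:
            return traj
    return None                           # parameters too restrictive
```

As the Methods discussion notes, parameter combinations that rarely reach the target frequency will require a prohibitive number of forward iterations; the `max_tries` guard makes that failure mode explicit.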
In our analysis, we considered one test derived from the FRC statistic, ALnLH, and two tests derived from the EHH statistic, LRH and iHS [3–5]. To evaluate the effects of population history on the power of each of these statistics, we considered four demographic models: constant population size, expansion, expansion with migration, and bottleneck with migration. For the final three models, we obtained parameter values from the best fitting model in . From this model, we used Africa to represent the expanding population and Europe to represent the bottlenecked population. For histories with migration, we allowed low levels of migration between Europe, Asia, and Africa as specified in their model . For selection models, we used the estimate for the average strength of recent positive selection in humans of s = 0.022 from . We set the origin generation of the favored allele for each model to produce an average allele frequency of approximately 0.5, which met our goal of providing coverage for allele frequencies between 0.2 and 0.8.
For the results presented above, we calculated a statistic for each SNP and evaluated the power to detect selection, with the null hypothesis of neutrality and the alternative hypothesis of strong positive selection acting on the SNP in question. This is an appropriate test for positive selection when the investigator has a prior hypothesis about the potential influence of natural selection and when there are a small number of candidate loci. However, as we demonstrate below, when this simple strategy is applied to an uninformed scan across the genome, it introduces a multiple testing problem that heavily weights the significant results toward false positives. The testing methodology that Voight et al.  employed for iHS addresses this problem by binning the genome into 100 Kb segments, and then calculating the fraction of SNPs in each segment with |iHS| greater than 2.0 as their test statistic. This approach takes advantage of the tight linkage of genetic hitchhikers near the favored locus to reduce the number of tests from the number of SNPs in the study to the number of 100 Kb regions in the genome. Their candidate genes were those in regions with the highest fraction of significant iHS scores, taking the top 1% of the empirical distribution. By lowering the criteria for a significant iHS score and considering the total fraction of significant results, they were able to test each 100 Kb region one time at the 0.01 level. In contrast, Wang et al. set a higher threshold for a significant result and tested each SNP individually, taking the top 1.6% of the distribution. All genes within 100 Kb of a significant result were included as candidate genes, which resulted in potentially hundreds of tests at the 0.016 level for each 200 Kb region .
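The binning strategy described above can be sketched in a few lines. The function names, input layout, and non-overlapping window bookkeeping are our own illustration, not the Voight et al. pipeline.

```python
from collections import defaultdict

def window_fractions(positions, ihs_scores, window=100_000, cutoff=2.0):
    """For each non-overlapping window, compute the fraction of SNPs
    with |iHS| greater than `cutoff` (the per-window test statistic)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pos, score in zip(positions, ihs_scores):
        w = pos // window
        totals[w] += 1
        if abs(score) > cutoff:
            hits[w] += 1
    return {w: hits[w] / totals[w] for w in totals}

def top_windows(fractions, top=0.01):
    """Candidate windows: the top `top` fraction of the empirical
    distribution of per-window statistics."""
    ranked = sorted(fractions, key=fractions.get, reverse=True)
    k = max(1, int(len(ranked) * top))
    return ranked[:k]
```

The key point is that the number of hypotheses tested drops from the number of SNPs to the number of windows, at the cost of diluting the signal from any single SNP.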
From our evaluation of false discovery rates, we can estimate the number of false discoveries for each genomic scan. Of the 1799 candidate genes identified by Wang et al. , we estimate that 1331 to 1727 of those were false discoveries. For Voight et al. , we estimate that there were 0 to 231 false discoveries over 431 candidate genes. The estimates for true discoveries are then 72 to 468 for Wang et al. , and 200 to 431 for Voight et al. . So after adjusting for false discoveries, the two studies are in close agreement. Given our true discovery and power estimates for iHS, we estimate that there are between 600 and 1000 variants with an allele frequency of at least 0.2 that have been influenced by strong recent positive selection in the HapMap phase 2 populations.
While the single-site statistics used in these studies perform equally well under simulations with constant recombination rates, several factors inhibited the performance of ALnLH. These factors primarily involve implementation details of the test and not the properties of the FRC statistic itself. Since both ALnLH and iHS methods measure the long range LD for each allele at each focal site, it may be possible to design a test based on the FRC statistic that matches or exceeds the performance of iHS using the Voight et al. implementation as a template . Five features that should be included in such a test are local controls for recombination rate, standardization for allele frequency, population specific critical regions, external inference of gametic phase, and the aggregation of results at nearby loci to mitigate multiple testing problems. While a future FRC test may prove more valuable, the false positive and false discovery rates are too high in the current ALnLH implementation to provide a useful set of candidate genes in genomic scans.
Throughout our analysis of EHH statistics, iHS consistently outperformed LRH. Since specific guidelines are not available for determining the core haplotype region and level of EHH decay for LRH, we may have underestimated the power of LRH. However, we tested 4 sets of parameter values using examples in Sabeti et al. as a guide , and none of the tests were able to match the performance of iHS in any of our scenarios.
Our estimates for the power of the iHS test were consistently higher than those reported in Voight et al. , but it is important to distinguish between our single-site analysis and our site-aggregation analysis when comparing the two results. Figures 1, 2, and 3 report the power of the single-site statistic, which is based on one iHS value measuring the decay of LD surrounding one SNP. This is not directly comparable to the power analysis in Voight et al., which was based on the aggregation of 51 iHS scores for SNPs near the favored allele . This aggregation strategy successfully mitigates the multiple testing problem inherent in a full genome scan by incorporating information from potential genetic hitchhikers near the favored allele. However, as demonstrated in the comparisons between Figures 3 and 4, the power of the site-aggregation test is appreciably lower than that of the single-site test. This tradeoff is worthwhile for uninformed genome scans involving large numbers of SNPs, since it reduces the number of tests by one or more orders of magnitude. However, candidate gene studies that involve only a few potential targets of selection do not suffer from the same multiple testing problems as full genome scans. For these studies, the single-site iHS test is a better choice, providing an average power of 0.81 at the 0.01 level according to our estimates (Figure 3).
There are two other considerations when comparing the power analysis from Voight et al.  with the results from this study. First, Voight et al.  established critical values from null models with histories of population bottlenecks, but tested those values against selection models where the population size was constant. Because population bottlenecks also introduce LD, this resulted in conservative critical regions and lower power. Second, our strategy for simulating ascertainment bias resulted in higher SNP density and more low frequency alleles compared to Voight et al. , which probably elevated the power of the test in our analysis when the favored allele was relatively rare.
As pointed out by Przeworski et al., empirical scans for selection will miss many selection events when they are applied to genomes that have been heavily influenced by recent positive selection . This is evident in Figure 4a, where the power of iHS is 0.74 when selective sweeps are very rare, 0.69 when 1% of the genome is influenced by positive selection, and drops to 0.33 when just 5% of the genome is influenced by selection. This effect could be mitigated by choosing critical values from a subset of the genome that has a smaller proportion of recent selection events. We expect a priori that nongenic regions are less likely to be targeted by selection. This expectation is supported in Voight et al. , where they demonstrated a highly significant enrichment for genic regions within the group of loci identified as potential targets of positive selection (p < 1E-20). By establishing critical regions from nongenic regions, it may be possible to substantially improve the power of genome scans for recent positive selection with only a small increase in false positives.
In agreement with previous findings, our results demonstrate that the multi-site iHS test is an excellent choice for detecting incomplete selective sweeps in full genome scans, with power between 0.33 and 0.74 and a false discovery rate between 0 and 0.53 at the 0.01 level. In comparison, the power of the ALnLH test in full genome scans was approximately 25% lower, with a false discovery rate between 0.74 and 0.96. However, the statistical properties of the two statistics are quite similar when applied to a single site in a candidate gene test, with power of over 0.8 at the 0.01 level, demonstrating the importance of the differences in the adjustments made for multiple tests in full genome scans. Our results highlight the need for careful consideration of multiple comparison problems when evaluating and interpreting the results of full genome scans for positive selection. The algorithm we present for simulating genealogies influenced by positive selection will allow for more thorough exploration of complex demographic scenarios when evaluating methods for detecting positive selection.
To simulate positive selection, we employed the coalescent framework first proposed by Kaplan et al. , where the selected and neutral alleles are treated as two subdivided populations. In this method, the trajectory of the favored allele is determined separately through model or simulation, which provides the population sizes of the two allelic classes throughout the coalescent simulation. Though there are a variety of existing methods for generating the trajectory of the favored allele, most are limited to simple models of demography and selection. The original method of Kaplan et al. models strong balancing selection by assuming that allele frequencies remain constant. Braverman et al. introduced a model of directional selection, but the trajectory path is deterministic. Stochastic simulations of the trajectory have generally been limited to backward time Moran models, which require a single population of fixed size [13, 16–18]. Slatkin proposed an importance-sampling method that weights realizations of a reversed Wright-Fisher model according to the conditional probability of the trajectory path in forward time given the observed genetic data . This model allows for variable population size, and could be extended to include population subdivision and migration. However, the method is computationally intensive and the introduction of n subpopulations with migration would increase computational complexity by a factor of n² + n. Pickrell et al. adopt a hybrid approach, where a single population is initialized by coalescent simulation until the first population split. From that point on, the simulation occurs in forward time using Wright-Fisher drift . While this method can model complex demography, it does not allow for conditioning on the desired allele frequency of the favored allele. It also requires the simulation to track each recombinant haplotype in each subpopulation, and as such is computationally intensive even for relatively small genomic regions.
Because Wright-Fisher drift is a Markov process, the importance weight depends only on the allele frequency in the final generation. In contrast, Slatkin's method employs a backward process that is only a rough approximation to Wright-Fisher drift, so the sampling weight must be calculated over the entire history of the two alleles with a separate term for each population and for each potential migration path in each generation .
Because the allele trajectory is generated from Wright-Fisher forward simulation, this method can seamlessly model complex demographic scenarios that include bottlenecks, expansions, and population subdivision with migration. The biggest downside to this flexibility is the potential for choosing parameter values that rarely result in population allele frequencies that are near the observed frequency in the sample. This concern must be evaluated when choosing parameter values, as some will require a prohibitive number of forward simulations to cover the sample space. However, all of the backward time methods are approximations to a forward Wright-Fisher process, and are meant to model natural processes that clearly occur in forward time, so this method is adequate for exploring most relevant models of positive selection and demography. For models where the sample allele frequency is particularly unlikely, Slatkin's method will be preferable since it involves a backward process conditioned on the sample .
For the results presented here, we set s and t to fixed values, though in principle they could be set to random variates in each forward iteration, reflecting uncertainty around estimates of selection strength and allele age. If t is a random variable, each origin generation-subpopulation must be weighted by its respective population size to reflect the probability that a new mutation originates in that generation .
We assumed all recombination events were crossovers, where a crossover occurs with the favored or neutral allele with probability proportional to the frequency of the alleles in the subpopulation [20, 21]. For models with variable recombination rates, we followed Przeworski et al. in setting the recombination rate to an exponential random variate in each simulation, with mean equal to the rate of recombination in our constant models, 1 cM/Mb. We adopted the implementation details of the coalescent process from , storing each generation in a lookup table indexed by the cumulative hazard of coalescence. To account for population subdivision, we introduced a subpopulation dimension to the coalescence table.
The trajectory of the favored allele was generated under a model where the migration rates are constant between subpopulations for each epoch. However, since a trajectory is in part a realization of this random process, we could not assume constant migration rates in a coalescent simulation based on a particular trajectory. The number of individuals of each genotype migrating to and from each population in a given generation is determined by the forward simulation and is therefore treated as a constant during the backward simulation. The individual migrants themselves are, however, chosen at random during the backward simulation. To implement this process, we introduced two migration lookup tables. The first table was analogous to the coalescence lookup table, storing the cumulative hazard of migration out of a given subpopulation for each allele. We used the second table to determine the destination subpopulation of a migrant, by storing the conditional probability of migrating from an origin subpopulation to a destination subpopulation given that a migration event occurred out of the origin subpopulation in a particular generation. Expanding on Coop and Griffiths' method, we accessed the coalescence and migration lookup tables with uniform random variates to generate the waiting time until the next event for each subpopulation-allele combination . We then generated the waiting time until the next recombination event from an exponential random variate. From the memoryless property of the exponential distribution, the next event to occur is the event with the shortest waiting time [17, 20].
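The event competition that closes this step can be illustrated with a minimal sketch. The dict-of-rates interface is our own assumption, not the authors' lookup-table data structure: by the memoryless property, drawing an independent exponential waiting time for each event type and taking the minimum is equivalent to executing the competing processes jointly.

```python
import random

def next_event(rates):
    """Given a mapping of event name -> instantaneous rate, draw an
    exponential waiting time for each competing event and return the
    earliest event together with its waiting time."""
    waits = {ev: random.expovariate(r) for ev, r in rates.items() if r > 0}
    winner = min(waits, key=waits.get)
    return winner, waits[winner]
```

A usage sketch: `next_event({"coalescence": 5.0, "migration": 0.1, "recombination": 1.0})` advances the simulation by the returned time and applies the winning event.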
To introduce ascertainment bias to the simulated data, we developed a procedure to model the process in the Perlegen dataset. In their SNP discovery process, they identified all polymorphic sites in a fully sequenced subsample, then genotyped those sites in a larger sample . To replicate their procedure on simulated data, we randomly assigned mutation events to the tree under an infinite sites model using a mutation rate of 2.2E-8 per nucleotide per generation . We then designated 13-33% of the chromosomes in each simulation as the ascertainment subsample . We excluded all mutations that were not polymorphic in the subsample. To generate diploid genotypes for calculating FRC, we grouped the simulated chromosomes into randomly chosen pairs.
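The two-stage discovery design can be sketched as follows. The function name, the default subsample fraction, and the 0/1 haplotype layout are our own illustrative assumptions, not the Perlegen or authors' code.

```python
import random

def ascertain(haplotypes, subsample_frac=0.25, seed=None):
    """Keep only sites polymorphic within a randomly chosen discovery
    subsample, mimicking a two-stage SNP discovery design.

    `haplotypes` is a list of equal-length 0/1 lists, one per chromosome.
    Returns the filtered haplotypes and the indices of retained sites."""
    rng = random.Random(seed)
    n = len(haplotypes)
    panel = rng.sample(range(n), max(2, int(n * subsample_frac)))
    n_sites = len(haplotypes[0])
    # a site survives ascertainment only if the panel carries both alleles
    keep = [j for j in range(n_sites)
            if len({haplotypes[i][j] for i in panel}) > 1]
    return [[h[j] for j in keep] for h in haplotypes], keep
```

Sites that are rare in the full sample are disproportionately monomorphic in the panel, which is the ascertainment bias the procedure is meant to reproduce.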
EHH is defined as the probability that two chromosomes in a sample share the same haplotype for a given set of SNPs . Sabeti et al. introduced the long range haplotype test (LRH), which is calculated by dividing EHH on a core haplotype by EHH among all samples not containing the core haplotype. Since explicit guidelines for identifying the core haplotype were not available, we tested two criteria. Our core haplotype region was either a fixed 15 Kb region surrounding the focal site, or the first 8 SNPs nearest the focal site with minor allele frequencies greater than 0.05, including the focal site. The core haplotype was then the largest haplotype in the core haplotype region. We based these criteria on the size of the G6PD region and the simulation methodology employed in . We calculated LRH for both methods at the furthest distance where the EHH at the core haplotype was greater than either 0.25 or 0.05. In our tests, the 15 Kb regions with an EHH cutoff of 0.25 had the highest average power of the options we considered, so we report only those results. For both LRH and iHS, we measure EHH from the expected homozygosity given the allele frequencies of each haplotype rather than from the observed homozygosity.
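The expected-homozygosity version of EHH can be computed directly from haplotype frequencies, as in this sketch (the function name is hypothetical, and any finite-sample correction used in the original studies is omitted):

```python
from collections import Counter

def ehh(haplotypes):
    """EHH as expected homozygosity: the probability that two randomly
    drawn chromosomes carry the same haplotype, computed from the
    haplotype frequencies as the sum of squared frequencies."""
    n = len(haplotypes)
    counts = Counter(tuple(h) for h in haplotypes)
    return sum((c / n) ** 2 for c in counts.values())
```

When every chromosome carries the same haplotype, EHH is 1; it decays toward the reciprocal of the number of distinct haplotypes as the window extends and recombination breaks haplotypes apart.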
Voight et al. introduced the integrated EHH (iHH), which is the integral of the observed decay of EHH away from a particular allele, summing over both directions until EHH is less than 0.05. To obtain their single-site statistic, iHS, they divide the value of iHH at the ancestral allele by the value at the derived allele and then take the natural log. Finally, they standardize iHS by subtracting the expectation and dividing by the standard deviation, both conditioned on the frequency of the derived allele. This final step accounts for the frequency of the allele, since low frequency derived alleles are younger and as such will be associated with longer LD blocks.
iHH = ∫ EHH(X) dX, where X is the distance from the focal site.
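The calculation just described can be sketched numerically; this is our own illustration assuming a trapezoid-rule integration of an already-computed EHH decay curve in one direction, not the Voight et al. code, and it omits the frequency-bin standardization step.

```python
import math

def ihh(distances, ehh_values, cutoff=0.05):
    """Trapezoid-rule integral of an EHH decay curve away from the
    focal site, truncated where EHH drops below `cutoff`.
    `distances` are distances from the focal site, ordered outward,
    and `ehh_values` the EHH measured at each distance."""
    total = 0.0
    for i in range(1, len(distances)):
        if ehh_values[i] < cutoff:
            break
        total += 0.5 * (ehh_values[i - 1] + ehh_values[i]) \
                     * (distances[i] - distances[i - 1])
    return total

def unstandardized_ihs(ihh_ancestral, ihh_derived):
    """ln(iHH_A / iHH_D); in the full test this value is then
    standardized within derived-allele-frequency bins."""
    return math.log(ihh_ancestral / ihh_derived)
```

A slowly decaying EHH curve (long haplotypes around a recently swept allele) yields a large iHH, so an unusually negative unstandardized score indicates extended homozygosity on the derived background.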
For a given allele at a focal site, Wang et al. calculate FRC separately for each site within 500 Kb of the focal site with a minor allele frequency greater than 0.1 . They then input each array of FRC statistics into a pseudo-likelihood function to measure the goodness-of-fit to the G6PD model under the assumption that FRC values are normally distributed. This likelihood is adjusted for allele age, as described below.
LnLH = -(1/N) Σ (Yi - F(Xi))² / (2σ²), with g = Y - F(X). Here, Yi is the FRC at site i, Xi is the distance from site i to the focal site, F is the expected value of FRC as a function of the distance from the focal site, N is the number of sites, and σ² is the variance of g over the entire empirical distribution.
They calculate ALnLH for each allele at each site with a homozygote minor allele frequency of greater than 0.05. From the empirical distribution, they determine the average and standard deviation of ALnLH scores. Candidates for positive selection are those SNPs where one allele has an ALnLH score of 2.6 SD above the mean while the other allele has a score of less than 1 SD above the mean. In their 2006 study, these criteria included the top 1.6% of the empirical distribution . We determined the details of this algorithm from source code provided by Eric Wang (personal communication).
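As a rough sketch of the outlier criterion just described (the data layout, function name, and use of a pooled empirical mean and standard deviation are our own assumptions about the algorithm):

```python
import statistics

def alnlh_candidates(scores_by_site, hi=2.6, lo=1.0):
    """Flag sites where one allele's ALnLH score is more than `hi`
    standard deviations above the empirical mean while the other
    allele's score stays below `lo` standard deviations above it.
    `scores_by_site` maps site -> (allele1_score, allele2_score)."""
    flat = [s for pair in scores_by_site.values() for s in pair]
    mean, sd = statistics.mean(flat), statistics.stdev(flat)
    z = lambda s: (s - mean) / sd
    return [site for site, (a, b) in scores_by_site.items()
            if (z(a) > hi and z(b) < lo) or (z(b) > hi and z(a) < lo)]
```

Requiring the two alleles at a site to fall on opposite sides of the thresholds is what distinguishes a selected haplotype background from a site where both alleles sit on unusually long haplotypes.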
We would like to thank Jon Seger for his help in designing the coalescent simulation algorithm we present here. We would also like to thank Eric Wang for providing the source code used to calculate ALnLH. This work was supported in part by the Primary Children's Medical Center Foundation and the National Institute of Diabetes and Digestive and Kidney Diseases (DK069513).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.