How do SNP ascertainment schemes and population demographics affect inferences about population history?
© McTavish and Hillis; licensee BioMed Central. 2015
Received: 12 December 2014
Accepted: 17 March 2015
Published: 3 April 2015
The selection of variable sites for inclusion in genomic analyses can influence results, especially when exemplar populations are used to determine polymorphic sites. We tested the impact of ascertainment bias on the inference of population genetic parameters using empirical and simulated data representing the three major continental groups of cattle: European, African, and Indian. We simulated data under three demographic models. Each simulated data set was subjected to three ascertainment schemes: (I) random selection; (II) geographically biased selection; and (III) selection biased toward loci polymorphic in multiple groups. Empirical data comprised samples of 25 individuals representing each continental group. These cattle were genotyped for 47,506 loci from the bovine 50 K SNP panel. We compared the inference of population histories for the empirical and simulated data sets across different ascertainment conditions using F ST and principal components analysis (PCA).
Bias toward shared polymorphism across continental groups is apparent in the empirical SNP data. Bias toward uneven levels of within-group polymorphism decreases estimates of F ST between groups. Subpopulation-biased selection of SNPs changes the weighting of principal component axes and can affect inferences about proportions of admixture and population histories using PCA. PCA-based inferences of population relationships are largely congruent across types of ascertainment bias, even when ascertainment bias is strong.
Analyses of ascertainment bias in genomic data have largely been conducted on human data. As genomic analyses are being applied to non-model organisms, and across taxa with deeper divergences, care must be taken to consider the potential for bias in ascertainment of variation to affect inferences. Estimates of F ST , time of separation, and population divergence as estimated by principal components analysis can be misleading if this bias is not taken into account.
Keywords: Bos taurus; Bos indicus; Gene-flow; Migration; SNP chip
Next-generation sequencing has made genomic sequence data available even in many non-model organisms. Broader analysis of genetic variation across many individuals or populations within species typically relies on methods that subsample variable sites within genomes. One of the most efficient and widely used approaches for comparing genomic variation within species uses single nucleotide polymorphism (SNP) panels [1,2]. SNP panel methods rely on deeply sequencing a subset of the population of interest and then using this information to select polymorphic loci for additional genotyping in a much larger pool of individuals, often using chip-based genotyping. However, a bias present in the initial selection of markers may affect inferences about the larger population. In this study, we investigated the effects of this selection bias on inferences of demographic history using an empirical example from cattle.
Standardizing SNP panels, as was done for the Human HapMap project, makes it straightforward for research groups to combine data and address a broad array of biological questions. For example, SNP-panel analyses have been used extensively for disease research (reviewed in ). Commercial direct-to-consumer applications of SNP-panel genotyping allow individuals to trace their ancestry and test for disease-associated SNPs. Novembre et al. used SNP loci genotyped for the POPRES project to analyze the genetic spatial structure of human populations in Europe. Chip-based SNP sequencing is also available for several plants and animals of scientific or agricultural importance, including dogs, mice, cattle, chickens, horses, pigs, sheep, and corn [http://www.neogen.com/geneseek/SNP_Illumina.html]. Chip-based SNP analyses have been used to resolve evolutionary relationships in extinct ruminants, and to understand global patterns of population structure in cattle and dogs [9-11]. SNP sets are also being developed for conservation applications and have been used to test for hybridization between common and endangered species (e.g. [13-15]).
To discover variable SNP loci for inclusion in a SNP panel, a sample of individuals representing the taxon of interest is sequenced. This sample of individuals is called the “ascertainment group.” The ascertainment group’s size and composition are determined by the developers of the panel, and typically depend on the aims of the study at hand. A set of SNPs is then selected from the resequencing data of the ascertainment group. The selection of individuals used for the ascertainment group can bias which SNPs are discovered and included in later genotyping analyses.
Ascertainment bias is of course not unique to SNP analyses. For example, in morphological analyses, variable traits are often preferentially selected over fixed traits for analysis. Furthermore, in microsatellite or gene sequencing studies, genes are often chosen for sequencing based on their levels of variability within a group of interest . Arnold et al.  recently demonstrated that RAD sequencing introduces genealogical biases due to nonrandom haplotype sampling. All of these forms of ascertainment bias influence the variability of the sampled data relative to the expectations for data sampled at random from the genome.
There are two main forms of ascertainment bias associated with SNP-panel analyses: minor allele frequency (MAF) bias and subpopulation bias. MAF bias results in the over-representation of polymorphisms with high minor allele frequencies and the under-representation of polymorphisms with low minor allele frequencies. The number of individuals in the ascertainment group will influence the lower frequency limits of SNPs included on the SNP panel. Mutations that are less common than 1/n, where n is the number of alleles in the panel, are unlikely to be observed in the ascertainment group. Much research has been devoted to describing and mitigating the impacts of minor allele frequency cut-offs in the generation of SNP panels [18-21].
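The 1/n threshold described above can be made concrete with a short calculation. The sketch below assumes that the n alleles in the ascertainment group are independent draws from the population (an idealization that is not part of the original text) and computes the probability that both alleles of a biallelic site are observed, so that the site is discovered as a SNP. The function name is hypothetical.

```python
def prob_discovered(maf: float, n_alleles: int) -> float:
    """Probability that a biallelic site appears polymorphic in the sample.

    Assumes n_alleles independent draws from a population in which the
    minor allele has frequency maf (an illustrative simplification).
    The site is missed if every draw is the major allele or every draw
    is the minor allele.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("maf must lie between 0 and 0.5")
    return 1.0 - (1.0 - maf) ** n_alleles - maf ** n_alleles


if __name__ == "__main__":
    # Compare rare and common variants for two ascertainment-group sizes.
    for n in (10, 50):
        for maf in (0.01, 0.05, 0.25):
            print(f"n={n:3d}  MAF={maf:.2f}  P(discovered)={prob_discovered(maf, n):.3f}")
```

Under this simplification, variants with frequencies well below 1/n are rarely discovered, whereas common variants are discovered almost regardless of the size of the ascertainment group.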
In this study we addressed the issue of subpopulation bias in ascertainment. This bias arises from the selection of individuals to include in an ascertainment panel. If the panel is chosen from individuals from a subpopulation or geographic region, variability in that group will be over-represented [22,23]. Wang and Nielsen addressed phylogenetic aspects of ascertainment bias in an outgroup of the taxon of interest. Excoffier et al. developed a simulation-based framework, fastsimcoal2, which can accurately infer demographic parameters for even very complex models under known ascertainment schemes (such as markers heterozygous in a single individual). Subpopulation bias in the composition of the group used to select variable markers can also affect inferences using those markers. For example, microsatellite repeat loci are consistently longer in the species in which they are discovered than in other species in which they are amplified. Subpopulation ascertainment can inflate heterozygosity and apparent diversity in populations closely related to the ascertainment group [20,21,27-30]. Using simulated and empirical data for 30 restriction-site polymorphism markers, Eller demonstrated that ascertainment-group bias can artificially inflate within-group estimates of diversity, especially when real heterozygosity is low. The effects of subpopulation bias in genomic data need further exploration, particularly as they affect studies of non-humans. The bulk of these analyses of SNP ascertainment bias have been performed on human data [20,24,25,27-31], where among-population divergences are necessarily limited. As genomic analyses expand into non-model organisms, it is essential to investigate these issues across broader time-scales and in other organisms.
This study examines the impact of subpopulation ascertainment bias on population demographic inference using F ST values and principal components analysis (PCA). F ST is a frequently used summary measure of differentiation between groups. PCA is a statistical method for reducing the dimensionality of data that can be used for inferring population structure from genetic data (e.g. [33,34]). The first two principal component (PC) axes of human SNP data are correlated strongly with spatial coordinates. PCA has been widely applied to inferring spatial genetic structure using SNP data in humans (e.g., [35,36]) as well as in other species (e.g., cattle and dogs). McVean described a genealogical interpretation of the principal component axes for SNP data, where the first PC axis is expected to capture the deepest coalescent split in a tree. In addition, relative PC components can be used to infer admixture between ancestral populations.
To test the effects of subpopulation-biased ascertainment on inference of population histories, we simulated data based on demographic models of cattle evolution [38,39]. Domesticated cattle comprise lineages derived from two independent domestication events: the taurine and indicine lineages. Indicine cattle are common in the Indian subcontinent and taurine cattle are common in Europe; an African taurine lineage as well as indicine cattle and hybrid lineages exist in Africa. Taurine and indicine cattle likely shared a most recent common ancestor 200,000 or more years ago (84–219 thousand years ago [kya]: ; 260–300 kya: ; 335 kya: ; 200 kya–1 mya: ). The divergence between African and European taurine cattle is much more recent (9–15 kya: ; 10–15 kya: ; 12.5 kya: ). This range of dates is consistent with either a single domestication of taurine cattle or an independent African domestication event. This divergence represents the major population structuring within taurine cattle. In addition, there is a several-thousand-year history of admixture between taurine and indicine lineages in Africa.
We compared data simulated under three demographic models to empirical data for samples of European, African and Indian cattle collected using a 50 K-marker bovine SNP chip. The 50 K SNP panel was generated by a complex ascertainment scheme including taurine, indicine, and hybrid African breeds, but it is biased toward capturing polymorphisms that segregate in European breeds, as well as polymorphisms that are shared between taurine and indicine cattle. It under-represents sites that are fixed differences between taurine and indicine lineages, or are polymorphic only in indicine cattle. Markers were required to have an average minor allele frequency (MAF) of at least 0.15 among common cattle breeds, including both taurine and indicine cattle.
Cattle are a useful system to investigate the effects of ascertainment bias because there exist well-parameterized demographic models based on sequence data that allow us to simulate large unbiased data sets. In addition, domesticated cattle comprise groups (the taurine and indicine lineages) with deep divergences between them. Therefore, cattle represent a good system to explore the effects of capturing SNP loci across subspecies or species boundaries.
The term “SNP” is commonly used to mean “variable site” across samples irrespective of whether a given SNP is polymorphic within a population. Although Wakeley et al.  coined the more accurate term “SNP-discovered locus” (SDL) to describe these single nucleotide differences that may or may not be segregating within sampled groups, this terminology is not widely used. Here, we use SNP in the broad sense of “variable site.”
Our empirical data set consisted of a subset of the cattle SNP data described in McTavish et al. . We used genotypes for 25 individuals from each of three breeds representative of the three major geographic clusters of cattle: Indian (Gir), African (N’Dama), and European (Shorthorn). The African (N’Dama) samples are from a group with largely African taurine ancestry, but have some indicine introgression . We included all 25 Gir samples from the published data set. The 25 Shorthorn individuals included were a random subset of the total set of Shorthorn samples (n = 99). The 25 N’Dama individuals included were a random subset of the N’Dama samples excluding 13 individuals estimated to have admixed ancestry within the last 100 years (; n = 46). The loci examined consisted of 47,506 SNPs genotyped using the bovine 50 K SNP chip . This subset of markers was selected by removing loci that had >10% missing data across a larger sample of 1,420 cattle . There were no ambiguous or absent base calls in the analyzed SNP data matrix, as the larger data set had been filtered and missing data imputed as described in McTavish et al. .
Table 1. Parameter values for the three demographic models simulated, shown in Figure 1
Ancestral population sizes (Na = Nt = Ni)
Current European taurine population size
Current African taurine population size
Current indicine population size
Time of African–European divergence: 15 kya (3,000 generations)
Timing of bottleneck in taurine cattle
Size of bottleneck in taurine cattle: 150 (0.01 × Na)
Time of indicine–taurine divergence: 280 kya (56,000 generations)
Number of migrants from indicine to taurine lineages per generation (prior to European–African split 15 kya) (Murray et al. 2010)
Number of migrants from taurine to indicine lineages per generation (prior to European–African split 15 kya) (Murray et al. 2010)
Number of migrants from indicine lineages into Africa per generation for the past 15 kya
We simulated data with this demographic model under three different migration conditions (full parameters in Table 1, Additional file 1: Table S1): (a) no migration; (b) low levels of asymmetric gene flow (migration) as estimated from nuclear sequence data in  between indicine and taurine lineages equivalent to indicine to taurine gene flow of 1 migrant every 4.6 generations (m i→t), and lower taurine to indicine gene flow of 1 migrant every 80 generations (m t→i); and (c) migration as described in b plus moderate levels of gene flow equivalent to 2 individuals per generation from indicine lineages into the African taurine population from 15 kya to present (m i→A).
We simulated demographic histories using the software ms. The ms program is a backwards-in-time coalescent simulator that generates samples according to a Wright–Fisher neutral model. We used ms to generate both gene trees and samples of variable sites for each migration scenario. To match our simulated data to the empirically generated data set, we simulated samples of 50 haplotypes at 47,506 variable loci for each of the groups of European, Indian, and African cattle. We paired consecutive haplotypes to create diploid genotypes. The software ms scales mutation by θ = 4N₀μ, where N₀ is the diploid population size and μ is the neutral mutation rate for the locus. As we were interested only in variable sites, we used a high neutral mutation rate (3 × 10⁻⁶) and included only sites at which a mutation had occurred. All markers were variable with respect to the 150 simulated haplotypes. We did not use a within-group minor allele frequency cutoff. Each simulated locus was independent and unlinked from all others. The infinite sites assumption of the ms model prevents multiple mutations at the same site from occurring. The commands we used are listed in the supplemental information (Additional file 1: Table S1). We replicated the simulations five times.
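The pairing of consecutive haplotypes into diploid genotypes can be illustrated as follows. This is a minimal sketch rather than the code deposited with the study; it assumes each haplotype is a string of "0" (ancestral) and "1" (derived) characters, one per site, and the function name is hypothetical.

```python
def pair_haplotypes(haplotypes):
    """Combine consecutive haplotypes into diploid genotypes.

    haplotypes: list of equal-length strings of '0' and '1'.
    Returns one list per individual holding the count of derived
    alleles (0, 1, or 2) at each site.
    """
    if len(haplotypes) % 2 != 0:
        raise ValueError("an even number of haplotypes is required")
    genotypes = []
    for i in range(0, len(haplotypes), 2):
        first, second = haplotypes[i], haplotypes[i + 1]
        if len(first) != len(second):
            raise ValueError("haplotypes must cover the same sites")
        genotypes.append([int(a) + int(b) for a, b in zip(first, second)])
    return genotypes


if __name__ == "__main__":
    # Four toy haplotypes at four sites become two diploid individuals.
    sample = ["0101", "0011", "1110", "1000"]
    for genotype in pair_haplotypes(sample):
        print(genotype)
```

With 50 haplotypes per group, this pairing yields 25 diploid individuals per group, matching the empirical sample sizes.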
We subjected each of these simulated migration conditions to three SNP ascertainment treatments. We selected 1,000 SNPs under each of the following ascertainment schemes: (I) Random: SNPs were selected at random without replacement; (II) Geographically biased: 800 SNPs were selected from loci that were polymorphic in Europe, regardless of polymorphism in other groups, and 200 SNPs were selected randomly; and (III) Polymorphism-biased: 800 SNPs were selected from SNPs that were polymorphic in more than one group, and 200 SNPs were selected randomly. Under this polymorphism-biased scheme, SNPs that were polymorphic in all three groups were four times as likely to be selected as those polymorphic in only two groups.
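The three ascertainment schemes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code deposited with the study: each SNP is represented by a hypothetical tuple of three booleans (polymorphic in Europe, Africa, India); the 200 random SNPs are assumed to be drawn from SNPs not already chosen; and "four times as likely" is interpreted as a sampling weight of 4 versus 1 in weighted sampling without replacement.

```python
import random


def _add_random(chosen, n_snps, n_random, rng):
    """Append n_random SNP indices drawn uniformly from those not yet chosen."""
    taken = set(chosen)
    remaining = [i for i in range(n_snps) if i not in taken]
    return chosen + rng.sample(remaining, n_random)


def random_scheme(n_snps, n_select, rng):
    """Scheme I: uniform sampling without replacement."""
    return rng.sample(range(n_snps), n_select)


def geographic_scheme(poly, n_biased, n_random, rng, group=0):
    """Scheme II: n_biased SNPs polymorphic in one group, plus random SNPs."""
    candidates = [i for i, flags in enumerate(poly) if flags[group]]
    chosen = rng.sample(candidates, n_biased)
    return _add_random(chosen, len(poly), n_random, rng)


def polymorphism_scheme(poly, n_biased, n_random, rng):
    """Scheme III: n_biased SNPs polymorphic in two or more groups.

    SNPs polymorphic in all three groups get weight 4, those polymorphic
    in exactly two groups get weight 1 (an assumed interpretation).
    """
    weights = {}
    for i, flags in enumerate(poly):
        n_groups = sum(flags)
        if n_groups >= 2:
            weights[i] = 4.0 if n_groups == 3 else 1.0
    if len(weights) < n_biased:
        raise ValueError("not enough shared polymorphisms to sample from")
    # Weighted sampling without replacement: keep the largest random keys.
    keys = {i: rng.random() ** (1.0 / w) for i, w in weights.items()}
    chosen = sorted(keys, key=keys.get, reverse=True)[:n_biased]
    return _add_random(chosen, len(poly), n_random, rng)


if __name__ == "__main__":
    rng = random.Random(42)
    # Toy data: 5,000 SNPs with random polymorphism flags per group.
    toy = [tuple(rng.random() < 0.5 for _ in range(3)) for _ in range(5000)]
    for name, picks in (
        ("I", random_scheme(len(toy), 1000, rng)),
        ("II", geographic_scheme(toy, 800, 200, rng)),
        ("III", polymorphism_scheme(toy, 800, 200, rng)),
    ):
        shared = sum(sum(toy[i]) >= 2 for i in picks)
        print(f"scheme {name}: {len(picks)} SNPs, {shared} shared among groups")
```

Passing a seeded `random.Random` instance keeps each subsample reproducible, which is convenient when the same scheme is applied to several replicates.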
The simulation process generated five 47,506-SNP replicates for each of the three demographic scenarios (a, b, and c). For each of the simulated data sets we created 1,000-marker subsamples under each of our three ascertainment schemes (I, II, and III). For the observed data set we created five 1,000-marker random subsamples. This replication allows us to test for statistical significance of results, and to compare variation among samples of the observed data to that within and between the simulated samples. We performed the analyses described below on each of five replicates for the nine migration by ascertainment scheme conditions ([a, b, c] * [I, II, III]), and compared the parameter values and variances to those calculated from five 1,000-SNP random subsamples of the empirical data set.
Population genetic parameters
We calculated the number of polymorphic sites in each continental group (European, African, Indian) in each of the empirical and simulated data sets. We calculated pairwise F ST for all pairs of populations for the subsampled data using Weir and Cockerham’s method implemented in Genepop 4.2. We calculated the mean and standard deviation of the F ST values across the five simulation runs. We tested for differences among and interactions between demographic scenarios and ascertainment schemes for pairwise F ST values using two-way analysis of variance (ANOVA) using the StatsModels package in Python.
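To illustrate why the composition of a SNP set matters for F ST, the sketch below uses a simple heterozygosity-based estimate, (H_T − H_S)/H_T, pooled across loci for two equally weighted populations. This is deliberately not the Weir and Cockerham estimator used in the study, which additionally corrects for sample size; the function name and the toy allele frequencies are illustrative assumptions.

```python
def fst_two_populations(freqs):
    """Multilocus F ST for two populations from allele frequencies.

    freqs: iterable of (p1, p2) pairs, the frequency of one allele at
    each biallelic locus in population 1 and population 2.
    Returns sum(H_T - H_S) / sum(H_T) across loci.
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in freqs:
        # Mean expected heterozygosity within populations.
        h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
        # Expected heterozygosity of the pooled population.
        p_bar = (p1 + p2) / 2
        h_t = 2 * p_bar * (1 - p_bar)
        numerator += h_t - h_s
        denominator += h_t
    if denominator == 0:
        raise ValueError("no locus is variable across the two populations")
    return numerator / denominator


if __name__ == "__main__":
    fixed = (0.0, 1.0)    # fixed difference between the populations
    shared = (0.4, 0.6)   # polymorphism segregating in both populations
    balanced = [fixed] * 5 + [shared] * 5
    enriched = [fixed] * 1 + [shared] * 9   # biased toward shared SNPs
    print("balanced set:", round(fst_two_populations(balanced), 3))
    print("enriched set:", round(fst_two_populations(enriched), 3))
```

Enriching the marker set for loci that segregate within both populations lowers the estimate even though the underlying populations are unchanged.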
Principal components analysis
We performed principal components analysis on each sampled data set using smartpca in the EIGENSTRAT software package. We calculated the average proportion of variation explained by PC1 and PC2 under each condition across the five simulation runs. Analysis of variance (ANOVA) on these values was performed with the stats.f_oneway function in SciPy. Additional PC axes captured within-population variation and were not further explored. We compared the major axes of variation in the PCA and the proportion of variation explained by each PC axis between data sets generated under each of these ascertainment schemes.
To test the goodness of fit of alternative demographic models to our observed data, we calculated the percentage of polymorphisms falling into each of seven categories: (1) segregating only in the European lineage; (2) segregating only in the African lineage; (3) segregating only in the Indian lineage; (4) segregating in the European and African lineages; (5) segregating in the Indian and European lineages; (6) segregating in the Indian and African lineages; and (7) segregating among all three lineages. In each of our five replicate runs we calculated the absolute difference between the empirical percentage observed in each category and the percentage observed in the simulated replicate. We summed these absolute differences across categories to create a quantitative measure of the degree of match. The lower the sum of absolute differences, the closer the fit. We did not perform significance tests on these deviations as we had no null expectations for their values.
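The fit measure described above can be sketched as follows. The representation of each SNP as a tuple of three booleans (segregating in Europe, Africa, India) and the function names are hypothetical. The sketch also assumes that percentages are taken over all SNPs in the data set, so that sites segregating within no group count toward the denominator only; the text does not specify this detail.

```python
CATEGORIES = {
    (True, False, False): "European only",
    (False, True, False): "African only",
    (False, False, True): "Indian only",
    (True, True, False): "European and African",
    (True, False, True): "Indian and European",
    (False, True, True): "Indian and African",
    (True, True, True): "all three lineages",
}


def category_percentages(poly):
    """Percentage of SNPs in each of the seven polymorphism categories."""
    if not poly:
        raise ValueError("at least one SNP is required")
    counts = {label: 0 for label in CATEGORIES.values()}
    for flags in poly:
        label = CATEGORIES.get(tuple(bool(f) for f in flags))
        if label is not None:  # sites segregating in no group are skipped
            counts[label] += 1
    return {label: 100.0 * n / len(poly) for label, n in counts.items()}


def sum_absolute_differences(observed, simulated):
    """Sum over categories of |observed % - simulated %|; lower is closer."""
    return sum(abs(observed[label] - simulated[label]) for label in observed)


if __name__ == "__main__":
    empirical = [(True, False, False), (True, True, True),
                 (False, False, True), (True, True, False)]
    simulated = [(True, False, False)] * 4
    score = sum_absolute_differences(category_percentages(empirical),
                                     category_percentages(simulated))
    print("sum of absolute differences:", score)
```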
To measure goodness of fit for the simulated principal components analyses, we took two approaches. First, we calculated the estimated admixture proportions of the African cattle. Admixture between two population groups for an individual may be estimated using PCA by calculating the relative position along the major PC axis differentiating those groups . Second, we used Procrustes analysis to compare the spatial relationships of PC coordinates across different migration and ascertainment schemes [55,56]. Procrustes analysis applies rotation and scaling to coordinates to minimize the Euclidean distance among individuals across analyses. This provides a metric of differences in the spatial orientation of observed points in two dimensions, and thus allows us to compare patterns across the entire PCA results between analyses. We used the Procrustes function in the R package vegan to perform Procrustes superposition and calculate the residual sums of squares, and performed a test of significance of similarity of coordinates using PROTEST [57,58]. These values were calculated for comparisons of the simulated data sets to the observed data across the five 1,000 SNP replicates.
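The Procrustes comparison described above can be illustrated for two-dimensional PC coordinates with a closed-form solution. The sketch below is not the vegan implementation and omits the PROTEST permutation test. It assumes both configurations list the same individuals in the same order, standardizes each to unit size, and allows rotation and reflection; the residual it returns is analogous to a symmetric Procrustes statistic. The function names are hypothetical.

```python
import math


def _center_and_scale(points):
    """Translate points to their centroid and scale to unit total norm."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    norm = math.sqrt(sum(x * x + y * y for x, y in centered))
    if norm == 0:
        raise ValueError("all points coincide")
    return [(x / norm, y / norm) for x, y in centered]


def procrustes_residual(target, other):
    """Residual sum of squares after optimal superposition in 2D.

    Returns a value between 0 (identical shape) and 1.
    """
    if len(target) != len(other):
        raise ValueError("configurations must have the same number of points")
    a = _center_and_scale(target)
    b = _center_and_scale(other)
    best = 0.0
    for flip in (1.0, -1.0):  # -1 reflects the second configuration
        dot = sum(ax * bx + ay * flip * by for (ax, ay), (bx, by) in zip(a, b))
        cross = sum(ax * flip * by - ay * bx for (ax, ay), (bx, by) in zip(a, b))
        # The best rotation aligns the two sums; its fit is their joint length.
        best = max(best, math.hypot(dot, cross))
    return max(0.0, 1.0 - best ** 2)


if __name__ == "__main__":
    pcs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
    # The same shape rotated by 90 degrees, doubled in size, and shifted.
    moved = [(-2 * y + 5, 2 * x - 1) for x, y in pcs]
    print("same shape:", procrustes_residual(pcs, moved))
    print("different shape:",
          procrustes_residual(pcs, [(0, 0), (1, 0), (2, 0), (3, 0.5)]))
```

Because the residual ignores rotation of the whole configuration, two PCA results can have a small residual even when individual PC1 scores differ between them.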
Distribution of polymorphisms
Table 2. Mean multilocus F ST values (± standard deviation) calculated for each pair of populations
0.16 ± 0.01
0.15 ± 0.01
0.13 ± 0.00
0.79 ± 0.01
0.79 ± 0.01
0.49 ± 0.01
0.65 ± 0.01
0.55 ± 0.01
0.55 ± 0.01
0.15 ± 0.01
0.15 ± 0.00
0.14 ± 0.01
0.66 ± 0.01
0.64 ± 0.01
0.58 ± 0.01
0.68 ± 0.01
0.57 ± 0.01
0.54 ± 0.01
0.22 ± 0.02
0.16 ± 0.01
0.17 ± 0.01
0.68 ± 0.01
0.39 ± 0.01
0.57 ± 0.00
0.44 ± 0.01
0.56 ± 0.01
0.32 ± 0.01
Principal components analysis
The lowest residual sum of squares following Procrustes superposition between the empirical data and simulated data was under the moderate migration (b) and European-polymorphism biased (II) treatment (Additional file 1: Table S7). Therefore, the overall distance between the PCA locations of individuals in the empirical data and those simulated in this treatment was lowest. In all cases, coordinates were significantly more similar across treatments than would be expected by chance (P < 0.0001, based on a randomization test).
Effects of subpopulation ascertainment bias
We found that subpopulation bias in the selection of SNP loci can affect inferences of population history. The type of ascertainment bias affects both the direction and extent of deviation in estimates of both F ST and the population structure revealed by PCA.
As described in Albrechtsen et al. , selection of loci that are polymorphic within populations decreases the estimates of F ST between populations. This decrease in measured F ST suggests lower differentiation between populations than would be estimated from unbiased data. However, subpopulation-biased ascertainment can inflate F ST as well . Multiple studies have shown inflated F ST values calculated from ascertained SNPs compared to whole genome sequence data [20,59]. Across our simulated data sets, we found that F ST values decreased when biases inflated polymorphism in at least one of the compared populations. More problematically, at high biases toward shared polymorphism (III), F ST values varied little across gene flow regimes. These results suggest that ascertainment bias may obscure information about actual population differentiation as estimated by F ST values in empirical SNP data, and limit the ability of researchers to differentiate among demographic scenarios. In addition, F ST values can depend heavily on the level of variation present in a sample, and the frequency of the most frequent allele . Indeed, Jost  argued that F ST was so affected by genetic diversity that it should not be used as a measure of population differentiation, gene flow, or relatedness. Based on our simulation results we do not recommend using F ST to estimate demographic relationships using SNP data.
The effects of ascertainment bias on PCA are more complex. The genealogical interpretation of PCA on SNP data usually assumes that the first principal component (PC) axis captures the deepest coalescent split in the tree, and subsequent axes capture later splits . In all simulated cases this interpretation was correct. However, that relationship should not be challenging to reconstruct. Admixed populations should fall between their two ancestral populations, and the proportion of ancestry inherited from each can be estimated linearly . This interpretation assumes that SNP ascertainment will have a simple and predictable effect on PC projections with little influence on the relative placing of samples, except in the most extreme cases. However, in our analysis, the ascertainment scheme did impact the relative placing of simulated samples in some cases. In particular, the position of the African samples with respect to the PC1 axis was affected by an ascertainment scheme that favored selection of European polymorphisms in demographic scenario (a) (Figure 4). The change in relative PC1 score can be important for population genetic inference, because differences in the PC1 coordinates of the African samples can be interpreted as the difference in their proportion of admixed ancestry [10,37]. In migration scenarios a and c, selection for polymorphism in Europe (II) significantly overestimated indicine ancestry of African cattle in comparison to using randomly selected SNPs (I) (Additional file 1: Table S6). Our Procrustes superposition analyses suggest that this overestimation is due to rotation of the PC axes rather than absolute deviation in the relative centroid distances. These results show that care must be taken in interpreting PCA analyses of SNP data that are biased toward polymorphisms found in only one population.
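The linear admixture estimate discussed above amounts to locating an individual between two reference centroids on PC1. The sketch below is a minimal illustration; the function name, the clamping of the estimate to the interval from 0 to 1, and the toy PC1 scores are assumptions rather than details from the study.

```python
from statistics import mean


def admixture_proportion(pc1_score, pc1_source_a, pc1_source_b):
    """Estimated ancestry fraction from source B along PC1.

    pc1_score: PC1 coordinate of the individual of interest.
    pc1_source_a, pc1_source_b: PC1 coordinates of reference individuals
    from the two putative source populations.
    """
    centroid_a = mean(pc1_source_a)
    centroid_b = mean(pc1_source_b)
    if centroid_a == centroid_b:
        raise ValueError("source populations are not separated on PC1")
    fraction = (pc1_score - centroid_a) / (centroid_b - centroid_a)
    # Clamp estimates that fall outside the two centroids (an assumption).
    return min(1.0, max(0.0, fraction))


if __name__ == "__main__":
    taurine_pc1 = [-1.1, -0.9, -1.0]    # toy European taurine scores
    indicine_pc1 = [1.0, 0.9, 1.1]      # toy indicine scores
    for african_score in (-0.8, -0.6):
        estimate = admixture_proportion(african_score, taurine_pc1, indicine_pc1)
        print(f"PC1 = {african_score:+.1f}  estimated indicine ancestry = {estimate:.2f}")
```

Because the estimate depends only on relative PC1 position, any ascertainment scheme that shifts African samples along PC1 changes the inferred indicine ancestry directly.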
Although variation in ascertainment bias interacted with migration to affect inference of migration based on PC1, this was not reflected in the Procrustes residual sums of squares. The Procrustes metric measures the overall deviations in the relative locations of the samples in the two-dimensional PCA coordinate space. The Procrustes results indicate that differences between ascertainment schemes affect rotation of the points relative to the axes, rather than relative to the other sampled individuals. Therefore, although ascertainment bias can affect the interpretation of PC1 as the deepest coalescent split (as described in ), inference of relationships among populations is less affected by population-based ascertainment bias, and is robust to biases that favor the sampling of polymorphic sites.
Recent analyses of human SNP data have made an effort to select polymorphisms within the population of interest (e.g., ), but subpopulation ascertainment bias is likely to continue to be a concern as panels of variable SNP loci are developed in other species. Our empirical SNP chip data were generated for domesticated cattle, a group for which species relationships are not defined consistently. Some authors treat the taurine and indicine lineages as distinct species (Bos taurus and Bos indicus), whereas others treat them as subspecies (Bos taurus taurus and Bos taurus indicus). Irrespective of the naming conventions, domesticated cattle as a group capture a deep divergence between populations, and are therefore useful for examining the properties of SNP ascertainment bias across wider divergence times than those found in many model organisms. Subsets of SNPs that are informative about population structure within subpopulations may not be informative when applied to larger geographic samples. The effects of bias may be even stronger when SNP panels are applied across even more divergent species, because fewer polymorphisms will be shared among these lineages as differences become fixed through time. Under these conditions, estimates of diversity in lineages closely related to the ascertainment group will be artificially inflated compared to lineages that are distantly related to the ascertainment group. Furthermore, SNPs that have been selected to differentiate between two species may result in misleading inferences about relationships among populations within other species.
As costs of sequencing continue to decrease, it is becoming more feasible to generate whole-genome sequence data, even from non-model organisms. Such data do decrease the effects of ascertainment bias on inference relative to SNP samples . Nonetheless, even in whole genome sequence data, alignment to a divergent reference genome  or removing sites with a high proportion of missing data across taxa can generate ascertainment bias in the analyzed data set .
Application to inference of cattle population history
Murray et al.  estimated the demographic parameters that we used in our simulations, using 37 kb of autosomal DNA sequenced in cattle from Europe, Africa, and the Indian subcontinent. Although these loci were selected based on their variability, this data set lacks the strong ascertainment bias of the SNP data set. The SNP panel captures many sites that are polymorphic in both taurine and indicine cattle. Figure 3 demonstrates that if our demographic simulations are accurate, the 50 K bovine SNP panel data greatly over-represents both European and African polymorphism and shared polymorphism among groups. This SNP panel also underestimates indicine diversity.
Based on inferences from ascertained SNP data, there are remarkably high levels of shared polymorphisms maintained between indicine and taurine lineages across 280 thousand years of divergence. This prevalence of deep coalescence events is particularly surprising given the estimates from mtDNA of extremely narrow bottlenecks associated with domestication. MacEachern et al. found that approximately 10% of all ascertained 50 K SNP chip polymorphisms that segregate in two taurine breeds (Angus and Holstein) also segregate in at least one of Bison, Yak, or Banteng. Matukumalli et al. also found that 1–5% of SNPs in the 50 K panel were polymorphic in other Bos species, and some were variable in multiple outgroup species. Taken together, these results suggest that this SNP panel is capturing sites with unusual evolutionary histories, such as older polymorphisms that have been maintained through selection. Nonetheless, even in autosomal data, shared polymorphisms between taurine and indicine lineages are numerous enough that the best-fit model requires significant gene flow between the lineages, strong balancing selection on segregating sites, very large population sizes, or some combination of these factors [38,68].
By comparing the simulation results with the estimates based on empirical data from cattle, we can assess the effects of different types of ascertainment bias on estimates of population history. Biases toward shared polymorphisms (Table 2: II, III) decreased estimates of F ST by increasing the contribution of shared among-group variation. Our simulated data consistently had lower within-taurine African–European divergence than in observed data. Biased samples in the highest gene flow regime (Table 2: IIc, IIIc) did reflect the observed divergence between African and indicine populations. This result suggests that indicine gene flow into Africa likely occurred at a higher rate than estimated by Murray et al. , although these authors did not explicitly address African taurine cattle.
There are many alternative combinations of demographic processes and ascertainment biases that could produce the patterns we observed in empirical data, and we do not compare among all possibilities. In addition, all simulation conditions reflected less divergence between European and African cattle than was observed in our empirical data, consistent with the reduced F ST values. This suggests that these lineages may have diverged more than 15 kya.
There are several potentially important demographic factors that were not addressed in our simulations or Murray et al.’s  demographic analyses. In both cases, major continental groups were treated as panmictic populations, which is biologically unlikely. Population substructuring within each of these regions could affect inference of demographic parameters in several ways. Within-population structure can bias estimates of population sizes, often resulting in apparent recent population size declines [69-71]. These effects of population structuring can also interact with gene flow and the sampling scheme to cause spurious inference of bottlenecks [72,73]. Although the empirical data used here do include extensive within-population sampling, which should mitigate some of the potential issues caused by overdispersed sampling schemes, overdispersed sampling nonetheless likely affected both our inferences and the demographic model of Murray et al. . New whole-genome approaches for estimating the history of recent population size may contribute better estimates for these parameters in the near future [74,75].
The sample size of ascertainment sets strongly affects the limit of the minor allele frequency that can be captured in a SNP panel. Although we did not directly explore the effects of different sample sizes of subpopulations in our analyses, our ascertainment bias schemes capture the effects of uneven sampling across populations. Biasing selection of sites to those that are polymorphic within a single population is analogous to having larger sample sizes for that subpopulation. In either case, more sites that are polymorphic in the targeted population are included in later analyses.
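The dependence on panel size can be made concrete with a short calculation. As a minimal sketch, assume the discovery panel is a random draw of n chromosomes from a population in which a biallelic site has minor allele frequency p; the site is discovered only if both alleles appear in the panel, which happens with probability 1 − p^n − (1 − p)^n. The function name and the example values below are illustrative and are not taken from the study.

```python
def detection_probability(maf, n_chromosomes):
    """Chance that a random panel of n chromosomes contains both alleles."""
    return 1.0 - maf ** n_chromosomes - (1.0 - maf) ** n_chromosomes

# Compare small and large discovery panels across rare and common variants.
for n in (4, 10, 50):
    probs = ", ".join(
        f"MAF {maf:.2f}: {detection_probability(maf, n):.2f}"
        for maf in (0.01, 0.05, 0.25)
    )
    print(f"n = {n:2d} chromosomes -> {probs}")
```

Small panels miss most rare variants, so a population represented by more chromosomes in the discovery set contributes disproportionately many of its own polymorphisms to the final panel.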
Although issues of ascertainment bias have been addressed extensively in human data, studies of non-model organisms often involve deeper divergences among sampled populations. Our simulation results demonstrate the importance of taking ascertainment bias into account when using SNP data for phylogeographic analysis. Despite the limitations of SNP studies, the strongest signal in our example empirical and simulated data sets for cattle, the differentiation between indicine and taurine cattle, was consistent across treatments and was robust to even strong ascertainment bias. Bias toward polymorphisms found in only a single population affects inferences of population relationships more strongly than does bias toward interpopulational polymorphisms.
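The robustness of the deepest split to biased site selection can be explored with a toy example. The following sketch is an illustration under stated assumptions, not the analysis reported here: it uses a hypothetical three-population drift model (two closely related "taurine" groups and one divergent "indicine" group), unscaled PCA by singular value decomposition, and a hypothetical discovery panel of ten European chromosomes. The drift values, sample sizes, and function names are all assumptions made for illustration.

```python
import numpy as np

def drift(p, fst, rng):
    """Balding-Nichols drift away from parental frequencies `p`."""
    scale = (1.0 - fst) / fst
    p = np.clip(p, 1e-6, 1 - 1e-6)  # keep beta parameters strictly positive
    return rng.beta(p * scale, (1.0 - p) * scale)

def genotypes(p, n_ind, rng):
    """Diploid genotypes (0, 1, or 2 allele copies); shape (n_ind, n_loci)."""
    return rng.binomial(2, p, size=(n_ind, p.size))

def pc1_scores(geno):
    """Scores on the first principal component of a centered genotype matrix."""
    centered = geno - geno.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0]

def pc1_gaps(seed=7, n_loci=3000, n_ind=25):
    """Return (deep, shallow) group separations on PC1 for two site sets."""
    rng = np.random.default_rng(seed)
    root = rng.uniform(0.1, 0.9, size=n_loci)
    taurine = drift(root, 0.3, rng)
    indicine = drift(root, 0.3, rng)
    european = drift(taurine, 0.05, rng)
    african = drift(taurine, 0.05, rng)
    geno = np.vstack([genotypes(p, n_ind, rng)
                      for p in (european, african, indicine)])
    labels = np.repeat([0, 1, 2], n_ind)
    # Hypothetical discovery panel: 10 chromosomes sampled from Europe only.
    panel_counts = rng.binomial(10, european)
    biased = (panel_counts > 0) & (panel_counts < 10)
    gaps = {}
    for name, mask in (("all_sites", np.ones(n_loci, dtype=bool)),
                       ("european_biased", biased)):
        scores = pc1_scores(geno[:, mask])
        means = [scores[labels == k].mean() for k in range(3)]
        deep = abs(means[2] - (means[0] + means[1]) / 2)  # indicine vs taurine
        shallow = abs(means[0] - means[1])                # European vs African
        gaps[name] = (deep, shallow)
    return gaps

gaps = pc1_gaps()
for name, (deep, shallow) in gaps.items():
    print(f"{name:>16s}: indicine-taurine gap = {deep:.1f}, "
          f"European-African gap = {shallow:.1f}")
```

In this toy setting the first axis continues to separate the most divergent group even when sites are chosen from a single population, although the relative spacing of groups along the axes can change with the site set.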
Availability of supporting data
The empirical and simulated data, as well as the Python code used for simulation and analyses, have been deposited in the Dryad repository (datadryad.org; doi:10.5061/dryad.ht0hs).
SNP: Single nucleotide polymorphism, used here in the broad sense to mean variable site
PCA: Principal components analysis
ANOVA: Analysis of variance
We thank Martha Smith, Thomas Juenger, David Cannatella, Randy Linder, Michael Landis, Lacey Knowles, Roz Eggo, Mark Holder, and anonymous reviewers for suggestions that improved the manuscript. We thank Debbie Davis and the Texas Longhorn Cattleman’s Association for genetic samples; and Jerry Taylor, Jared Decker, and Bob Schnabel for advice and assistance.
Funding: Humboldt Postdoctoral Fellowship to EJM; Graduate Program in Ecology, Evolution, and Behavior at the University of Texas at Austin; Texas EcoLabs; Texas Longhorn Cattleman’s Foundation; National Science Foundation BEACON (Cooperative Agreement DBI–0939454). This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1053575.
1. Brito PH, Edwards SV. Multilocus phylogeography and phylogenetics using sequence-based markers. Genetica. 2009;135:439–55.
2. Brumfield RT, Beerli P, Nickerson DA, Edwards SV. The utility of single nucleotide polymorphisms in inferences of population history. Trends Ecol Evol. 2003;18:249–56.
3. Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu F, Yang H, et al. The international HapMap project. Nature. 2003;426:789–96.
4. Manolio TA, Brooks LD, Collins FS. A HapMap harvest of insights into the genetics of common disease. J Clin Invest. 2008;118:1590–605.
5. Ng PC, Murray SS, Levy S, Venter JC. An agenda for personalized medicine. Nature. 2009;461:724–6.
6. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, et al. Genes mirror geography within Europe. Nature. 2008;456:98–101.
7. Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319:1100–4.
8. Decker JE, Pires JC, Conant GC, McKay SD, Heaton MP, Chen K, et al. Resolving the evolution of extant and extinct ruminants with high-throughput phylogenomics. Proc Natl Acad Sci. 2009;106:18644–9.
9. McKay SD, Schnabel RD, Murdoch BM, Matukumalli LK, Aerts J, Coppieters W, et al. An assessment of population structure in eight breeds of cattle using a whole genome SNP panel. BMC Genet. 2008;9:37.
10. McTavish EJ, Decker JE, Schnabel RD, Taylor JF, Hillis DM. New world cattle show ancestry from multiple independent domestication events. Proc Natl Acad Sci. 2013;110:E1398–406.
11. von Holdt BM, Pollinger JP, Lohmueller KE, Han E, Parker HG, Quignon P, et al. Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication. Nature. 2010;464:898–902.
12. Seeb JE, Carvalho G, Hauser L, Naish K, Roberts S, Seeb LW. Single-nucleotide polymorphism (SNP) discovery and applications of SNP genotyping in nonmodel organisms. Mol Ecol Resour. 2011;11:1–8.
13. Finger AJ, Stephens MR, Clipperton NW, May B. Six diagnostic single nucleotide polymorphism markers for detecting introgression between cutthroat and rainbow trouts. Mol Ecol Resour. 2009;9:759–63.
14. Hohenlohe PA, Amish SJ, Catchen JM, Allendorf FW, Luikart G. Next-generation RAD sequencing identifies thousands of SNPs for assessing hybridization between rainbow and westslope cutthroat trout. Mol Ecol Resour. 2011;11:117–22.
15. Schwenke PL, Rhydderch JG, Ford MJ, Marshall AR, Park LK. Forensic identification of endangered Chinook Salmon (Oncorhynchus tshawytscha) using a multilocus SNP assay. Conserv Genet. 2006;7:983–9.
16. Brandström M, Ellegren H. Genome-wide analysis of microsatellite polymorphism in chicken circumventing the ascertainment bias. Genome Res. 2008;18:881–7.
17. Arnold B, Corbett-Detig RB, Hartl D, Bomblies K. RADseq underestimates diversity and introduces genealogical biases due to nonrandom haplotype sampling. Mol Ecol. 2013;22:3179–90.
18. Nielsen R. Population genetic analysis of ascertained SNP data. Hum Genomics. 2004;1:218–24.
19. Clark AG, Hubisz MJ, Bustamante CD, Williamson SH, Nielsen R. Ascertainment bias in studies of human genome-wide polymorphism. Genome Res. 2005;15:1496–502.
20. Albrechtsen A, Nielsen FC, Nielsen R. Ascertainment biases in SNP chips affect measures of population divergence. Mol Biol Evol. 2010;27:2534–47.
21. McGill JR, Walkup EA, Kuhner MK. Correcting coalescent analyses for panel-based SNP ascertainment. Genetics. 2013;193:1185–96.
22. Rosenblum EB, Novembre J. Ascertainment bias in spatially structured populations: a case study in the eastern fence lizard. J Hered. 2007;98:331–6.
23. Heslot N, Rutkoski J, Poland J, Jannink J-L, Sorrells ME. Impact of marker ascertainment bias on genomic selection accuracy and estimates of genetic diversity. PLoS One. 2013;8:e74612.
24. Wang Y, Nielsen R. Estimating population divergence time and phylogeny from single-nucleotide polymorphisms data with outgroup ascertainment bias. Mol Ecol. 2012;21:974–86.
25. Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa VC, Foll M. Robust demographic inference from genomic and SNP data. PLoS Genet. 2013;9:e1003905.
26. Ellegren H, Moore S, Robinson N, Byrne K, Ward W, Sheldon BC. Microsatellite evolution–a reciprocal study of repeat lengths at homologous loci in cattle and sheep. Mol Biol Evol. 1997;14:854–60.
27. Mountain JL, Cavalli-Sforza LL. Inference of human evolution through cladistic analysis of nuclear DNA restriction polymorphisms. Proc Natl Acad Sci. 1994;91:6515–9.
28. Jorde LB, Bamshad MJ, Watkins WS, Zenger R, Fraley AE, Krakowiak PA, et al. Origins and affinities of modern humans: a comparison of mitochondrial and nuclear genetic data. Am J Hum Genet. 1995;57:523–38.
29. Rogers AR, Jorde LB. Ascertainment bias in estimates of average heterozygosity. Am J Hum Genet. 1996;58:1033–41.
30. Eller E. Effects of ascertainment bias on recovering human demographic history. Hum Biol. 2001;73:411–27.
31. Han E, Sinsheimer JS, Novembre J. Characterizing bias in population genetic inferences from low-coverage sequencing data. Mol Biol Evol. 2014;31:723–35.
32. Holsinger KE, Weir BS. Genetics in geographically structured populations: defining, estimating and interpreting FST. Nat Rev Genet. 2009;10:639–50.
33. Cavalli-Sforza LL. Population structure and human evolution. Proc R Soc B Biol Sci. 1966;164:362–79.
34. Jombart T, Pontier D, Dufour AB. Genetic markers in the playground of multivariate analysis. Heredity. 2009;102:330–41.
35. Reich D, Thangaraj K, Patterson N, Price AL, Singh L. Reconstructing Indian population history. Nature. 2009;461:489–94.
36. Bryc K, Auton A, Nelson MR, Oksenberg JR, Hauser SL, Williams S, et al. Genome-wide patterns of population structure and admixture in West Africans and African Americans. Proc Natl Acad Sci. 2010;107:786–91.
37. McVean G. A genealogical interpretation of principal components analysis. PLoS Genet. 2009;5:e1000686.
38. Murray C, Huerta-Sanchez E, Casey F, Bradley DG. Cattle demographic history modelled from autosomal sequence variation. Philos Trans R Soc B Biol Sci. 2010;365:2531–9.
39. Teasdale MD, Bradley DG. The Origins of Cattle. In: Womack JE, editor. Bovine Genomics. Oxford, UK: Wiley-Blackwell; 2012.
40. Ho SY, Larson G, Edwards CJ, Heupink TH, Lakin KE, Holland PW, et al. Correlating Bayesian date estimates with climatic events and domestication using a bovine case study. Biol Lett. 2008;4:370–4.
41. Achilli A, Bonfiglio S, Olivieri A, Malusa A, Pala M, Kashani BH, et al. The multifaceted origin of taurine cattle reflected by the mitochondrial genome. PLoS One. 2009;4:e5753.
42. Loftus RT, MacHugh DE, Bradley DG, Sharp PM, Cunningham P. Evidence for two independent domestications of cattle. Proc Natl Acad Sci. 1994;91:2757–61.
43. Bonfiglio S, Ginja C, De Gaetano A, Achilli A, Olivieri A, Colli L, et al. Origin and spread of Bos taurus: new clues from mitochondrial genomes belonging to haplogroup T1. PLoS One. 2012;7:e38601.
44. Freeman AR, Meghen CM, Machugh DE, Loftus RT, Achukwi MD, Bado A, et al. Admixture and diversity in West African cattle populations. Mol Ecol. 2004;13:3477–87.
45. Matukumalli LK, Lawley CT, Schnabel RD, Taylor JF, Allan MF, Heaton MP, et al. Development and characterization of a high density SNP genotyping assay for cattle. PLoS One. 2009;4:e5350.
46. Wakeley J, Nielsen R, Liu-Cordero SN, Ardlie K. The discovery of single-nucleotide polymorphisms—and inferences about human demographic history. Am J Hum Genet. 2001;69:1332–47.
47. McTavish EJ, Hillis DM. A genomic approach for distinguishing between recent and ancient admixture as applied to cattle. J Hered. 2014;105:445–56.
48. Chikhi L, Goossens B, Treanor A, Bruford MW. Population genetic structure of and inbreeding in an insular cattle breed, the Jersey, and its implications for genetic resource management. Heredity. 2004;92:396–401.
49. Hudson RR. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–8.
50. Weir BS, Cockerham CC. Estimating F-statistics for the analysis of population structure. Evolution. 1984;38:1358–70.
51. Rousset F. genepop’007: a complete re-implementation of the genepop software for Windows and Linux. Mol Ecol Resour. 2008;8:103–6.
52. Seabold S, Perktold J. Statsmodels: Econometric and statistical modeling with python. In: Proceedings of the 9th Python in Science Conference. 2010. p. 57–61.
53. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190.
54. Jones E, Oliphant T, Peterson P. SciPy: open source scientific tools for Python. 2001. http://www.scipy.org/.
55. Wang C, Zöllner S, Rosenberg NA. A quantitative comparison of the similarity between genes and geography in worldwide human populations. PLoS Genet. 2012;8:e1002886.
56. Wang C, Szpiech ZA, Degnan JH, Jakobsson M, Pemberton TJ, Hardy JA, et al. Comparing spatial maps of human population-genetic variation using Procrustes analysis. Stat Appl Genet Mol Biol. 2010;9:1544–6115.
57. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2012. ISBN 3-900051-07-0.
58. Oksanen FJ, Blanchet G, Kindt R, Legendre P, Minchin PR, O’Hara RB, et al. Vegan: community ecology package. 2011. [R package version 2.0-2] http://CRAN.R-project.org/package=vegan.
59. Lachance J, Tishkoff SA. SNP ascertainment bias in population genetic analyses: why it is important, and how to correct it. Bioessays. 2013;35:780–6.
60. Jakobsson M, Edge MD, Rosenberg NA. The relationship between FST and the frequency of the most frequent allele. Genetics. 2013;193:515–28.
61. Jost L. GST and its relatives do not measure differentiation. Mol Ecol. 2008;17:4015–26.
62. Rasmussen M, Li Y, Lindgreen S, Pedersen JS, Albrechtsen A, Moltke I, et al. Ancient human genome sequence of an extinct Palaeo-Eskimo. Nature. 2010;463:757–62.
63. Paschou P, Ziv E, Burchard EG, Choudhry S, Rodriguez-Cintron W, Mahoney MW, et al. PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genet. 2007;3:e160.
64. Bertels F, Silander OK, Pachkov M, Rainey PB, van Nimwegen E. Automated reconstruction of whole-genome phylogenies from short-sequence reads. Mol Biol Evol. 2014;31:1077–88.
65. Huang H, Knowles LL. Unforeseen consequences of excluding missing data from next-generation sequences: simulation study of RAD sequences. Syst Biol. 2014; Advance Access published July 4, 2014, doi:10.1093/sysbio/syu046.
66. Bollongino R, Burger J, Powell A, Mashkour M, Vigne J-D, Thomas MG. Modern taurine cattle descended from small number of Near-Eastern founders. Mol Biol Evol. 2012;29:2101–4.
67. MacEachern S, Hayes B, McEwan J, Goddard M. An examination of positive selection and changing effective population size in Angus and Holstein cattle populations (Bos taurus) using a high density SNP genotyping platform and the contribution of ancient polymorphism to genomic diversity in domestic cattle. BMC Genomics. 2009;10:181.
68. MacEachern S, McEwan J, Goddard M. Phylogenetic reconstruction and the identification of ancient polymorphism in the Bovini tribe (Bovidae, Bovinae). BMC Genomics. 2009;10:177.
69. Wakeley J. Nonequilibrium migration in human history. Genetics. 1999;153:1863–71.
70. Beaumont MA. Adaptation and speciation: what can Fst tell us? Trends Ecol Evol. 2005;20:435–40.
71. Heller R, Chikhi L, Siegismund HR. The confounding effect of population structure on Bayesian skyline plot inferences of demographic history. PLoS One. 2013;8:e62992.
72. Städler T, Haubold B, Merino C, Stephan W, Pfaffelhuber P. The impact of sampling schemes on the site frequency spectrum in nonequilibrium subdivided populations. Genetics. 2009;182:205–16.
73. Chikhi L, Sousa VC, Luisi P, Goossens B, Beaumont MA. The confounding effects of population structure, genetic diversity and the sampling scheme on the detection and quantification of population size changes. Genetics. 2010;186:983–95.
74. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754–60.
75. Sheehan S, Harris K, Song YS. Estimating variable effective population sizes from multiple genomes: a sequentially Markov conditional sampling distribution approach. Genetics. 2013;194:647–62.
76. Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9:90–5.
77. Perez F, Granger BE. IPython: a system for interactive scientific computing. Comput Sci Eng. 2007;9:21–9.
78. Bouckaert RR. DensiTree: making sense of sets of phylogenetic trees. Bioinformatics. 2010;26:1372–3.
79. Schliep KP. Phangorn: phylogenetic analysis in R. Bioinformatics. 2011;27:592–3.
80. Micallef L, Rodgers P. eulerAPE: Drawing Area-Proportional 3-Venn Diagrams Using Ellipses. PLoS One. 2014;9:e101717.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.