The effects of sample size on population genomic analyses – implications for the tests of neutrality
© Subramanian. 2016
Received: 27 October 2015
Accepted: 5 February 2016
Published: 20 February 2016
One of the fundamental measures of molecular genetic variation is the Watterson’s estimator (θ), which is based on the number of segregating sites. The estimation of θ is unbiased only under neutrality and constant population growth. It is well known that the estimation of θ is biased when these assumptions are violated. However, the effects of sample size in modulating the bias was not well appreciated.
We examined this issue in detail based on large-scale exome data and robust simulations. Our investigation revealed that sample size appreciably influences θ estimation and this effect was much higher for constrained genomic regions than that of neutral regions. For instance, θ estimated for synonymous sites using 512 human exomes was 1.9 times higher than that obtained using 16 exomes. However, this difference was 2.5 times for the nonsynonymous sites of the same data. We observed a positive correlation between the rate of increase in θ estimates (with respect to the sample size) and the magnitude of selection pressure. For example, θ estimated for the nonsynonymous sites of highly constrained genes (dN/dS < 0.1) using 512 exomes was 3.6 times higher than that estimated using 16 exomes. In contrast this difference was only 2 times for the less constrained genes (dN/dS > 0.9).
The results of this study reveal the extent of underestimation owing to small sample sizes and thus emphasize the importance of sample size in estimating a number of population genomic parameters. Our results have serious implications for neutrality tests such as Tajima D, Fu-Li D and those based on the McDonald and Kreitman test: Neutrality Index and the fraction of adaptive substitutions. For instance, use of 16 exomes produced 2.4 times higher proportion of adaptive substitutions compared to that obtained using 512 exomes (24 % vs 10 %).
Measuring genetic variation is fundamental in population genetics. Molecular genetic variation (θ) could be measured as the product of mutation rate (μ) and population size (N e ) and the theoretical relationship is θ = 4N e μ (for diploid organisms). Empirically θ could be estimated by the Watterson’s estimator θ w  (or simply θ hereafter), which is based on the number of segregating sites (S) or by the Tajima’s estimator θ π  (π hereafter), which uses the mean pair-wise differences between sequences. The estimation of θ is based on population coalescent theory and its popularity is due to its simplicity. Hence it is widely used in theoretical and empirical population genetic analyses. For instance θ or S is used to model the expected number of mutations, which is fundamental in molecular evolutionary biology . θ is an unbiased estimator when the assumptions such as neutrality and constant population sizes are met . However, θ is downwardly biased for an exponentially growing population. Similarly, estimates of θ are biased for sequences under purifying selection, which results in an excess of low frequency variants. This is because the theoretical relationship (θ = 4N e μ) assumes that all mutations neutral (or observable), which is not true in reality. Therefore this relationship can be written as θ = 4N e μf, where f is the fraction of neutral mutations and f = 1 and f < 1 for neutral and for selectively constrained sites respectively .
Although the factors influencing θ are well known, the effects of sample size in modulating the bias in θ estimation is not clear. In the pre-genomic era the sample sizes used in population genetics and molecular evolutionary analyses were modest. Therefore, the effects of sample size on fundamental parameters were not well appreciated as the magnitude of these effects were not obvious due to the small differences between the sample sizes used in various studies. However, in the post-genomic period the use of thousands of samples compared to the few dozens of the past will indeed make a huge impact on the estimates [6–9]. Recent studies based on several thousand human exomes identified a huge difference in the θ estimates. Nelson et al.  studied this issue and compared the θ estimated using different sample sizes. They reported that θ estimated using the protein-coding genes from 11,000 humans was 4.6 times higher than that estimated using 23 humans (40.4 vs. 8.8). In contrast π estimated for the two datasets were identical (3.96). Similarly, another study using exomes from >5,000 Americans showed a negative correlation between sample size and Tajima D estimates and up to fourfold difference between the estimates obtained using various sample sizes . Since Tajima D is based on the difference between θ and π the fourfold difference observed in this study was due to the difference in the θ estimated for various sample sizes. Similar discrepancies owing to sample size bias were also reported in other organisms such as plants . These studies attributed this phenomenon to the exponential growth of the populations . However, a proper simulation study is needed to confirm this.
Another important factor that is known to bias θ estimation is purifying selection. However, whether sample size will modulate the magnitude of this bias is not known. This is an important issue because every gene (and the genome) consists of regions under selection as well as under neutral evolution and most of the population genetic parameters are estimated for both regions. If sample size differentially influences the θ estimates of neutral regions and selected regions, then the estimates obtained for these regions are not comparable. This has serious implications for the tests of neutrality such as Tajima D , Fu and Li D  and the statistics based on McDonald and Krietman test  namely the Neutrality Index (NI)  and the proportion of adaptive substitutions (α) .
To examine the differential effects of sample size bias in neutral and constrained sites we assembled a large dataset consisting over 1000 exomes [obtained from the 1000 genomes project  and estimated various population genetic parameters. We also examined how and to what extent sample size effects influence the tests of neutrality. Finally, we conducted robust simulations to further elucidate the magnitude of sample size effects on the estimation of θ and to determine the probable cause for this pattern.
Differential effects of sample size on neutral and selected sites
Magnitude of purifying selection and the extent of sample size bias
Sample size effects on Tajima D (D T ) and Fu-Li D (D FL )
Influence of sample size on MK-test based measures
Finally, we quantified the proportion of adaptive nonsynonymous substitutions (α) using different sample sizes. Our result produced a strong negative correlation (P < 0.01) between the sample size and α (Fig. 5b). Interestingly using a small number of exomes suggested that 24 % of the nonsynonymous mutations were fixed by adaptive evolution. However, using a much larger sample size of 512 exomes this number reduced 2.4 fold and only 10 % of the nonsynonymous substitutions were estimated to be under positive selection.
Results from simulation analysis
The results from the sequences simulated under varying levels of selective constraints are shown in Figs. 6b and c for constant and exponential growth models respectively. Figure 6b shows that even under constant population growth conditions the estimation of θ varied with sample sizes. The rate of variation was much higher for sequences under high selective constraints (NeS = 2000) than those under relaxed selective pressures (NeS = 2). The slope of the regression line of the former was 2.7 times higher than that of the latter (0.21 Vs 0.08). However, θ estimated for neutrally evolving sequences did not vary with the sample size as there was no significant correlation between the two variables (P < 0.13). These results suggest that for populations under constant growth, purifying selection alone modulate the sample size bias in estimating θ.
Our simulation results for exponentially growing population shows that the sample size bias in estimating θ is much higher compared to that under constant growth. For instance, the slopes of the regression lines shown in Fig. 6c are much higher (0.17 – 0.34) than the corresponding lines shown in Fig. 6b (0.03 – 0.21). Furthermore, the difference in θ estimates obtained for large (N = 512) and small (N = 16) sample sizes are also much higher for exponentially growing populations than those under constant growth. For instance, this difference was 3.3 times for the highly constrained (NeS = 2000) exponentially growing populations (blue circles-Fig. 6c) and this was only 2.1 times for those under constant growth (blue circles-Fig. 6b). The overall results from our simulation study were qualitatively identical to those observed using the 1000 genome human data.
The role of selection could be explained by comparing the regions under neutrality to those under selective constraints. Since purifying selection prevents deleterious mutations reaching high frequencies, constrained genomic regions are typically abundant in low frequency SNVs . Therefore, large sample sizes are required to properly identify these rare SNVs. Hence the estimation of θ for the constrained regions of exponentially growing populations is much more severely biased by sample sizes because they are modulated by both demographic and selective forces. Since the human population is known to be under exponential growth, the sample size effects on the estimation of θ for neutral synonymous sites are influenced by the demographic factor alone but estimates for the nonsynonymous sites are modulated by both demographic and selective forces. This is clear in the simulation study, which showed that the sample size bias is influenced only by purifying selection in constant populations. Therefore, the extent of sample size bias for the constrained regions of exponentially growing populations was much higher than that observed for the constrained regions of populations under constant growth. Furthermore, humans have a unique demographic history and it is well known that human populations have undergone an explosive population growth, which resulted in much higher fraction of rare deleterious variants [8, 9]. This is evident from the unusually high ratio of θ N/θ S (0.5) observed for large sample sizes (Fig. 1b).
Apart from neutrality and constant growth, the estimation of θ is also based on the assumption that individual sites/mutations in a genome are inherited and evolve independently. However, this is not true in reality as the genomic regions along with the mutations are inherited as large blocks of IBD segments. Therefore, the equation used for θ estimation does not account for this and therefore the results shown in this study might have the influence of this bias.
In this study we have used the data 1000 genome project, which consists of genomes from a number of populations all over the world. Hence this sample composition is from a continuous population, which is a unique and unusual characteristic of this dataset. Therefore, to examine the generality of the patterns observed in this study for specific populations we examined the sample size issue using single populations. For this purpose, we used the subset of 85 exomes belonging to the CEU (Utah American) population and divided the data into two groups, one with small (16 exomes) and another with large (64 exomes) sample sizes (Additional file 1: Figure S1). For neutral genomic regions (synonymous sites) the θ estimate obtained for the large sample size was 9 % higher than that observed for small sample sizes and the difference was highly significant (P < 10−7). In contrast for constrained regions this difference was 26 % (P < 10−7). As expected, π estimates were similar between large and small sample sizes and this was true for neutral (P = 0.48) as well as constrained sites (P = 0.61). Similar results were observed for African (YRI) and Asian (CHB) populations (Additional file 1: Figures S2 and S3). Although we could perform the population specific analyses using only a small number of available exomes the results were highly significant and qualitatively similar to the main results reported in this study.
The results of this study highlight the significance of sample size in estimating some of the fundamental parameters of population genetics. Importantly we showed that for small sample sizes the underestimation of θ is higher for constrained regions than that for neutral regions of the same set of exomes. Hence the different θ estimated for the two regions using same population genomic data are not comparable especially when the sample size is small. Therefore, this bias affects all neutrality tests and the estimates based on them. For instance, Fig. 3 shows that the difference in Tajima’s D estimated for neutral and constrained sites widens with the increase in sample size. In fact, the (close to) true values will only be obtained for very large sample sizes. When sample size is small Tajima’s D of the two types of sites are apparently similar. Hence use of small number of samples in this analysis will produce erroneous results due to severe underestimation of θ for constrained sites. This also is true for the results of Fu-Li D test (Fig. 4).
In the case of MK test based statistics, the proportion of adaptive nonsynonymous substitutions (α), use of large number of samples results in identifying more deleterious (low frequency) nonsynonymous SNVs, which increases P n in eqn 5 and thus the value of α is reduced. In contrast, small sample sizes identify fewer nonsynonymous SNVs, which leads to an overestimation of the proportion of adaptive substitutions. The other measure based on MK-test, the Neutrality Index, is underestimated using a small number of samples due to the failure to precisely identify some of the low frequency nonsynonymous SNVs.
Genomic sequence data and analyses
We obtained the genome data for 1092 humans from GenBank, which was originally generated by the 1000 genome project (phase 1-version 3) . Using the genome annotations, we extracted the single nucleotide variants (SNVs) present in the synonymous and nonsynonymous sites of protein-coding genes and included only the bi-allelic SNVs. We divided the data into six non-overlapping sets consisting of 16, 32, 64, 128, 256 and 512 exomes (or samples). To determine the magnitude of selection on nonsynonymous sites we used the dN/dS ratio computed for the protein-coding genes using the human-chimpanzee pair. For this purpose, we obtained the human-chimpanzee pair-wise genome alignment from the UCSC genome browser data resource (https://genome.ucsc.edu/). Using the exon-intron boundaries provided in the reference gene annotations we extracted the protein-coding transcripts from the human-chimp alignment. Using the gene annotations from Ensembl (http://www.ensembl.org/) we retained the longest transcript for each gene. For each gene the divergence at synonymous sites (dS) and nonsynonymous sites (dN) were estimated based on the maximum likelihood method employed in the software PAML . While dS (SM/SS) is the number of synonymous substitutions (SM) per synonymous site (SS) in a gene dN (NM/NS) is the number of nonsynonymous substitutions (NM) per nonsynonymous site (NS). In estimating dN or dS, the maximum likelihood method tend to overestimate when the actual divergence is large. To avoid such estimation errors (due to the overcorrection of multiple hits) we excluded the genes for which dN or dS estimate was > 0.8. These filters resulted in 13,454 unique protein-coding genes, which were eventually used for further analysis. The ratio of dN and dS (dN/dS) was used as the proxy for the magnitude of selection pressure on a gene.
We estimated a number of population genetic parameters such as θ, π, Tajima D, Fu-Li D, Neutrality Index and the proportion of adaptive nonsynonymous substitutions using the following equations.
Estimation of θ and π
In this study we used π rather than k, which is the average number of pair-wise nucleotide differences per site.
Tajima D (D T )
This test is based on the difference between the number of segregating sites and average number of pair-wise nucleotide differences. Under neutrality these two measures are expected to be equal. Tajima’s D is given by :
Fu and Li D (D FL )
This is another neutrality test similar to Tajima D T but based on the difference between the total number of mutations and the singleton mutations in a population genealogy. Under neutrality these two numbers are expected to be equal. The D FL of Fu and Li (without outgroup) is given by :
McDonald and Kreitman (MK) test
The popular statistics used in population genetics namely the Neutrality Index (NI) and the proportions of adaptive nonsynonymous substitutions (α) are based on the principles of the MK test.
Neutrality Index (NI)
Proportion of adaptive nonsynonymous substitutions (α)
We conducted an extensive simulation using the program SFS_CODE , which is based on forward-in-time population genetic model. The simulation was performed under constant and exponential growth models. Sequences were also simulated for neutral evolution and purifying selection. A sequence length of 5000 bp, N e = 10,000 and a mutation rate of 1 × 10−8 per site per generation was used for the simulation . We conducted separate simulation runs using sample sizes of 16, 32, 64, 128, 256 and 512. For human population growth we followed the model proposed by Tennessen et al. . This model uses two growth phases, the first one was slow and a second one was exponentially fast. To keep the simulations comparable between constant and exponential growth models we combined the simulation runs and used the parameters suggested by a previous study . In the beginning we simulated a population that first splits into two and both grow at the same rate until they reach N e = 9,210. Then only one population undergoes a large exponential growth phase until it reaches N e = 512,210. The other population undergoes a constant growth phase and thus its number remains at N e = 9,210. For modelling constrained site evolution, we used the scaled selection coefficient γ = 2Ns with γ following a gamma distribution, which has a mean of α/β. We fixed α as 0.206 based on previous studies  and varied β to model various magnitudes of selection ranging between γ = 2 to 2000. We performed 1000 replicates, obtained the estimates θ and π and computed the mean values. Since the simulation conducted here was only to compare the θ estimates from different sample sizes changing any parameter (eg. mutation rate) does not affect the end results.
The author is grateful to David Lambert and acknowledges the support from Australian Research Council (LP150100583) and Environmental Futures Research Institute, Griffith University. The author also thanks Richard O’Rorke for critical comments.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Watterson GA. On the number of segregating sites in genetical models without recombination. Theor Popul Biol. 1975;7(2):256–76. PubMed.View ArticlePubMedGoogle Scholar
- Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123(3):585–95. PubMed PMID: 2513255; PubMed Central PMCID: PMCPMC1203831.Google Scholar
- Sawyer SA, Hartl DL. Population genetics of polymorphism and divergence. Genetics. 1992;132(4):1161–76. PubMed PMID: 1459433; PubMed Central PMCID: PMCPMC1205236.Google Scholar
- Nei M, Kumar S. Molecular Evolution and Phylogenetics. Oxford: Oxford University Press; 2000.Google Scholar
- Henn BM, Botigue LR, Bustamante CD, Clark AG, Gravel S. Estimating the mutation load in human genomes. Nat Rev Genet. 2015;16(6):333–43. doi:10.1038/nrg3931. PubMed.View ArticlePubMedGoogle Scholar
- Al-Khudhair A, Qiu S, Wyse M, Chowdhury S, Cheng X, Bekbolsynov D, et al. Inference of distant genetic relations in humans using "1000 genomes". Genome Biol Evol. 2015;7(2):481–92. doi:10.1093/gbe/evv003. PubMed PMID: 25573959; PubMed Central PMCID: PMCPMC4350174.
- Korneliussen TS, Moltke I, Albrechtsen A, Nielsen R. Calculation of Tajima's D and other neutrality test statistics from low depth next-generation sequencing data. BMC Bioinformatics. 2013;14:289. doi:10.1186/1471-2105-14-289. PubMed PMID: 24088262; PubMed Central PMCID: PMCPMC4015034.
- Nelson MR, Wegmann D, Ehm MG, Kessner D, St Jean P, Verzilli C, et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science. 2012;337(6090):100–4. doi:10.1126/science.1217876. Epub 2012/05/19. PubMed.PubMed CentralView ArticlePubMedGoogle Scholar
- Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, Gravel S, et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337(6090):64–9. doi:10.1126/science.1219240. PubMed PMID: 22604720, PubMed Central PMCID: PMC3708544, Epub 2012/05/19.PubMed CentralView ArticlePubMedGoogle Scholar
- Larsson H, Kallman T, Gyllenstrand N, Lascoux M. Distribution of long-range linkage disequilibrium and Tajima's D values in Scandinavian populations of Norway Spruce (Picea abies). G3 (Bethesda). 2013;3(5):795–806. doi:10.1534/g3.112.005462. PubMed PMID: 23550126; PubMed Central PMCID: PMCPMC3656727.
- Fu YX, Li WH. Statistical tests of neutrality of mutations. Genetics. 1993;133(3):693–709. PubMed PMID: 8454210; PubMed Central PMCID: PMCPMC1205353.Google Scholar
- McDonald JH, Kreitman M. Adaptive protein evolution at the Adh locus in Drosophila. Nature. 1991;351(6328):652–4. doi:10.1038/351652a0. PubMed.View ArticlePubMedGoogle Scholar
- Rand DM, Kann LM. Excess amino acid polymorphism in mitochondrial DNA: contrasts among genes from Drosophila, mice, and humans. Mol Biol Evol. 1996;13(6):735–48. PubMed.View ArticlePubMedGoogle Scholar
- Smith NG, Eyre-Walker A. Adaptive protein evolution in Drosophila. Nature. 2002;415(6875):1022–4. doi:10.1038/4151022a. PubMed.View ArticlePubMedGoogle Scholar
- Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65. doi:10.1038/nature11632. PubMed PMID: 23128226; PubMed Central PMCID: PMCPMC3498066.
- Hernandez RD. A flexible forward simulator for populations subject to selection and demography. Bioinformatics. 2008;24(23):2786–7. doi:10.1093/bioinformatics/btn522. PubMed PMID: 18842601; PubMed Central PMCID: PMCPMC2639268.
- Zhang Q, Tyler-Smith C, Long Q. An extended Tajima's D neutrality test incorporating SNP calling and imputation uncertainties. Stat Interface. 2015;8(4):447–56. doi:10.4310/SII.2015.v8.n4.a4. PubMed PMID: 26681995; PubMed Central PMCID: PMCPMC4678577.
- Subramanian S. The abundance of deleterious polymorphisms in humans. Genetics. 2012;190(4):1579–83. doi:10.1534/genetics.111.137893. PubMed PMID: 22267501; PubMed Central PMCID: PMCPMC3316666.
- Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24(8):1586–91. doi:10.1093/molbev/msm088. PubMed.View ArticlePubMedGoogle Scholar
- Li W-H. Molecular Evolution. Sunderland: Sinauer Associates Inc.; 1997.Google Scholar
- Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010;328(5978):636–9. doi:10.1126/science.1186802. PubMed PMID: 20220176; PubMed Central PMCID: PMCPMC3037280.
- Gazave E, Chang D, Clark AG, Keinan A. Population growth inflates the per-individual number of deleterious mutations and reduces their mean effect. Genetics. 2013;195(3):969–78. doi:10.1534/genetics.113.153973. PubMed PMID: 23979573, PubMed Central PMCID: PMC3813877, Epub 2013/08/28.PubMed CentralView ArticlePubMedGoogle Scholar
- Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller KE, et al. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS genetics. 2008;4(5):e1000083. doi:10.1371/journal.pgen.1000083. PubMed PMID: 18516229, PubMed Central PMCID: PMC2377339, Epub 2008/06/03.PubMed CentralView ArticlePubMedGoogle Scholar