Simple models of genomic variation in human SNP density
- Raazesh Sainudiin^{1, 2}Email author,
- Andrew G Clark^{3} and
- Richard T Durrett^{2}
https://doi.org/10.1186/1471-2164-8-146
© Sainudiin et al; licensee BioMed Central Ltd. 2007
Received: 09 November 2006
Accepted: 06 June 2007
Published: 06 June 2007
Abstract
Background
Descriptive hierarchical Poisson models and population-genetic coalescent mixture models are used to describe the observed variation in single-nucleotide polymorphism (SNP) density from samples of size two across the human genome.
Results
Using empirical estimates of recombination rate across the human genome and the observed SNP density distribution, we produce a maximum likelihood estimate of the genomic heterogeneity in the scaled mutation rate θ. Such models produce significantly better fits to the observed SNP density distribution than those that ignore the empirically observed recombinational heterogeneities.
Conclusion
Accounting for mutational and recombinational heterogeneities can allow for empirically sound null distributions in genome scans for "outliers", when the alternative hypotheses include fundamentally historical and unobserved phenomena.
Keywords
Background
Understanding the population-genetic forces behind the observed variation among human genome sequences is vital to deciphering the genetic causes of phenotypic variation among humans. The phenomena that influence the density of human SNPs include (1) variation-introducing events that are empirically observable, such as, point-mutations, recombinations, and activities of various transposable elements that may result from the counteraction of various DNA damage and repair pathways [[1], for e.g.], as well as (2) genealogy-affecting events that are historical and generally unobserved, such as population dynamics, population structure, and natural selection. A biological understanding of the observed genomic variation in SNP density, by means of explicit population-genetic models of coalescence in the presence of recombination and mutation, must incorporate any interplay among the heterogeneities in the above phenomena. Here we strive for an empirically sound understanding of the observed human SNP density, as determined by a genome-wide alignment of two different consensus sequences, by accounting for the empirically observable mutational and recombinational heterogeneities under the simplest model of population history (selectively-neutral, constant-sized, random-mating). The two sequences are the NCBI human genome sequence and the sequence produced by Celera Genomics [2]. Our SNP density data were obtained from first aligning the Celera consensus sequence to the NCBI assembly and then counting the number of SNPs in bins of 100 kb (100,000 base pairs), as was done in section 6 of the above study [2]. Next, we build simple models for the distribution of SNP density from random samples of size 2 from a locus that is 100 kb in length. Our objective is to explain as much of this simple measure of diversity as possible, under empirically sound null hypotheses that include coarse-grained, genome-wide measurements of recombinational variation.
Results
Two approaches toward modeling are taken. The first approach is descriptive and employs hierarchical Poisson models to obtain better fits than the homogeneous Poisson distribution used earlier [2]. Insights gained from the first approach inform the second approach. The second approach is non-descriptive and population-genetic with biologically interpretable parameters. It employs mixtures of SNP densities simulated under the coalescent with different mutation and recombination rates to obtain a better fit to the observed SNP density distribution. This approach introduces heterogeneity into the coalescent-based simulation of SNP density that was shown to produce a poor fit under the assumptions of genome-wide homogeneity and equality of mutation and recombination rates [2]. The simple closed-form expressions used in the paper are elementary results in coalescent theory [3, 4].
Descriptive Hierarchical Poisson Models
Let Λ and T be the parameters in the mass function of a Poisson distribution given by Pr(X = x|ΛT) = e^{-ΛT}(ΛT)^{ x }/x!. The random variables Λ and T are generally proxies for relative mutation rate and the sum of branch lengths of the coalescent trees for all the non-recombining segment(s) of the 100 kb locus, respectively. In other words, T is a proxy for the sum of the branch lengths of the ancestral recombination graph (ARG-size) of our sample of size 2 at a locus that is 100 kb long. The random variable X represents the count of SNPs in contiguous 100 kb intervals from an alignment of two human genomes. In this hierarchical scheme, heterogeneities are modeled by the Gamma and Beta probability density functions (PDF s), G(γ_{1}, γ_{2}) and B (β_{1}, β_{2}), respectively, as described in Methods. We chose the Gamma distribution G(γ_{1}, γ_{2}) to model T for the following reasons. When there is no recombination, the depth of the coalescent tree of two samples is exponentially distributed with rate parameter 1, i.e., G(1,1). And when there are n sites with free recombination in between them, the sum of the n independent and exponentially distributed depths is G(n, 1). Thus, T is only a mathematically convenient proxy for the ARG-size of our sample of size 2, since T ~ G(γ_{1}, γ_{2}) does not explicitly capture the distribution of ARG-size for intermediate levels of intra-locus recombination among sites at our locus. We use the relatively flexible Beta family on [0, 1] to model, Λ which is a proxy for relative mutation rate. The Poisson distribution for SNP density follows from the assumption of the infinitely-many-sites mutational model under selective neutrality, where mutations hit a site at most once according to the product of the total length of the site-specific coalescent tree and the site-specific relative mutation rate. Therefore, such hierarchical Poisson models are merely descriptive, as they are built via mathematically convenient Beta and Gamma distributed random variables Λ and T that act as proxies for the relative mutation rate and the ARG-size, respectively.
Maximum likelihood analysis and comparison of Poisson models
Model | T | Λ | Maximum Likelihood Estimates | ML | AIC* |
---|---|---|---|---|---|
Poisson | λ | 1 | $\widehat{\lambda}$ = 90.2, $\widehat{\mu}$ = 90.2, $\stackrel{\_}{{\sigma}^{2}}$ = 90.2 | -616497 | 861964 |
A | G(γ_{1}, γ_{2}) | 1 | $\stackrel{\_}{{\gamma}_{1}}$ = 2.7, $\stackrel{\_}{{\gamma}_{2}}$ = 32.9, $\widehat{\mu}$ = 90.2, | -186348 | 1670 |
$\stackrel{\_}{{\sigma}^{2}}$ = 3049.7 | |||||
A' | λ | B(β_{1}, β_{2}) | $\widehat{\lambda}$ = 387.6, $\stackrel{\_}{{\beta}_{1}}$ = 2.17, $\stackrel{\_}{{\beta}_{2}}$ = 7.16, | -185869 | 714 |
$\widehat{\mu}$ = 90.1, $\stackrel{\_}{{\sigma}^{2}}$ = 2683.9 | |||||
B | G(γ_{1}, γ_{2}) | B(β_{1}, β_{2}) | $\stackrel{\_}{{\gamma}_{1}}$ = 6.4, $\stackrel{\_}{{\gamma}_{2}}$ = 19.0, $\stackrel{\_}{{\beta}_{1}}$ = 1.3, | -185511 | 0 |
$\stackrel{\_}{{\beta}_{2}}$ = 0.46, $\widehat{\mu}$ = 90.1, $\stackrel{\_}{{\sigma}^{2}}$ = 2538.2 |
Population-Genetic Coalescent Mixture Models
A panmictic, Wright-Fisher, neutral coalescent model with a constant effective population size of 10,000 diploid individuals was assumed to simulate the distribution of the number of segregating sites at a locus of 100 kb evolving under an infinitely-many-sites mutation model using the C program ms [8]. The scaled product of the effective population size (N_{ e }) and the mutation rate per locus per generation (μ) is denoted by θ = 4N_{ e }μ. The recombination rate r is the probability of cross-over per generation between the ends of the locus being simulated and its scaled product with N_{ e }is denoted by ρ = 4N_{ e }r.
In the absence of recombination and with constant mutation rates, the distribution of SNPs is known to have an explicit form. The coalescent tree is identical for every nucleotide site in the locus in any given realization of the coalescent process of two samples. Since the rescaled time to the coalescent event and the mutation event are exponentially distributed with rates 1 and θ, respectively, the probability of a mutation event before the coalescent event is θ/(1 + θ). Thus, the probability of observing x mutations at our locus before the coalescent event is (θ/(1 + θ))^{ x }1/(1 + θ). In other words, the probability of observing x SNPs at a locus when r = 0 is geometrically distributed with parameter 1/(1 + θ).
It is also known that as the recombination rate at our locus approaches infinity, the distribution of SNPs approaches a Poisson distribution with parameter θ. This can be seen from the following argument. High levels of recombination assures that the coalescent tree at each site is independent of those at other sites. Thus, for a locus with n sites, the probability of observing x SNPs is $\left(\begin{array}{c}n\\ x\end{array}\right){\left(\frac{\theta}{n}/\left(1+\frac{\theta}{n}\right)\right)}^{x}{\left(1/\left(1+\frac{\theta}{n}\right)\right)}^{n-x}$. For large loci, this binomial mass function is known to approximate e^{-θ}θ^{ x }/x!, the Poisson mass function, as n → ∞ and $n\frac{\theta}{n}/\left(1+\frac{\theta}{n}\right)\to \theta $.
θ_{ i }∈ Θ = {θ_{1}, ⋯, θ_{304}} = {0.001, 0.01, 0.1, 0.5, 1, 2, 3, 4, 5, ⋯, 298, 299, 300},
Discussion
Another study [13] claimed to have achieved a good fit to single reads with 0, 1, 2, 3, or 4 SNPs, by accounting for mutational heterogeneity and genealogical variability in a different manner. They partitioned the genome into 200 kb bins, and selected a single read from each bin. They calculated the observed GC content of the bin, and from a regression of GC content on nucleotide diversity across the whole genome, they calculated an expected diversity given the local GC content of each bin induced by the exponentially distributed coalescent time for samples of size 2 in the absence of recombination. However, when the full bin size of 100 kb were used [2], the SNP count ranged to more than 100 per bin. Because many neighboring reads have shared genealogies, the magnitude of variability from bin to bin is much greater, and the power to detect this heterogeneity is far greater. Thus, the latter study [2] found that the coalescent in the presence of recombination fit the observed SNP density better than the coalescent without recombination. The model employed in the former study [13] fits without recombination only because the power is so low to detect a departure and because there are correspondingly fewer recombination events expected within single reads vs. 100 kb bins. Using the data of SNP counts in 100 kb bins in this study, we find that the coalescent with heterogeneities in recombination as well as mutation gives substantially better fits than the coalescent with a constant rate of recombination and mutation. We have shown that by invoking heterogeneities in mutation and recombination rates, one can better explain the observed variation in SNP density across two randomly sampled 100 kb segments of human chromosomes. Descriptive fits by means of hierarchical Poisson models, as well as population-genetic fits by means of coalescent mixture models, significantly improved when heterogeneities in recombination as well as mutation rates were accounted for. The coalescent mixture model does not completely fit the data in the most interesting region, namely, the segments with the least SNP density. This is partly due to the filtering strategy used to obtain the data. Since there were considerable gaps in the alignments for several bins, there was an overestimation of bins with 0 SNPs. Thus, these bins were ignored from the analysis. Were low SNP counts from such currently ignored bins made available from a high-resolution alignment, a similar analysis would reveal the poorer fits of the descriptive hierarchical Poisson models employed here, unless they are further generalized to allow for a larger mass at 0 through the zero-inflated class [6], for instance. If one's objective is to produce a descriptive fit to our observed SNP density distribution, then the hierarchical model B is clearly preferable to all the models considered in this study due to its strikingly high likelihood value. However, if one wanted a population-genetic model with biologically interpretable parameters to fit the same data, then the best fitted coalescent mixture model with the Genethon recombination map is preferable.
It is important to bear in mind that the distribution of T will be affected not only by recombination rate but also by population structure and demography. Likewise, the distribution of Λ and T will be affected by the complex interaction between various DNA damage and repair pathways that ultimately lead to various types of mutational and recombinational events [[1], for e.g.]. Moreover, the action of selection will simultaneously affect both the distribution of T and Λ about the selected site(s). However, since only a small percentage of the genome is expected to be affected by recent selective sweeps, the overall SNP density distribution should not be significantly affected by such selective events. Thus, our MSL estimate of the genomic variation in θ, based on the Genethon map, is under the standard neutral coalescent that allows for recombinational and mutational rate heterogeneity across the genome. The true genomic variation in θ can also be affected by several other confounded historical factors including selection, population structure, and demography, besides genomic variation in mutation rate. All these confounded historical factors can be seen as alternative hypotheses to the null hypothesis of our coalescent mixture model for the SNP density distribution, i.e., the standard neutral coalescent with genomic heterogeneity in recombination and mutation rates.
Conclusion
As high resolution data for larger samples become available at a genomic scale, one can use such simulated ML methods (with appropriate sample sizes) to get the null distributions of various test statistics while accounting for heterogeneities in recombination rates (from empirical maps or finer-scaled inferred maps) and mutation rates (from the informative phylogenomic constraints imposed by additional ape genomes). Such empirically observable phenomena should be incorporated into the null hypothesis when more complex models with unobserved historical phenomena, such as population dynamics, population structure, and/or natural selection are tested in humans at the genomic scale. Current scans of the human genome tend to underestimate the costs of ignoring the empirically observable heterogeneities under the null hypothesis.
Methods
The following strategy was used to obtain a simulation-based empirical estimate of the SNP density distribution for each scaled mutation rate
θ_{ i }∈ Θ = {θ_{1}, ⋯, θ_{304}} = {0.001, 0.01, 0.1, 0.5, 1, 2, 3, 4, 5, ⋯, 298, 299, 300},
1. for each θ_{ i }∈ Θ, repeat N times:
(b) simulate the coalescent according to ρ and θ_{ i }[4, 8]
(c) record the number of SNPs
Declarations
Acknowledgements
RS is a Research Fellow of the Royal Commission for the Exhibition of 1851. RS and RTD are partially supported by the National Science Foundation/National Institutes of Health Grant DMS/NIGMS 0201037. RS thanks Arkendra De, Kevin Thornton, and Russell Zaretzki for insightful discussions and Gilean McVean for comments on an earlier draft. Critically constructive comments of two anonymous referees improved the manuscript.
Authors’ Affiliations
References
- Schärer OD: Chemistry and biology of DNA repair. Angew Chem Int Ed. 2003, 42: 2946-2974. 10.1002/anie.200200523.View ArticleGoogle Scholar
- Venter JC: The sequence of the human genome. Science. 2001, 291: 1304-1351. 10.1126/science.1058040.PubMedView ArticleGoogle Scholar
- Kingman JFC: The coalescent. Stochastic Process Appl. 1982, 13: 235-248. 10.1016/0304-4149(82)90011-4.View ArticleGoogle Scholar
- Hudson RR: Properties of a neutral allele model with intra-genic recombination. Theoretical Population Biology. 1983, 23: 183-201. 10.1016/0040-5809(83)90013-8.PubMedView ArticleGoogle Scholar
- Akaike H: A new look at the statistical model identification. IEEE Trans Autom Control. 1974, 19: 716-723. 10.1109/TAC.1974.1100705.View ArticleGoogle Scholar
- Bhöning D: Zero-Inflated Poisson models and C.A.MAN: A tutorial collection of evidence. Biometrical Journal. 1998, 40: 833-843. 10.1002/(SICI)1521-4036(199811)40:7<833::AID-BIMJ833>3.0.CO;2-O.View ArticleGoogle Scholar
- Pakes A, Pollard D: Simulation and the Asymptotics of Optimization Estimators. Econometrica. 1989, 57: 1027-1057. 10.2307/1913622.View ArticleGoogle Scholar
- Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002, 18: 337-338. 10.1093/bioinformatics/18.2.337.PubMedView ArticleGoogle Scholar
- Dib C, Faure S, Fizames C, Samson D, Drouot N, Vignal , Millasseau P, Marc S, Kazan J, Seboun E, Lathrop M, Gyapay G, Morissette J, Weissenbach J: A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature. 1996, 380: 152-154. 10.1038/380152a0.PubMedView ArticleGoogle Scholar
- Broman KW, Murray JC, Sheffeld VC, White RL, Weber JL: Comprehensive human genetic map: individual and sex-specific variation in recombination. Am J Hum Genet. 1998, 63: 861-869. 10.1086/302011.PubMed CentralPubMedView ArticleGoogle Scholar
- Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir SJB, Hallbeck B, Masson G, Shlien A, Palsson ST, Frigge ML, Thorgeirsson TE, Gulcher JR, Stefansson K: A high-resolution recombination map of the human genome. Nature Genetics. 2002, 31: 241-247.PubMedGoogle Scholar
- UCSC Genome Annotation Database. [http://genome.ucsc.edu/goldenPath/gbdDescriptions.html]
- The international SNP map working group: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 2001, 409: 928-933. 10.1038/35057149.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.