A neutral theory of genome evolution and the frequency distribution of genes
© Haegeman and Weitz; licensee BioMed Central Ltd. 2012
Received: 3 January 2012
Accepted: 21 May 2012
Published: 21 May 2012
Skip to main content
© Haegeman and Weitz; licensee BioMed Central Ltd. 2012
Received: 3 January 2012
Accepted: 21 May 2012
Published: 21 May 2012
The gene composition of bacteria of the same species can differ significantly between isolates. Variability in gene composition can be summarized in terms of gene frequency distributions, in which individual genes are ranked according to the frequency of genomes in which they appear. Empirical gene frequency distributions possess a U-shape, such that there are many rare genes, some genes of intermediate occurrence, and many common genes. It would seem that U-shaped gene frequency distributions can be used to infer the essentiality and/or importance of a gene to a species. Here, we ask: can U-shaped gene frequency distributions, instead, arise generically via neutral processes of genome evolution?
We introduce a neutral model of genome evolution which combines birth-death processes at the organismal level with gene uptake and loss at the genomic level. This model predicts that gene frequency distributions possess a characteristic U-shape even in the absence of selective forces driving genome and population structure. We compare the model predictions to empirical gene frequency distributions from 6 multiply sequenced species of bacterial pathogens. We fit the model with constant population size to data, matching U-shape distributions albeit without matching all quantitative features of the distribution. We find stronger model fits in the case where we consider exponentially growing populations. We also show that two alternative models which contain a "rigid" and "flexible" core component of genomes provide strong fits to gene frequency distributions.
The analysis of neutral models of genome evolution suggests that U-shaped gene frequency distributions provide less information than previously suggested regarding gene essentiality. We discuss the need for additional theory and genomic level information to disentangle the roles of evolutionary mechanisms operating within and amongst individuals in driving the dynamics of gene distributions.
The gene content of genomes of closely related bacteria can differ significantly. For example, pair-wise comparisons of genome sequences from isolates of the same species often do not share a substantial fraction of their gene content [1–10]. When a large number of genomes within a species or closely related group of bacteria are sequenced, the gene content variability can be summarized as a gene frequency distribution: given G sequenced genomes, some genes are found in a fraction 1 ≤ k ≤ G of all genomes. Empirically, such gene frequency distributions possess a characteristic U-shape, such that there are many genes which only appear in one genome, fewer genes which appear in 2 ≤ k ≤ G - 1 genomes, and many genes which appear in all genomes. Genes within each of these three categories have been labeled accessory, character and core genes, respectively . It is tempting to conflate gene frequency with relative essentiality, but is it valid? For example, is it necessarily true that a gene that appears in all genomes in a sample should be classified as a "core" gene? Could such a gene have become common through neutral processes that diminish variability, and occasionally, lead to fixation of types? Likewise, should a gene which appears in only one genome in a sample be considered as "accessory" to the function of that organism? Could such a gene have become rare via a neutral path toward extinction, or have been recently introduced to a lineage without significant effect on individual fitness?
Here, we argue that a suitable null model is necessary with which to deduce how much weight be given to gene frequency data as a means to generate hypotheses regarding essentiality. For example, a recently proposed neutral theory of biogeography and biodiversity has proved useful in clarifying when and how species abundance distributions in ecology can provide information about selective processes in complex communities [12–15]. Hence, in this manuscript we ask: is it possible to recapitulate findings of U-shaped gene frequency distributions in the absence of selective forces driving genomic and population composition? We answer this question in the affirmative by proposing a simple and analytically tractable neutral model of genome evolution that explicitly accounts for gene composition of genomes. In this model, genomes undergo birth-death processes in a neutral sense and also acquire and lose genes which we term "gene transfer". The model differs from most previous efforts to analyze genome evolution [16, 17] by self-consistently treating the dynamics at two scales: population level drift and genomic level change (for an exception, see  whose model we address in the Results and Discussion). Analysis of the current model leads to the following major results.
First, we find that gene frequency distributions derived from this model possess a characteristic U-shape for a robust range of model parameters. Hence, we propose that prevalence of a gene does not necessarily imply its essentiality, and that gene frequency distributions may be more limited than previously acknowledged in generating inferences regarding essentiality. Second, we estimate the best fit parameters for a given empirical gene-frequency distribution of sequenced genomes and in so doing find a reasonable correspondence between our neutral model and data from six distinct bacterial species with sequenced genomes from multiple isolates. However, our model assuming constant population sizes predicts gene frequency distributions with systematically fewer rare genes than the empirical distributions. Hence, we show that assuming other types of population dynamics (such as exponentially growing populations) can change the model predictions in line with empirical data, providing a basis for investigating the role of population dynamics in shaping gene frequency distributions. Further, we extend the model to include a rigid and flexible core in the genome, and show that other assumptions about genome structure are consistent with gene frequency data. Finally, we show that a recently proposed gene diversity index - genomic fluidity  - is a natural parameter emerging in the neutral models of genome evolution described here. Whereas previously this parameter was entirely statistical in nature, we discuss here how genomic fluidity can be seen as a proxy for the relative importance of gene uptake in shaping the gene composition of genomes between species. In so doing, we discuss how other observations could be combined with gene frequency distributions to improve inferences regarding evolutionary mechanisms shaping genome composition.
Individual birth-death events cause genetic drift in the population as the number of organisms having a particular gene fluctuates over time. Genetic drift has the tendency to reduce the genetic diversity in the population. Indeed, when the last organism carrying a particular gene dies, this gene disappears from the population, and has no opportunity of re-entering the population. Gene birth-death events, on the other hand, maintain the genetic diversity in the population. New genes enter at low frequency due to "gene transfer" events. These transfer events may be due to uptake of genes from the environment, insertion of genes via viruses, or conjugation with other individuals. In our model, individual birth-death events and gene transfer events have associated rate parameters: we denote by ρ the rate of reproduction per individual, and by σ the rate of gene transfer per individual. Intuitively we expect that variability in gene composition of genomes should increase when gene transfer rate σ increases relative to the individual reproduction rate ρ, and vice-versa. Further, we expect that gene frequency distributions will tend toward U-shaped distributions because of the tension between gene transfer (which would favor increasing rarity of genes) and individual birth-death (which would favor increasing commonness of genes, due to neutral drift). We evaluate this prediction of U-shaped distributions in the following section.
As the number of genes M in a genome and the number of genomes G in the sample are given by the data, the gene frequencies g k are parametrized by the dimensionless parameter θ which combines the effects of both gene transfer and birth-death processes. Hence, different combinations of N, ρ and σ lead to identical gene frequency distributions. In particular, the predicted distribution is insensitive to accelerating simultaneously gene transfer and reproduction, because the distribution depends on the ratio , and not on ρ and σ individually. Moreover, an increase of the ratio , the relative rate of gene transfer, can be compensated by a smaller population size N, increasing the intensity of genetic drift. Note that although in practice the sample size G is much smaller than the population size N, Eq. (1) is also valid for G = N, i.e., it can be used to compute the (empirically inaccessible) gene frequency distribution of the entire population.
The neutral model of genome evolution can be fit to data using Eq. (1). To do so, we determine the parameter θ that minimizes the distance between the predicted and empirical gene frequency distribution (see Materials and Methods). We find that the neutral model is in reasonable correspondence with data (see Figure 3). First, both data and predictions have a U-shape. Next, model predictions agree well with observations for the total number of core genes. These predictions are made with a single free parameter, θ. On the other hand, the model underpredicts the number of rare genes and overpredicts the number of genes in the intermediate part of the distribution. In particular, the model predicts a g k ~ 1/k dependence for the peak at small k, whereas the data are closer to a steeper g k ~ 1/k 2 dependence. In the next section we show that this deviation can be partially remedied by dropping the assumption of a constant population size. Finally, it is important to note that although our model assumes constant M, we find that there is approximately a 10% difference in total gene content within the genomes of each of the six species.
Predictions for the observed pan and core genome size in a sample of genomes can also be obtained. In the past, the pan genome size has been defined as the number of genes in all genomes of the population. Similarly, the core genome size has been defined as the number of genes found in every genome in the population. However, we and colleagues have previously shown that estimating pan and core genome sizes are unreliable because they depend on observations of rare genes and genomes, respectively, which are difficult to find in samples precisely because they are rare . Here, we define the observed pan and core genome size as the number of unique genes found in all sample genomes and the number of genes common to all sample genomes, respectively. As we derive in Additional file 1: Appendix S1, the model predicts that the pan genome size increases logarithmically with sample size, and that the core genome size decreases as a power law (with exponent -θ). The prediction that gene diversity grows without bound is unsurprising, because we assumed an infinite gene pool (and we caution that gene diversity cannot increase to infinity in reality). Nevertheless, we expect the logarithmic (power-law) dependence of pan (core) genome size also to hold for a finite gene pool (as long as the gene pool is much larger than the set of genes observed in the sample). These findings are corroborated by the data, which exhibit the same qualitative behavior, see Additional file 1: Figure S2. Notice that we used the value for σ obtained from Figure 3 in the pan and core genome fits of Additional file 1: Figure S2, so that these fits have no additional free parameters.
Overview of model fits
The parameter θ 0 is the same as θ for the constant population size model, except that the population size N is replaced by the present population size N 0; again we call θ 0 the gene transfer parameter. The parameter β is a rescaled version of the population growth rate α; we call it the population growth parameter. The constant population size model, which we denote by model A, is a one-parameter model; the variable population size model, which we denote by model B, is a two-parameter model. Hence, we expect a richer set of gene frequency distributions predicted by model B compared to model A.
Additional file 1: Figure S3 shows gene frequency distributions computed for different combinations of the parameters θ 0 and β. For small β ≤ 1 the distributions closely resemble the distributions of the model A with constant population size (α = β = 0, see Figure 2). When increasing the population growth parameter β, the U-shape becomes more pronounced. For example, the peak at small k has a power-law dependence g k ~ k -γ with γ = 1 for small β and γ increasing for increasing β. The predicted distributions are often, apart from the peak for the core genes present in all genomes, almost symmetric (see panels with θ 0 = 0.03 or θ 0 = 0.3). Notice that very similar distributions can be obtained for different parameter combinations (e.g., compare panel θ 0 = 0.03, β = 10 and θ 0 = 0.3, β = 100), which will affect the parameter estimation (see below).
The previous models assume that the gene transfer process affects all genes identically. Indeed, each gene present in a genome has the same chance to be replaced by a gene transfer event, and this replacement has no effect on the reproduction rate. Here we show how to relax this assumption without prohibitively increasing the model complexity. To illustrate this, we study a new model, in which we distinguish two parts in the genomes: one part is governed by the same gene transfer process as model A; the other part does not undergo gene transfer and hence, constitutes a rigid core genome. We term this model C and note that a similar model has also been proposed in the context of the analysis of Prochlorococcus genomes . We assume that this rigid core has the same composition for all genomes. One interpretation of the rigid core is that genes in this core are essential, and deletion of any of a subset of genes in the core would be lethal to the individual. The average gene frequency distribution is given by Eq. (1), with an additional contribution for the rigid core, see Additional file 1: Appendix S5. This distribution depends on two parameters: the fraction λ 1 of the fluid part of the genomes, corresponding to λ 1 M genes per genome, and the gene transfer parameter θ 1 for this part. The rigid core then represents a fraction λ 2 = 1 - λ 1, or λ 2 M genes.
Parameter values of model fits
θ 1 ( λ 1 )
θ 2 ( λ 2 )
θ 1 ( λ 1 )
θ 2 ( λ 2 )
where U k and U ℓ are the number of genes found in either (but not both) genomes k and ℓ respectively in a pairwise comparison, and M k and M ℓ are the total number of genes found in genomes k and ℓ respectively. Estimates of genomic fluidity within a sample should agree with the true value of genomic fluidity within the population, in part, because they do not depend on the frequency of rare genomes or genes . Note that genomic fluidity summarizes gene frequency distributions, however multiple gene frequency distributions may be compatible with the same value of genomic fluidity. Hence, here we ask whether genomic fluidity is related to the model parameters, θ, θ 0, and β, that underlie the gene frequency distributions presented here.
Hence, genomic fluidity has a one-to-one relationship with θ, the relative rate of gene uptake to genomic replacement. When genomic fluidity approaches 1, then genomes are nearly completely dissimilar, which implies large gene replacements relative to genome reproductions (large θ). When genomic fluidity approaches 0, then processes that promote convergence of genomes are more important than gene-uptake processes (small θ). Previously, we advocated for the use of genomic fluidity on statistical grounds as a means to compare gene diversity between groups of genomes and as an alternative to the estimation of pan and core genome diversity. The constant population size model demonstrates that genomic fluidity may be indicative of processes driving the uptake of genes from the environment vs. genetic drift.
For the model with exponentially growing population size, the relationship between the genomic fluidity φ and the parameters θ 0 (for gene transfer) and β (for population growth) is more intricate. Genomic fluidity increases with the gene transfer parameter θ 0 and with the population growth parameter β, but there is no simple formula for φ(θ 0, β) analogous to Eq. (5). However, the genomic fluidity is useful to clarify the estimation of the parameters θ 0 and β, see Additional file 1: Appendix S4. Indeed, the different model fits return very similar estimates for the genomic fluidity (including the models with rigid and flexible cores), see Table 1. This illustrates the robustness of genomic fluidity, confirming our previous findings . However, this robustness comes with a trade-off: because very different parameter combinations θ 0 and β have the same genomic fluidity, we are unable to infer the gene transfer parameter θ 0 and the gene transfer rate σ from the genomic fluidity alone. This is a very typical finding in dynamic models in that predictions can be robust even when inferences of exact combinations of mechanistic parameters may not always be possible from model fits .
We have presented a neutral model of genome evolution that combines birth-death processes at the population level with gene transfer events at the genome level. We find that this model generically yields U-shaped gene frequency distributions. This result suggests that a gene's prevalence is insufficient to infer its essentiality to a species. We compared our model to empirical gene frequency distributions estimated from sequenced genomes of six bacterial species and found: (i) reasonable fits to data; (ii) improved fits when assuming non-constant population size or including an explicit core genome; (iii) despite the qualitative agreement, that there still remains unexplained aspects of empirical gene frequency distributions, e.g., skewness; (iv) that our neutral model is remarkably compatible with a previous proposal for a robust gene diversity index - genomic fluidity . We have also shown that our modelling framework can easily incorporate more complexity, which not surprisingly gives improved fits. In this sense our model is also formally related to models of population genetics in which assumptions of population sizes and dynamics are meant to evaluate if and when spatial population structure and even ecological dynamics may alter allele distributions in identifiable ways [20, 21]. In the present case the excellent fits obtained, e.g., for our flexible core model (model D), should not be interpreted as an indication for the validity of its assumptions. Rather, our analysis shows that gene frequency distributions do not contain sufficient information for the inference of evolutionary mechanisms underlying the observed distributions. Moreover, the finding that neutral models can generically lead to U-shaped gene frequency distributions suggests the need to incorporate and evaluate random processes in the analysis of gene composition and its dynamics.
Horizontal gene transfer is widely recognized as being an important mechanism driving genome evolution [22, 25, 29–31]. As such, there are many other models of evolution that address how neutral and selective processes give rise to variation in the state of genes and genomes (e.g., [16, 17, 20, 32–35]). Indeed, the central model of population genetics in which individuals die and are replaced at random by other individuals is utilized here [20, 21]. However, in the current model, genetic variation arises via the uptake of a novel gene. A recent paper also proposed a model of genome dynamics in which a rigid core was imposed  in order to fit gene frequency distributions estimated from 9 Prochlorococcus genomes. As shown here, such a fit may have limited inferential value, since a rigid core is not necessary in order to model gene frequency distributions. However, prior modeling suggests multiple avenues by which our model can be unified with dynamics at different scales. First, we have not considered horizontal gene transfer within genomes of the same species, nor recombination during division, nor of other types of transduction that may help to explain the finer genetic structure of bacterial populations [35–37]. Note that within-species gene transfer makes the acceptor genome more similar to the donor genome, and therefore reduces genetic diversity in the population just like genetic drift. We expect within-species gene transfer to have a smaller impact on genetic diversity than birth-death events, although its quantitative effect on the gene frequency distribution might be different (see  for an attempt to account for within-species gene transfer in a model similar to ours). Second, we do not include the fitness effect of mutations, whether neutral, beneficial or deleterious, which would impact the fixation of novel as well as pre-existing genes in genomes [16, 17]. Including non-neutral mutational effects would obviously be a departure from our effort here to describe how much of the information on gene variation in genomes can be described using purely neutral models or simple extensions thereof. Note that although the impact of horizontally transferred genes on genome fitness remains controversial, there is evidence that such genes have no, or mild, effects on genome fitness . Finally, a number of models have taken steps toward describing how the sizes of groups of genes, protein domains, proteins, and even categories of proteins (e.g., transcription factors) have changed over long evolutionary scales [32–34, 39, 40]. These models typically describe the structure within a genome (e.g., the abundance distribution of protein domains within different domain classes ), whereas our model describes population structure. It would seem that some unification of these models may be possible.
Development of models to predict and characterize gene composition variation among genomes is motivated by improvements in sequencing technologies which have enabled whole-genome sequencing of multiple isolates of the same bacterial species . However, the gene frequency distribution data upon which we base this model is subject to two caveats. First, we treated the sequenced genomes as if they were sampled uniformly from the population. However, the genomes exhibit phylogenetic structure which should be taken into account. One option would be to use the total divergence of the core genes to correct for the non-uniform sampling (e.g., ), although alternative normalizations are possible. Second, determining whether two genes are found in a pair of genomes depends on the use of cutoffs within some comparative alignment scheme. Different cutoffs can be utilized depending on whether one is interested in gene homologs, orthologs, gene families, gene super-families, and so on. If the cutoffs are set too stringently, then nearly every gene will appear to be unique. If the cutoffs are set too loosely, then every gene will appear to be the same as every other. Prior work demonstrated that there exist metrics of gene composition dissimilarity (e.g. genomic fluidity) that are robust to changes in such cutoffs . A unification of the current model with a sequence-based gene model would present opportunities to connect more factors (including mutation and recombination) driving gene variation with empirical patterns. However, we suggest that caution may be necessary in moving forward when attempting to utilize best fit parameters to infer mechanistic rates. In the present case, we showed that our neutral model reveals a well-known phenomenon of having robust predictions within a parameter space that poses an identifiability problem . In essence, there are combinations of evolution parameters that yield similar predictions for gene frequency distributions (see Additional file 1: Figure S5 and Additional file 1: Appendix S4). Hence, more information is required concerning actual population size structure and the nature of gene uptake  before we recommend utilizing our best fits to precisely estimate gene transfer rate, effective population size, growth rate and so on. More generally, fitted parameter values are subject to numerous simplifying assumptions of the models. Although it is interesting to compare the order of magnitude of the parameter fits with experimental data, one should be cautious to interpret the parameters too strictly as directly measurable quantities.
In this manuscript we presented a purely neutral explanation for the non-equal distributions of genes within genomes. The utilization of neutral models in genetics and ecology have yielded similar results in the past: in presenting quantitative arguments for when unequal patterns of appearance imply mechanisms of selection [12, 44, 45]. For example, a recent proposal for a unified theory of biodiversity and biogeography for forest trees  started with a similar dilemma. In that case, ecologists had observed skewed rank-abundance relationships such that some tree species were found at very high abundances and others at very low abundances. Ecologists had by and large assumed that those trees with greater abundance had a fitness advantage over trees with lower abundances. However, Hubbell's model showed that finding a few common trees and many rare trees could also be derived without invoking selection. Hence, in order to determine whether or not tree species had a fitness advantage in different regions one needed to look for correlations between traits and abundance which would not have been expected from a purely neutral model [46, 47]. In the case considered here, our neutral model shows that the U-shape of gene frequency distributions provide less information than previously thought about the fitness benefit of genes. Instead, we need to find patterns of genome composition variation that can be explained by neutral models and identify those patterns or deviations from patterns that cannot be explained by neutral models - similar proposals have been advocated in other contexts . Possible examples include the analysis of gene sequences and correlations amongst those gene present or absent amongst a set of genomes. In moving forward we suggest the need to continue to build the toolbox of a quantitative evolutionary genomics specifically adapted to the mechanisms operating within and amongst microbes.
The pipeline has been described in detail elsewhere [19, 23]. In brief, it (i) finds genes; (ii) calculates homology between all genes within a group of genomes using a set of cutoffs associated with identity and coverage (here set at 70% identity and coverage); (iii) applies a maximal clustering rule to determines groups of homologous genes; (iv) determines a gene presence-absence matrix of dimension M tot × G of the total number of genes M tot in the group of G genomes. We take row-sums of this matrix to find the frequency of each of the M tot genes, and then take the histogram of these row-sums to calculate the gene frequency distribution. See Additional file 2 for the empirical gene frequency distributions of each of the six species analyzed here.
i.e., the mean square difference of the square-root transformed frequency distributions. We use a square-root transform to balance the different contributions to Δ. Without this transform large values of g k (i.e., the tips of the U-shaped distribution at k = 1 and k = G) are weighed too heavily; with a logarithm transform small values of g k (i.e., the intermediate part of gene frequency distribution, 2 ≤ k ≤ G - 1) get proportionally too much weight. The fitted model parameters are reported in Table 2, and the corresponding distance Δ in Table 1. See Additional file 3 for Matlab scripts utilized to estimate the best fit parameters for each model.
DNA uptake rates are generally presented as transformation frequencies, i.e., the fraction of colony forming units (CFUs) which have taken up a marker sequence relative to the total number of CFUs. Let us denote ϵ as the transformation frequency in an uptake experiments in which growing cells are exposed to DNA for a time T in exponential growth phase. Consider the division rate of the cells to be r, irrespective of whether they have taken up the marker sequence or not. Hence, we can write the dynamics for the population of cells without, s(t), and with, m(t), the marker sequence: ds/dt = rs-σs and dm/dt = rm + σs, where σ is the gene uptake rate we would like to estimate. The solutions to these equations are s(t) = s 0 e (r-σ)t and m(t) = s 0 e rt (1 - e -σt ), where s 0 is the initial population of cells. Hence, at the end of the experiment, ϵ = m(T)/(s(T) + m(T)) or ϵ = 1 - e -σT . Hence, σ can be estimated from the measurement of ϵ by solving .
The authors thank K. Jordan, M. Cosentino Lagomarsino, T. Read, F. Stewart, S. Yi, and the anonymous reviewers for comments and suggestions on the manuscript. The authors thank A. Kislyuk for assistance with bioinformatics analysis. JSW acknowledges the support of Defense Advanced Projects Research Agency under grant HR0011-09-1-0055. JSW holds a Career Award at the Scientific Interface from the Burroughs Wellcome Fund.