A neutral theory of genome evolution and the frequency distribution of genes

Background The gene composition of bacteria of the same species can differ significantly between isolates. Variability in gene composition can be summarized in terms of gene frequency distributions, in which individual genes are ranked according to the frequency of genomes in which they appear. Empirical gene frequency distributions possess a U-shape, such that there are many rare genes, some genes of intermediate occurrence, and many common genes. It would seem that U-shaped gene frequency distributions can be used to infer the essentiality and/or importance of a gene to a species. Here, we ask: can U-shaped gene frequency distributions, instead, arise generically via neutral processes of genome evolution? Results We introduce a neutral model of genome evolution which combines birth-death processes at the organismal level with gene uptake and loss at the genomic level. This model predicts that gene frequency distributions possess a characteristic U-shape even in the absence of selective forces driving genome and population structure. We compare the model predictions to empirical gene frequency distributions from 6 multiply sequenced species of bacterial pathogens. We fit the model with constant population size to data, matching U-shape distributions albeit without matching all quantitative features of the distribution. We find stronger model fits in the case where we consider exponentially growing populations. We also show that two alternative models which contain a "rigid" and "flexible" core component of genomes provide strong fits to gene frequency distributions. Conclusions The analysis of neutral models of genome evolution suggests that U-shaped gene frequency distributions provide less information than previously suggested regarding gene essentiality. We discuss the need for additional theory and genomic level information to disentangle the roles of evolutionary mechanisms operating within and amongst individuals in driving the dynamics of gene distributions.


Background
The gene content of genomes of closely related bacteria can differ significantly. For example, pair-wise comparisons of genome sequences from isolates of the same species often do not share a substantial fraction of their gene content [1][2][3][4][5][6][7][8][9][10]. When a large number of genomes within a species or closely related group of bacteria are sequenced, the gene content variability can be summarized as a gene frequency distribution: given G sequenced genomes, some genes are found in a fraction 1 ≤ k ≤ G of all genomes.
Empirically, such gene frequency distributions possess a characteristic U-shape, such that there are many genes which only appear in one genome, fewer genes which appear in 2 ≤ k ≤ G -1 genomes, and many genes which appear in all genomes. Genes within each of these three categories have been labeled accessory, character and core genes, respectively [11]. It is tempting to conflate gene frequency with relative essentiality, but is it valid? For example, is it necessarily true that a gene that appears in all genomes in a sample should be classified as a "core" gene? Could such a gene have become common through neutral processes that diminish variability, and occasionally, lead to fixation of types? Likewise, should a gene which appears in only one genome in a sample be considered as "accessory" to the function of that organism? Could such a gene have become rare via a neutral path toward extinction, or have been recently introduced to a lineage without significant effect on individual fitness?
Here, we argue that a suitable null model is necessary with which to deduce how much weight be given to gene frequency data as a means to generate hypotheses regarding essentiality. For example, a recently proposed neutral theory of biogeography and biodiversity has proved useful in clarifying when and how species abundance distributions in ecology can provide information about selective processes in complex communities [12][13][14][15]. Hence, in this manuscript we ask: is it possible to recapitulate findings of U-shaped gene frequency distributions in the absence of selective forces driving genomic and population composition? We answer this question in the affirmative by proposing a simple and analytically tractable neutral model of genome evolution that explicitly accounts for gene composition of genomes. In this model, genomes undergo birth-death processes in a neutral sense and also acquire and lose genes which we term "gene transfer". The model differs from most previous efforts to analyze genome evolution [16,17] by self-consistently treating the dynamics at two scales: population level drift and genomic level change (for an exception, see [18] whose model we address in the Results and Discussion). Analysis of the current model leads to the following major results.
First, we find that gene frequency distributions derived from this model possess a characteristic U-shape for a robust range of model parameters. Hence, we propose that prevalence of a gene does not necessarily imply its essentiality, and that gene frequency distributions may be more limited than previously acknowledged in generating inferences regarding essentiality. Second, we estimate the best fit parameters for a given empirical gene-frequency distribution of sequenced genomes and in so doing find a reasonable correspondence between our neutral model and data from six distinct bacterial species with sequenced genomes from multiple isolates. However, our model assuming constant population sizes predicts gene frequency distributions with systematically fewer rare genes than the empirical distributions. Hence, we show that assuming other types of population dynamics (such as exponentially growing populations) can change the model predictions in line with empirical data, providing a basis for investigating the role of population dynamics in shaping gene frequency distributions. Further, we extend the model to include a rigid and flexible core in the genome, and show that other assumptions about genome structure are consistent with gene frequency data. Finally, we show that a recently proposed gene diversity index -genomic fluidity [19] -is a natural parameter emerging in the neutral models of genome evolution described here. Whereas previously this parameter was entirely statistical in nature, we discuss here how genomic fluidity can be seen as a proxy for the relative importance of gene uptake in shaping the gene composition of genomes between species. In so doing, we discuss how other observations could be combined with gene frequency distributions to improve inferences regarding evolutionary mechanisms shaping genome composition.

Results and discussion
A neutral model of genome evolution combines birthdeath events with gene transfer events We propose the following neutral model of genome evolution, see Figure 1. Consider a population consisting of N organisms in which each organism has a genome consisting of M unique genes. The dynamics consist of a sequence of reproduction and gene transfer events. In a reproduction event, one of the N organisms (chosen at random) dies, and is replaced by offspring of one of the other organisms (chosen at random). The offspring genome is identical to the parent genome. Note that there are still N organisms after the event and hence, this step is equivalent to a birth-death event of the Moran model [20,21]. In a gene transfer event, one of the N organisms acquires a gene from the environment. We assume that this gene is new, i.e., a gene that has not been present in the population before, and hence, this step is comparable to a mutation event in the infinitely many alleles model of population genetics [20,21]. In the model, gene transfer events do not affect birth and death rates of the individual (e.g. [22] present evidence for the neutrality or near-neutrality of transferred genes). We also assume that the acquisition of the new gene induces the loss of another gene in the genome, so that the organism's genome still consists of M genes after the event. We utilize a constant value of M to facilitate mathematical analysis and note that bioinformatic based estimates of total gene counts vary approximately 10% between genomes of the same species, as considered here.
Individual birth-death events cause genetic drift in the population as the number of organisms having a particular gene fluctuates over time. Genetic drift has the tendency to reduce the genetic diversity in the population. Indeed, when the last organism carrying a particular gene dies, this gene disappears from the population, and has no opportunity of re-entering the population. Gene birth-death events, on the other hand, maintain the genetic diversity in the population. New genes enter at low frequency due to "gene transfer" events. These transfer events may be due to uptake of genes from the environment, insertion of genes via viruses, or conjugation with other individuals. In our model, individual birth-death events and gene transfer events have associated rate parameters: we denote by r the rate of reproduction per individual, and by s the rate of gene transfer per individual. Intuitively we expect that variability in gene composition of genomes should increase when gene transfer rate s increases relative to the individual reproduction rate r, and vice-versa. Further, we expect that gene frequency distributions will tend toward U-shaped distributions because of the tension between gene transfer (which would favor increasing rarity of genes) and individual birth-death (which would favor increasing commonness of genes, due to neutral drift). We evaluate this prediction of U-shaped distributions in the following section.

Neutral model of genome evolution predicts U-shaped gene frequency distributions
The distribution of genes over genomes can be characterized in detail for the neutral model of genome evolution described above. For example, the gene frequency distribution can be computed explicitly, see Additional file 1: Appendix S1. To describe the solution, we consider a sample of G genomes taken from the population, and we denote the average number of genes appearing in k of the G genomes by g k . The gene frequencies predicted by the neutral model of genome evolution are where θ is an effective gene transfer rate. The distribution in Eq.(1) appears in solutions to allele distributions in the infinitely many alleles model of population genetics [20,21].
As the number of genes M in a genome and the number of genomes G in the sample are given by the data, the gene frequencies g k are parametrized by the dimensionless parameter θ which combines the effects of both gene transfer and birth-death processes. Hence, different combinations of N, r and s lead to identical gene frequency distributions. In particular, the predicted distribution is insensitive to accelerating simultaneously gene transfer and reproduction, because the distribution depends on the ratio σ ρ , and not on r and s individually.
Moreover, an increase of the ratio σ ρ , the relative rate of gene transfer, can be compensated by a smaller population size N, increasing the intensity of genetic drift. Note that although in practice the sample size G is much smaller than the population size N, Eq. (1) is also valid for G = N, i.e., it can be used to compute the (empirically inaccessible) gene frequency distribution of the entire population. We plot Eq. (1) for cases where θ = 0.03, θ = 0.3 and θ = 3 in Figure 2. As anticipated, the weight of the gene frequency distribution shifts from the common genes for small values of θ (left panel) to the rare genes for larger values of θ (right panel). The gene frequency distributions have a U-shape for a robust range of parameters. The U-shape is generic so long as θ <1; for θ >1 the distribution is monotonically decreasing. As shown in Additional file 1: Figure S1, the gene frequency distribution changes when sampling more genomes, but the characteristic U-shape remains. These observations for the neutral model of genome evolution show that Ushaped frequency distributions do not require invoking selection at the genome level. Further, the observations suggest that findings of prevalent genes need not be an indicator of essentiality in the absence of other information about gene function.

Comparing empirical gene frequency distributions of multiply sequenced bacterial species to model predictions
We collect and analyze empirical gene frequency distributions from 6 species of bacterial pathogens: B. anthracis, E. coli, Staph. aureus, Strep. pneumoniae, Strep. pyogenes and N. meningitidis. Gene frequency distributions were compiled by applying an automated genomic pipeline to remove the impact of curation bias and to normalize comparisons between species [23]. Hence, what we compile are frequency distributions of clusters of homologous genes (for details, see [19,23]), which we denote, for simplicity, as "genes" in this manuscript. We find that the empirical gene frequency distributions have a characteristic U-shape in that there are many genes which only appear in a single genome, many genes which appear in all genomes, and fewer genes that appear in an intermediate number of genomes (see Figure 3). This characteristic U-shape is robust to reasonable changes in the values of identity and coverage utilized for comparing genes in our genomic pipeline. Notice that the gene frequency distribution is on a log scale, hence the U-shape is in fact highly pronounced, in that there may be 50 times as many genes that appear in all genomes than appear in half the genomes.
The neutral model of genome evolution can be fit to data using Eq. (1). To do so, we determine the parameter θ that minimizes the distance between the predicted and empirical gene frequency distribution (see Materials and Methods). We find that the neutral model is in reasonable correspondence with data (see Figure 3). First, both data and predictions have a U-shape. Next, model predictions agree well with observations for the total number of core genes. These predictions are made with a single free parameter, θ. On the other hand, the model underpredicts the number of rare genes and overpredicts the number of genes in the intermediate part of the distribution. In particular, the model predicts a g k~1 /k dependence for the peak at small k, whereas the data are closer to a steeper g k~1 /k 2 dependence. In the next section we show that this deviation can be partially remedied by dropping the assumption of a constant population size. Finally, it is important to note that although our model assumes constant M, we find that there is approximately a 10% difference in total gene content within the genomes of each of the six species.
Predictions for the observed pan and core genome size in a sample of genomes can also be obtained. In the past, the pan genome size has been defined as the number of genes in all genomes of the population. Similarly, the core genome size has been defined as the number of genes found in every genome in the population. However, we and colleagues have previously shown that estimating pan and core genome sizes are unreliable because they  depend on observations of rare genes and genomes, respectively, which are difficult to find in samples precisely because they are rare [19]. Here, we define the observed pan and core genome size as the number of unique genes found in all sample genomes and the number of genes common to all sample genomes, respectively. As we derive in Additional file 1: Appendix S1, the model predicts that the pan genome size increases logarithmically with sample size, and that the core genome size decreases as a power law (with exponent -θ). The prediction that gene diversity grows without bound is unsurprising, because we assumed an infinite gene pool (and we caution that gene diversity cannot increase to infinity in reality). Nevertheless, we expect the logarithmic (power-law) dependence of pan (core) genome size also to hold for a finite gene pool (as long as the gene pool is much larger than the set of genes observed in the sample). These findings are corroborated by the data, which exhibit the same qualitative behavior, see Additional file 1: Figure S2. Notice that we used the value for   Figure 3 in the pan and core genome fits of Additional file 1: Figure S2, so that these fits have no additional free parameters.

Estimating gene transfer parameters from model fits
We can utilize parameters estimated from gene frequency distribution fits to also estimate underlying mechanistic parameters driving genome evolution, albeit with caveats that we discuss below. Note that the estimated values for θ are in the range 0.1 <θ < 0.5 (see Table 1). For example, with M ≈ 2000, then the product Ns/r should be on the order of 10 3 . Conventional estimates of effective population size are approximately 10 7 -10 9 [24,25], suggesting that gene uptake is on the order of s/r ≈ 10 -4 -10 -6 , approximately once per tens of thousands or million divisions. Assuming bacteria that divide once per hour, then s ≈ 10 -4 -10 -6 hr -1 . In this model, s represents gene transfer. Here, we consider one mechanism for such gene uptake -natural uptake of DNA from the environment -and evaluate whether or not empirical parameters associated with transformation are consistent with inferred estimates of s. Natural transformation rates vary widely depending on strain type, sequence homology, and physiological conditions. Empirically estimated DNA uptake rates are generally reported as transformation frequencies, , defined as the proportion of colony forming units (CFUs) that have taken up a segment of DNA of interest at the end of some experimental time period, T. We developed a simple model that estimates s directly from and T (see Materials and Methods). We find that application of this method yields estimates of gene uptake rate for pathogens in laboratory environments that bracket the value predicted from our model. For example, DNA damaged Helicobacter pylori cells exhibit ~10 -4 -10 -8 in a T = 2.5 hr experiment [26] Hence, s~10 -4 -10 -9 hr -1 . Likewise, naturally competent Neisseria gonorrhoeae cells exhibit transformation frequencies ~10 -3 of total cells after T = 4 hr, though values range from ~10 -2 -10 -7 [27]. These values yield uptake rates of s~10 -4 hr -1 with a range of s~10 -3 -10 -8 hr -1 . In both cases, these estimates are consistent with our estimate of gene transfer rates in the multiply sequenced pathogens considered here. However, there remains substantial disagreement as to whether the lower, effective number derived from certain population genetic models or the much larger, census number is a better estimate of effective population size [25]. Hence, without species-specific information, we caution that direct estimates of either gene transfer rate or effective population size should be treated with skepticism, even if estimates of their combined effect is more robust. Moreover, as we show in the next section, the value of θ is sensitive to assumptions about population history. Indeed, there is no reason to expect that the population structure and selective effects of genes are as simple as assumed here, providing additional caution to overly rigid interpretations of estimates of either N or s.

Population structure strongly impacts gene frequency distributions
The current neutral model of genome evolution assumes a fixed population size N. This is a common, but likely unrealistic, assumption as bacterial populations can undergo large and fast size fluctuations. In this model, the introduction of novel genes is decoupled from the history of population size or structure, so that we can select an arbitrary population size or structure and then superimpose the introduction of novel genes on top of the resulting history of individual births and deaths. To illustrate this point, we consider how an exponentially growing population affects the gene frequency distributions g k . Specifically we denote the population size history as Model A assumes a constant population size, and the same gene transfer process for all genes. Model B assumes an exponentially growing population size. Model C assumes that a part of the genome is shared by all genomes (a rigid core); the other part is subjected to the same gene transfer process as in model A.
Model D assumes two parts in the genomes, governed by different gene transfer rates. We determined for the four models the parameters that minimize the distance Δ between the empirical and the theoretical gene frequency distribution (see Materials and Methods for the definition of Δ). For each of the 6 bacterial species analyzed, we report the number of analyzed genomes G, the genome size M (average number of genes per genome), the distance Δ for the model fits, the genomic fluidity obs estimated on the data, and the fluidity pred for the model fits. Recall that model A has one parameter, models B and C have two parameters, and model D has three parameters.
with a the population growth rate, t 0 the present time and N 0 = N(t 0 ) the present population size. We use a coalescence approach [20,21] to compute the average gene frequencies g k , see Additional file 1: Appendix S3. The solution for the average gene frequency distribution g k depends on two dimensionless parameters θ 0 and b, The parameter θ 0 is the same as θ for the constant population size model, except that the population size N is replaced by the present population size N 0 ; again we call θ 0 the gene transfer parameter. The parameter b is a rescaled version of the population growth rate a; we call it the population growth parameter. The constant population size model, which we denote by model A, is a oneparameter model; the variable population size model, which we denote by model B, is a two-parameter model. Hence, we expect a richer set of gene frequency distributions predicted by model B compared to model A.
Additional file 1: Figure S3 shows gene frequency distributions computed for different combinations of the parameters θ 0 and b. For small b ≤ 1 the distributions closely resemble the distributions of the model A with constant population size (a = b = 0, see Figure 2). When increasing the population growth parameter b, the U-shape becomes more pronounced. For example, the peak at small k has a power-law dependence g k~k -g with g = 1 for small b and g increasing for increasing b. The predicted distributions are often, apart from the peak for the core genes present in all genomes, almost symmetric (see panels with θ 0 = 0.03 or θ 0 = 0.3). Notice that very similar distributions can be obtained for different parameter combinations (e.g., compare panel θ 0 = 0.03, b = 10 and θ 0 = 0.3, b = 100), which will affect the parameter estimation (see below).
We can fit empirical distributions to this neutral model of genome evolution (model B). To do so, we determine the parameters θ 0 and b that minimize the distance between the predicted and empirical gene frequency distribution. This computation yields estimates for the gene transfer parameter θ 0 and the population growth parameter b. As shown in Figure 4, the gene frequency distributions for model B fit the data better than those for model A (compare with Figure 3). The predictions of model B are uniformly accurate for the number of rare genes, the number of genes in the intermediate part of the distribution and the number of common genes. However, smaller but systematic deviations remain between observed and predicted gene frequency distributions. In particular, the empirical distributions (except for B. anthracis) are left-skewed, whereas the theoretical distributions have no skew or a small right skew. The improved fit is also apparent from the distance Δ between data and model, reported in Table 1, especially for B. anthracis. The estimate of the gene transfer parameter θ 0 is an order of magnitude larger in model B than the estimate of the gene transfer parameter θ in model A. However, the population size in model B is that of the present, whereas the population size in model A is the effective population size over the entire coalescent history. Hence, differences in gene transfer parameters are to be expected because they are driven in part by changes in assumptions about population size. This suggests that caution must be applied before utilizing θ, θ 0 , or other dimensionless gene transfer parameters, to estimate an effective gene transfer rate without additional information that constrains the estimates of both population size and growth rate. For similar reasons caution should be applied to the interpretation of the estimates of the growth parameter b.

Models with an explicit core genome improve fit of empirical gene frequency distributions
The previous models assume that the gene transfer process affects all genes identically. Indeed, each gene present in a genome has the same chance to be replaced by a gene transfer event, and this replacement has no effect on the reproduction rate. Here we show how to relax this assumption without prohibitively increasing the model complexity. To illustrate this, we study a new model, in which we distinguish two parts in the genomes: one part is governed by the same gene transfer process as model A; the other part does not undergo gene transfer and hence, constitutes a rigid core genome. We term this model C and note that a similar model has also been proposed in the context of the analysis of Prochlorococcus genomes [18]. We assume that this rigid core has the same composition for all genomes. One interpretation of the rigid core is that genes in this core are essential, and deletion of any of a subset of genes in the core would be lethal to the individual. The average gene frequency distribution is given by Eq.
(1), with an additional contribution for the rigid core, see Additional file 1: Appendix S5. This distribution depends on two parameters: the fraction l 1 of the fluid part of the genomes, corresponding to l 1 M genes per genome, and the gene transfer parameter θ 1 for this part. The rigid core then represents a fraction l 2 = 1l 1 , or l 2 M genes.
We can determine the parameters for which model C fits best the empirical gene frequency distributions. The model fits, shown in Figure 5 (yellow line), are better than those for model A (Figure 3), but are worse than those for model B (Figure 4), see also Table 1. Note that the fitting error Δ B and Δ C of models B and C can be compared directly, because both models have two independent parameters. Model C predicts that about half of the genome belongs to the rigid core (l 1 ≈ l 2 ≈ 0.5, see Table 2). The other part of the genome is rather fluid, with estimated gene transfer parameter θ 1 ≈ 1 (see Table 2). This combination of parameters results (except for B. anthracis and Strep. pyogenes) in a steep dip for the common (but not core) genes of the frequency distributions. However, such a dip is not present in the data ( Figure 5).
To weaken the assumption of a rigid core, we consider another model with an explicit core genome, which we call model D, in which genes in the core genome retain some fluidity. More precisely, as for model C, we divide the genomes into two parts. Both parts are governed by the gene transfer process of model A, but the genes in the first part (fraction l 1 , gene transfer parameter θ 1 ) are more fluid than the genes in the second part (fraction l 2 = 1 -l 1 , gene transfer parameter θ 2 <θ 1 ). Hence,  model D has three independent parameters. The average gene frequency distribution is equal to the sum of two distributions (1), with parameters θ 1 and θ 2 , see Additional file 1: Appendix S5. The predicted distributions show an excellent agreement with the empirical data ( Figure 5), as can also be seen from the fitting error Δ: for E. coli and N. meningitidis the error has dropped tenfold compared to the other models, see Table 1. It is also interesting to note that the model consistently (for the 6 bacterial species, although B. anthracis clearly stands out) predicts genomes with a small part of high fluidity (l 1 ≈ 0.1, θ 1 ≈ 10) and a large part of low fluidity (l 2 ≈ 0.9, θ 2 ≈ 0.1). Moreover, the model with a flexible core genome (model D) predicts the scaling of sample pan and core genome sizes in close agreement with the data, see Figure 6. Further, because no  Data comparison for models with rigid and flexible core genomes (models C and D). Comparison of gene frequency distributions with predictions of two models which assume that a part of a genome is more susceptible to gene transfer. The genomes in model C have a rigid core, i.e., some genes cannot be removed from the genomes. The genomes in model D have a flexible core, i.e., theses core genes can be moved around between genomes, but to a lesser extent than the other genes. Model C has two parameters, whereas model D has three parameters. Black circles: data; yellow line with squares: model C; green line with squares: model D.
additional free parameters were utilized in making this fit, such scaling represents an additional prediction of each of the models.

Genomic fluidity as a mechanistic summary statistic for gene frequency distributions
In a previous work [19] we advocated for the use of robust diversity indices to describe gene variation between genomes. In doing so we proposed the use of "genomic fluidity" which captures the average dissimilarity of pairs of genomes from within a group based on gene content. Specifically, genomic fluidity is equal to the probability that a randomly chosen gene from one genome is not found in another genome within the same group of organisms. For a sample of G genomes it can be estimated using the following formula: where U k and U ℓ are the number of genes found in either (but not both) genomes k and ℓ respectively in a pairwise comparison, and M k and M ℓ are the total number of genes found in genomes k and ℓ respectively. Estimates of genomic fluidity within a sample should agree with the true value of genomic fluidity within the population, in part, because they do not depend on the frequency of rare genomes or genes [19]. Note that genomic fluidity summarizes gene frequency distributions, however multiple gene frequency distributions may be compatible with the same value of genomic fluidity. Hence, here we ask whether genomic fluidity is related to the model parameters, θ, θ 0 , and b, that underlie the gene frequency distributions presented here.
For the model with constant population size, the genomic fluidity and the gene transfer parameter θ are intimately linked. For a population in steady state, we have, see Additional file 1: Appendix S1, Hence, genomic fluidity has a one-to-one relationship with θ, the relative rate of gene uptake to genomic replacement. When genomic fluidity approaches 1, then genomes are nearly completely dissimilar, which implies large gene replacements relative to genome reproductions (large θ). When genomic fluidity approaches 0, then processes that promote convergence of genomes are more important than gene-uptake processes (small θ). Previously, we advocated for the use of genomic fluidity on statistical grounds as a means to compare gene diversity between groups of genomes and as an alternative to the estimation of pan and core genome diversity. The constant population size model demonstrates that genomic fluidity may be indicative of processes driving the uptake of genes from the environment vs. genetic drift.
For the model with exponentially growing population size, the relationship between the genomic fluidity and the parameters θ 0 (for gene transfer) and b (for population growth) is more intricate. Genomic fluidity increases with the gene transfer parameter θ 0 and with the population growth parameter b, but there is no simple formula for (θ 0 , b) analogous to Eq. (5). However, the genomic fluidity is useful to clarify the estimation of the parameters θ 0 and b, see Additional file 1: Appendix S4. Indeed, the different model fits return very similar estimates for the genomic fluidity (including the models with rigid and flexible cores), see Table 1. This illustrates the robustness of genomic fluidity, confirming our previous findings [19]. However, this robustness comes with a trade-off: because very different parameter combinations θ 0 and b have the same genomic fluidity, we are unable to infer the gene transfer parameter θ 0 and the gene transfer rate s from the genomic fluidity alone. This is a very typical finding in dynamic models in that predictions can be robust even when inferences of exact We determined for each of the four models A, B, C and D the parameters that minimize the distance Δ, see Eq. (6), between the empirical and the theoretical gene frequency distribution. Here we report the gene transfer parameter θ of the model A fit, the gene transfer parameter θ 0 and the population growth parameter b of the model B fit, the genome fractions l 1 and l 2 , and the gene transfer parameter θ 1 of the model C fit, and the genome fractions l 1 and l 2 , and the gene transfer parameters θ 1 and θ 2 of the model D fit.
combinations of mechanistic parameters may not always be possible from model fits [28].

Conclusions
We have presented a neutral model of genome evolution that combines birth-death processes at the population level with gene transfer events at the genome level. We find that this model generically yields U-shaped gene frequency distributions. This result suggests that a gene's prevalence is insufficient to infer its essentiality to a species. We compared our model to empirical gene frequency distributions estimated from sequenced genomes of six bacterial species and found: (i) reasonable fits to data; (ii) improved fits when assuming non-constant population size or including an explicit core genome; (iii) despite the qualitative agreement, that there still remains unexplained aspects of empirical gene frequency distributions, e.g., skewness; (iv) that our neutral model is remarkably compatible with a previous proposal for a robust gene diversity index -genomic fluidity [19]. We have also shown that our modelling framework can easily incorporate more complexity, which not Predictions for observed core and pan genome size for model D. We used the parameters λ 1 , θ 1 and θ 2 obtained from fitting the gene frequency distribution (see Figure 5) to evaluate the predicted core and pan genome size (see Additional file 1: Appendix S6). Black circles: data; green line: mean prediction; green shaded region: standard deviation of prediction. The increasing curves are for the pan genome; the decreasing curves are for the core genome.
surprisingly gives improved fits. In this sense our model is also formally related to models of population genetics in which assumptions of population sizes and dynamics are meant to evaluate if and when spatial population structure and even ecological dynamics may alter allele distributions in identifiable ways [20,21]. In the present case the excellent fits obtained, e.g., for our flexible core model (model D), should not be interpreted as an indication for the validity of its assumptions. Rather, our analysis shows that gene frequency distributions do not contain sufficient information for the inference of evolutionary mechanisms underlying the observed distributions. Moreover, the finding that neutral models can generically lead to U-shaped gene frequency distributions suggests the need to incorporate and evaluate random processes in the analysis of gene composition and its dynamics. Horizontal gene transfer is widely recognized as being an important mechanism driving genome evolution [22,25,[29][30][31]. As such, there are many other models of evolution that address how neutral and selective processes give rise to variation in the state of genes and genomes (e.g., [16,17,20,[32][33][34][35]). Indeed, the central model of population genetics in which individuals die and are replaced at random by other individuals is utilized here [20,21]. However, in the current model, genetic variation arises via the uptake of a novel gene. A recent paper also proposed a model of genome dynamics in which a rigid core was imposed [18] in order to fit gene frequency distributions estimated from 9 Prochlorococcus genomes. As shown here, such a fit may have limited inferential value, since a rigid core is not necessary in order to model gene frequency distributions. However, prior modeling suggests multiple avenues by which our model can be unified with dynamics at different scales. First, we have not considered horizontal gene transfer within genomes of the same species, nor recombination during division, nor of other types of transduction that may help to explain the finer genetic structure of bacterial populations [35][36][37]. Note that within-species gene transfer makes the acceptor genome more similar to the donor genome, and therefore reduces genetic diversity in the population just like genetic drift. We expect within-species gene transfer to have a smaller impact on genetic diversity than birth-death events, although its quantitative effect on the gene frequency distribution might be different (see [38] for an attempt to account for within-species gene transfer in a model similar to ours). Second, we do not include the fitness effect of mutations, whether neutral, beneficial or deleterious, which would impact the fixation of novel as well as preexisting genes in genomes [16,17]. Including non-neutral mutational effects would obviously be a departure from our effort here to describe how much of the information on gene variation in genomes can be described using purely neutral models or simple extensions thereof. Note that although the impact of horizontally transferred genes on genome fitness remains controversial, there is evidence that such genes have no, or mild, effects on genome fitness [22]. Finally, a number of models have taken steps toward describing how the sizes of groups of genes, protein domains, proteins, and even categories of proteins (e.g., transcription factors) have changed over long evolutionary scales [32][33][34]39,40]. These models typically describe the structure within a genome (e.g., the abundance distribution of protein domains within different domain classes [34]), whereas our model describes population structure. It would seem that some unification of these models may be possible.
Development of models to predict and characterize gene composition variation among genomes is motivated by improvements in sequencing technologies which have enabled whole-genome sequencing of multiple isolates of the same bacterial species [41]. However, the gene frequency distribution data upon which we base this model is subject to two caveats. First, we treated the sequenced genomes as if they were sampled uniformly from the population. However, the genomes exhibit phylogenetic structure which should be taken into account. One option would be to use the total divergence of the core genes to correct for the non-uniform sampling (e.g., [42]), although alternative normalizations are possible. Second, determining whether two genes are found in a pair of genomes depends on the use of cutoffs within some comparative alignment scheme. Different cutoffs can be utilized depending on whether one is interested in gene homologs, orthologs, gene families, gene super-families, and so on. If the cutoffs are set too stringently, then nearly every gene will appear to be unique. If the cutoffs are set too loosely, then every gene will appear to be the same as every other. Prior work demonstrated that there exist metrics of gene composition dissimilarity (e.g. genomic fluidity) that are robust to changes in such cutoffs [19]. A unification of the current model with a sequence-based gene model would present opportunities to connect more factors (including mutation and recombination) driving gene variation with empirical patterns. However, we suggest that caution may be necessary in moving forward when attempting to utilize best fit parameters to infer mechanistic rates. In the present case, we showed that our neutral model reveals a well-known phenomenon of having robust predictions within a parameter space that poses an identifiability problem [28]. In essence, there are combinations of evolution parameters that yield similar predictions for gene frequency distributions (see Additional file 1: Figure S5 and Additional file 1: Appendix S4). Hence, more information is required concerning actual population size structure and the nature of gene uptake [43] before we recommend utilizing our best fits to precisely estimate gene transfer rate, effective population size, growth rate and so on. More generally, fitted parameter values are subject to numerous simplifying assumptions of the models. Although it is interesting to compare the order of magnitude of the parameter fits with experimental data, one should be cautious to interpret the parameters too strictly as directly measurable quantities.
In this manuscript we presented a purely neutral explanation for the non-equal distributions of genes within genomes. The utilization of neutral models in genetics and ecology have yielded similar results in the past: in presenting quantitative arguments for when unequal patterns of appearance imply mechanisms of selection [12,44,45]. For example, a recent proposal for a unified theory of biodiversity and biogeography for forest trees [12] started with a similar dilemma. In that case, ecologists had observed skewed rank-abundance relationships such that some tree species were found at very high abundances and others at very low abundances. Ecologists had by and large assumed that those trees with greater abundance had a fitness advantage over trees with lower abundances. However, Hubbell's model showed that finding a few common trees and many rare trees could also be derived without invoking selection. Hence, in order to determine whether or not tree species had a fitness advantage in different regions one needed to look for correlations between traits and abundance which would not have been expected from a purely neutral model [46,47]. In the case considered here, our neutral model shows that the U-shape of gene frequency distributions provide less information than previously thought about the fitness benefit of genes. Instead, we need to find patterns of genome composition variation that can be explained by neutral models and identify those patterns or deviations from patterns that cannot be explained by neutral models -similar proposals have been advocated in other contexts [48]. Possible examples include the analysis of gene sequences and correlations amongst those gene present or absent amongst a set of genomes. In moving forward we suggest the need to continue to build the toolbox of a quantitative evolutionary genomics specifically adapted to the mechanisms operating within and amongst microbes.