A model for biased fractionation after whole genome duplication
 David Sankoff^{1},
 Chunfang Zheng^{1} and
 Baoyong Wang^{1}
https://doi.org/10.1186/1471-2164-13-S1-S8
© Sankoff et al.; licensee BioMed Central Ltd. 2012
Published: 17 January 2012
Abstract
Background
Paralog reduction, the loss of duplicate genes after whole genome duplication (WGD), is a pervasive process. Whether this loss proceeds gene by gene or through deletion of multigene DNA segments is controversial, as is the question of fractionation bias, namely whether one homeologous chromosome is more vulnerable to gene deletion than the other.
Results
As a null hypothesis, we first assume deletion events, on either homeolog, excise a geometrically distributed number of genes with unknown mean μ, and a number r of these events overlap to produce deleted runs of length l. There is a fractionation bias 0 ≤ ϕ ≤ 1 for deletions to fall on one homeolog rather than the other. The parameter r is a random variable with distribution π(·). We simulate the distribution of run lengths l, as well as the underlying π(·), as a function of μ, ϕ and θ, the proportion of remaining genes in duplicate form. We show how sampling l allows us to estimate μ and ϕ. The main part of this work is the derivation of a deterministic recurrence to calculate each π(r) as a function of μ, ϕ and θ.
Conclusions
The recurrence for π provides a deeper mathematical understanding of the fractionation process than do simulations. The parameters μ and ϕ can be estimated from the run lengths of single-copy regions.
Background
Whole genome doubling (WGD) creates two identical copies (homeologs) of each chromosome in a genome, with identical gene content and gene order. From this ensues the wholesale shedding of duplicate genes over evolutionary time, through random excision (the elimination of excess DNA), namely the deletion of chromosomal segments containing one or more genes, or through gene-by-gene events such as epigenetic silencing and pseudogenization [1–6].
When a duplicate gene is lost, it may be lost from one copy (homeolog) of a chromosome or the other, but generally not both, because of the necessity of conserving function. This fractionation creates an interleaving pattern; the full original gene complement becomes apparent only by consolidating [5] the two homeologous single-copy regions. In most cases, there is a degree of bias, more genes being lost from one of the homeologous regions than the other [4–7]. Fractionation is an important process in many evolutionary domains, in particular the flowering plants, since it results in a genome that is highly scrambled with respect to its pre-WGD ancestor. For this reason as well, fractionation raises a number of interesting and difficult problems for comparative genomics.
The study of fractionation is basically a study of runs, that is, runs of duplicate genes on two homeologous chromosomes alternating with runs of single-copy genes on one or both of these chromosomes. Because of the way these runs are generated biologically, and because they involve two chromosomes evolving in a non-independent way, standard statistical or combinatorial run analyses are not directly applicable.
In this paper, we present a detailed version of the excision model of fractionation with geometrically distributed deletion lengths, for which we previously analyzed a tractable, but biologically unrealistic, special case [8]. The key problem in this field is to determine μ, the mean of the hypothesized geometric distribution $\rho\left(\frac{1}{\mu}, \cdot\right)$, since this bears directly on the main biological question of the relative importance of random excision versus gene-by-gene inactivation. The relevant data consist of runs of single-copy genes (whose duplicates have been lost from the homeologous region) as well as runs of remaining duplicate pairs in two homeologous regions. The inference of μ is complicated since each run of l single copies may have been produced by an unknown number r of deletion events, either r = l events (the gene-by-gene model) or 1 ≤ r ≤ l − 1 (the random excision model), and these r samples of the distribution ρ turn out not to be independent. Thus a fundamental aspect of finding μ, and hence $\rho\left(\frac{1}{\mu}, \cdot\right)$, is to derive π(r), the proportion of runs of single-copy genes produced by r deletion events, for r = 1, 2, ⋯.
A further complication arises from the way deletion events accumulate into longer runs of singlecopy genes. The deletion of a certain number of duplicate genes may overlap the site of a previous deletion event on the same chromosome, but it is blocked by the functional constraint (mentioned above) as soon as it starts to overlap the site of a previous deletion event on the homeologous chromosome.
Another biologically important question is to determine ϕ, the proportion of deletion events that operate on one of the homeologous chromosomes, while a proportion 1 − ϕ operates on the other. We explored this question at some length in [4], but a detailed mathematical treatment of the effects of this "fractionation bias" remains to be done.
It is not difficult to simulate the fractionation process, but this gives little insight into its mathematical structure. Given that no closed form of π is likely to exist, nor any simple computing formula, our goal here is to develop a recurrence for the distribution π(r), r = 1, 2, ⋯, as a function of μ, ϕ and θ (the proportion of duplicate pairs remaining in the genome, versus single-copy genes).
This work is an attempt at creating a rigorous "null" model of duplicate loss, based on the parameters μ, ϕ and θ. This should provide a principled basis for developing statistical tests on real WGD descendants, to see whether the geometric excision hypothesis is acceptable and whether fractionation is biased or not. We will not explicitly investigate the alternative hypothesis of gene-by-gene deletion, nor do we take chromosomal rearrangement events into account; our task here is simply to set up the null statistical model with a view to enabling useful statistical tests of hypotheses for this problem.
The models
The structure of the data
The data on paralog reduction are of the form (G, H), where G and H are binary sequences indexed by ℤ, satisfying the condition that g(i) + h(i) > 0 for all i. This condition models the prohibition against deleting both copies of a duplicated gene. We may also assume that whatever process generated the 0s and 1s is homogeneous on ℤ.
The sequence G + H consists of alternating runs of 1s and 2s. We denote by p(l), l ≥ 1 the probability distribution of length of runs of 1s. For any finite interval of ℤ we denote by f(l), l ≥ 1 the empirical frequency distribution of length of runs of 1s.
The use of ℤ instead of a finite interval is consistent with our goal of getting to the mathematical essence of the process, without any complicating parameters such as interval length. In practice, we use long intervals of length at least 100,000, so that any edge effects are negligible. See [4, 8] for ad hoc ways of handling intervals of biological scale.
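As a concrete illustration, here is a minimal sketch (our own, not the authors' code; a finite window stands in for ℤ) of computing the empirical run-length frequencies f(l) of single-copy runs, i.e., maximal runs of 1s in G + H:

```python
from itertools import groupby
from collections import Counter

def run_length_frequencies(G, H):
    """Return a Counter {l: count} of maximal runs of 1s in the sequence G + H."""
    assert len(G) == len(H)
    s = [g + h for g, h in zip(G, H)]   # each entry is 1 (single-copy) or 2 (duplicate)
    freq = Counter()
    for value, block in groupby(s):
        if value == 1:                  # a run of single-copy genes
            freq[sum(1 for _ in block)] += 1
    return freq

# Example: G + H = 2 1 1 1 2 2 1 2, so runs of 1s of lengths 3 and 1.
G = [1, 0, 1, 0, 1, 1, 0, 1]
H = [1, 1, 0, 1, 1, 1, 1, 1]
print(run_length_frequencies(G, H))     # Counter({3: 1, 1: 1})
```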
The deletion events
Let ϕ, where 0 ≤ ϕ ≤ 1, be the fractionation bias. We assume a continuous time process, parameter λ(t) > 0, only to ensure no two events occur at the same time.

- We start (t = 0) with h(i) = g(i) = 1 for all i.
- At any t > 0, consider any i where h(i) = g(i) = 1. With probability λ(t)dt, a deletion event occurs anchored at position i: we choose a positive number a according to a geometric variable y with parameter 1/μ, i.e., $P\left[y=a\right]=\gamma\left(a\right)=\frac{1}{\mu}\left(1-\frac{1}{\mu}\right)^{a-1}, \quad a\ge 1$.
- Then with probability ϕ we choose to carry out the deletion on G; with probability 1 − ϕ, on H.
- If the deletion is on G, we convert g(i) = 0, g(i + 1) = 0, ⋯, g(i + a − 1) = 0, unless a "collision" occurs.
- One type of collision, the skippable collision, arises when one or more of g(i + 1), ⋯, g(i + a − 1) is already 0. In this case we skip over the existing 0 values and continue to convert the next available 1s into 0s, until a total of a 1s have been converted or a collision of the second type is encountered.
- The second type of collision, the blocking collision, arises when one or more of h(i + 1), ⋯, h(i + a − 1) (or a further term, if skipping has already occurred during this event) is already 0. In this case, further conversion of 1s to 0s is blocked, starting with the first g(x) for which h(x) = 0.
Skippable collisions are a natural way to model the excision process, since deletion of duplicates and the subsequent rejoining of the DNA directly before and directly after the excised fragment means that this fragment is no longer "visible" to the deletion process. Observationally, however, we know deletion has occurred because we have access to the sequence H, which retains copies of the deleted terms. Blocking collisions are a natural way of modeling the constraint against deleting singlecopy genes.
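The following sketch (hypothetical code, not the authors' implementation) realizes a single deletion event with both collision types, together with a sampler for the geometric length a:

```python
import random

def sample_geometric(mu, rng=random):
    """Sample a >= 1 with P[a] = (1/mu) * (1 - 1/mu)**(a - 1), i.e. mean mu."""
    a = 1
    while rng.random() > 1.0 / mu:
        a += 1
    return a

def deletion_event(target, other, i, a):
    """One deletion event anchored at index i with nominal length a.

    `target` is the chromosome drawn for deletion (G with probability phi,
    otherwise H) and `other` is its homeolog. Positions already 0 on `target`
    are skipped over (skippable collision); the event stops at the first
    position where `other` is 0 (blocking collision), since that gene is
    already single-copy. Returns the number of genes actually deleted."""
    deleted, x = 0, i
    while deleted < a and x < len(target):
        if other[x] == 0:        # blocking collision: deletion stops here
            break
        if target[x] == 1:       # convert a 1 to 0; existing 0s are skipped
            target[x] = 0
            deleted += 1
        x += 1
    return deleted
```

Note that `target` and `other` can never both be 0 at a position, so the order of the two tests does not matter.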
When the deletion event has to skip over previous 0s, this hides the anchor i and length a of previous deletion events. Denote by r the random variable indicating the total number of deletion events responsible for a run. Then, given r = r, the run length z is distributed as the sum of r geometric variables, which would result in a negative binomial distribution were these geometric variables independent. They are not, however, since events with large a tend to group together in runs with large r, while an event with small a is more likely to constitute by itself a run with r = 1 [8].
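To make the counterfactual explicit (this is not a result of the model, only the benchmark it fails to satisfy): if the r contributing geometric lengths were independent, the run length would follow the negative binomial law

$P\left[z = l \mid r = r\right] = \binom{l-1}{r-1}\left(\frac{1}{\mu}\right)^{r}\left(1-\frac{1}{\mu}\right)^{l-r}, \quad l \ge r,$

whereas in the actual process the clustering of large events into runs with large r invalidates this formula.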
Table 1. Deletions with skipping and blocking. For each event, i is the anchor and a the nominal deletion length; the upper row of each pair is G and the lower row H, over positions −7 to 8. The r column lists, for each run of single-copy genes, the number of deletion events that produced it.

Event   i   a       −7 −6 −5 −4 −3 −2 −1  0  1  2  3  4  5  6  7  8    r
Start           G:   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
                H:   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
1      −1   3   G:   1  1  1  1  1  1  0  0  0  1  1  1  1  1  1  1    1
                H:   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
2               G:   1  1  1  1  1  1  0  0  0  1  1  1  1  1  1  1    1
       −4   1   H:   1  1  1  0  1  1  1  1  1  1  1  1  1  1  1  1    1
3       5   1   G:   1  1  1  1  1  1  0  0  0  1  1  1  0  1  1  1    1, 1
                H:   1  1  1  0  1  1  1  1  1  1  1  1  1  1  1  1    1
4       4   3   G:   1  1  1  1  1  1  0  0  0  1  1  0  0  0  0  1    1, 2
                H:   1  1  1  0  1  1  1  1  1  1  1  1  1  1  1  1    1
5               G:   1  1  1  1  1  1  0  0  0  1  1  0  0  0  0  1    2
       −5   4   H:   1  1  0  0  0  0  1  1  1  1  1  1  1  1  1  1    3

Event 4 (a = 3) deletes g(4), skips the 0 created at position 5 by event 3, and deletes g(6) and g(7), so the run 4⋯7 comprises r = 2 events. Event 5 (a = 4) deletes h(−5), skips the 0 at −4 left by event 2, deletes h(−3) and h(−2), and is then blocked at −1, where g(−1) = 0; the resulting run of single-copy genes spans −5⋯1 and comprises r = 3 events.
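As a consistency check, the events of Table 1 can be replayed with the deletion_event sketch given earlier (again illustrative code only; position p in −7⋯8 maps to list index p + 7):

```python
G, H = [1] * 16, [1] * 16
deletion_event(G, H, -1 + 7, 3)   # event 1: on G, i = -1, a = 3
deletion_event(H, G, -4 + 7, 1)   # event 2: on H, i = -4, a = 1
deletion_event(G, H,  5 + 7, 1)   # event 3: on G, i =  5, a = 1
deletion_event(G, H,  4 + 7, 3)   # event 4: on G, skips the 0 at position 5
deletion_event(H, G, -5 + 7, 4)   # event 5: on H, skips -4, blocked at -1
print(G)   # [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
print(H)   # [1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

The printed rows match the last two lines of the table.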
Results
Simulations to determine π
We carried out simulations on an interval of ℤ of length 100,000. This enabled us to use a discrete time process instead of the continuous time process on ℤ. The "anchors" for the deletion events were chosen at random among the currently undeleted genes. The remaining steps were carried out as described in the previous section and Table 1. Because each simulation run samples thousands of deletions, it sufficed to do 100 runs for each value of the parameters μ and ϕ studied.
We mention that any edge effects in our simulation are negligible: working with G and H on an interval of ℤ of length 100,000 or, as previously [8], of length 300,000 gives virtually the same results.
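A sketch of such a simulation run (our own illustrative code, under the stated assumptions: uniform anchor choice among duplicated positions, a fixed window of length 100,000, no special edge handling), estimating π(r) once θ reaches a target value:

```python
import random
from itertools import groupby
from collections import Counter

def simulate_pi(n=100_000, mu=5.0, phi=0.7, theta_target=0.5, seed=0):
    rng = random.Random(seed)
    G, H = [1] * n, [1] * n
    eid = [None] * n                      # which event deleted position x
    event, duplicates = 0, n
    while duplicates / n > theta_target:
        i = rng.randrange(n)              # anchor: a still-duplicated position
        while not (G[i] and H[i]):
            i = rng.randrange(n)
        target, other = (G, H) if rng.random() < phi else (H, G)
        a = 1                             # geometric length with mean mu
        while rng.random() > 1.0 / mu:
            a += 1
        event += 1
        x, deleted = i, 0
        while deleted < a and x < n:
            if other[x] == 0:             # blocking collision
                break
            if target[x] == 1:            # skip existing 0s otherwise
                target[x] = 0
                eid[x] = event
                deleted += 1
                duplicates -= 1
            x += 1
    # r for each run of single-copy genes = number of distinct events in it
    pi = Counter()
    flags = [(g + h == 1, e) for g, h, e in zip(G, H, eid)]
    for is_single, block in groupby(flags, key=lambda t: t[0]):
        if is_single:
            pi[len({e for _, e in block})] += 1
    total = sum(pi.values())
    return {r: c / total for r, c in sorted(pi.items())}

print(simulate_pi())
```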
A recurrence for π(r)
We are interested in inferring μ from the observed distribution of run lengths and the proportion θ of undeleted terms, i.e., undeleted genes. At the outset θ = 1; as t → ∞, θ → 0. We are not, however, interested in t itself, since it is not observable, and any inference we can make about μ will in any case depend only on run lengths and θ. On the other hand, r, the number of deletion events per run, is an interesting variable, since we can assume run length is close to rμ on average, at least for small values of θ, and we can model the evolution of r directly. We consider the distribution π as a function of μ, ϕ and θ. Each new deletion event changes the run structure in one of four ways (a sketch classifying events into these cases follows the list):
1. new runs (r = 1) falling completely within an existing run of undeleted terms, not touching the preceding or following run of deleted terms (type A in Figure 3);
2. runs that touch, overlap or entirely engulf exactly one previous run of deleted terms with r ≥ 1, thus lengthening that run to r + 1 events (types B and C in Figure 3);
3. runs that touch, overlap or engulf, by the skipping process, two previous runs of r_1 and r_2 events respectively, creating a new run of r_1 + r_2 + 1 events and diminishing the total number of runs by 1 (including types D and E in Figure 3);
4. runs that touch, overlap or engulf, by the skipping process, k > 2 previous runs of r_1, ⋯, r_k events respectively, creating a new run of r_1 + ⋯ + r_k + 1 events and diminishing the total number of runs by k − 1 (not illustrated in Figure 3). Case 3 above may be considered a special case of this for k = 2, and Case 2 for k = 1.
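As promised, a hypothetical helper (our own illustration, not from the paper) that classifies a completed event by k, the number of previously existing runs of deleted terms it touches, overlaps or engulfs:

```python
def classify_event(prior_runs, span):
    """prior_runs: sorted list of (start, end) runs of deleted terms (inclusive);
    span: (start, end) interval the new event covered, including skipped 0s.
    Returns the case from the list above, determined by k."""
    s, e = span
    # touching includes adjacency, since adjacent deleted runs merge
    k = sum(1 for a, b in prior_runs if a <= e + 1 and s - 1 <= b)
    if k == 0:
        return "case 1: new run, r = 1 (type A)"
    if k == 1:
        return "case 2: lengthen one run, r -> r + 1 (types B, C)"
    if k == 2:
        return "case 3: merge two runs, r1 + r2 + 1 (types D, E)"
    return f"case 4: merge {k} runs"
```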
The distribution ρ(l) of lengths of the undeleted runs is assumed to be geometric. Similarly the lengths of successive undeleted runs (indeed all undeleted runs) are assumed to be independent. While we do not have a rigorous proof of these assumptions, they have been confirmed by extensive simulations.
Let ϕ_1 and ϕ_2 be the proportions of deletion events affecting homeologous chromosomes 1 and 2, respectively, so that ϕ_1 + ϕ_2 = 1. Let τ(r) be the proportion of runs of single-copy genes with terms in both chromosomes. (τ(1) ≡ 0 and, initially, τ(r) = 0 for r = 2, 3, ⋯.) Note that in such a run, the term(s) at the extreme left were deleted from chromosome i with probability ϕ_i, and similarly for the terms at the extreme right.
Events of type A_i create runs of deleted terms with r = 1 from one chromosome only. Note that the last line of equation (2), and equation (3), involve the collection of terms, reducing the number of nested summations in order to speed up calculation. While these are not lengthy calculations to start with, we display the speed-up as a simple illustration of the important efficiencies implemented for the more difficult cases treated below.
Events of type B_ii turn a deleted run with r events from one chromosome into a run with r + 1 events. Events of type B_if, with i ≠ f, turn a deleted run with r events into a run with r + 1 events.
which can be calculated using an expansion such as that in (6). Events of type C_ii turn a deleted run with r events from one chromosome into a run with r + 1 events.
Events of type C_if, with i ≠ f, turn a deleted run with r events into a run with r + 1 events. Note that (9) does not contain terms of the form aγ(a), as do (3), (5) and (7), since in this event deletion is blocked beyond the existing run of deletions; the probability weight is thus concentrated on deletions of lesser length.
in which the reduction of the number of nested summations is key to the computability of the entire calculation.
which can be calculated using an expansion such as that in (10). Events of type D_iii turn two deleted runs with r and s events, respectively, both from the same chromosome, into a run with r + s + 1 events.
Events of type D_iif, with i ≠ f, turn two deleted runs with r and s events, respectively, the latter containing terms from both chromosomes, into a single run with r + s + 1 events.
Events of type E_iii turn two deleted runs with r and s events, respectively, all from one chromosome, into a single run with r + s + 1 events. Events of type E_iif, E_ifi and E_iff, with i ≠ f, turn two deleted runs with r and s events, respectively, into a single run with r + s + 1 events.
We reiterate here that the last lines of each of (2), (6) and (10) include the collection of terms, significantly cutting down on computing time when these formulae are implemented, especially in the case of (10).
In this initial model, we neglect the merger of three or more runs of deletions. There is no conceptual difficulty in including three or more mergers, but the proliferation of embedded summations would require excessive computation. Thus we should expect the model to be adequate until θ gets very small, when mergers of several runs at a time become common.
The new proportion θ′ of undeleted terms is $\bar{v}' / \left(\bar{u}' + \bar{v}'\right)$.
We implement equations (1) to (33) as a recurrence with a step size parameter Λ controlling the number of events treated with the same p_A, p_B, p_C, p_D, p_E, δ_π(·) and δ_τ(·) between successive normalizations; that is, we use Λδ_π(·) and Λδ_τ(·) instead of δ_π(·) and δ_τ(·) in (25)–(33). The choice of Λ determines the trade-off between computing speed and accuracy.
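In outline, the stepping scheme looks as follows (a skeleton only: `increments` is a hypothetical placeholder standing for one joint evaluation of the paper's equations (1)–(24), which are not reproduced here, and all names are our own):

```python
def recurrence(mu, phi, step, theta_stop, increments, pi0):
    """Skeleton of the deterministic recurrence; `step` is the paper's Lambda.

    increments(pi, theta, mu, phi) must return (delta_pi, delta_theta), the
    raw per-event increments; the update below mimics using Lambda * delta
    in (25)-(33), followed by renormalization of pi to a distribution."""
    pi, theta = dict(pi0), 1.0
    while theta > theta_stop:
        delta_pi, delta_theta = increments(pi, theta, mu, phi)
        for r, d in delta_pi.items():
            pi[r] = pi.get(r, 0.0) + step * d
        theta += step * delta_theta
        z = sum(pi.values())                      # renormalize between steps
        pi = {r: v / z for r, v in pi.items()}
    return pi, theta
```

A smaller `step` tracks the continuous process more closely at the cost of more evaluations of the increment equations.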
Biased fractionation with large deletion sizes leads to slow initial growth in the proportions of events of types D and E and "other".
There are at least two reasons for the discrepancies between the simulations and the recurrence observed in Figure 4. At the outset, since we used a large step size Λ for the computationally costly recurrence, its trajectory lags behind the simulation, especially with respect to the slower decrease in p_A and the slower increase in p_B + p_C. Later discrepancies are partially due to not accounting for the merger of three or more runs; these can be estimated, and are summarized as "other" in the diagram, but the quantities involved are not fed back into the recurrence through (26).
Other possible sources of error might be due to the cutoffs in x used for calculations involving γ(x) and ρ(x). However, extensive testing of various cutoff values has indicated such errors to be negligible in our implementation.
Conclusions
We have developed a model for the fractionation process based on deletion events excising a geometrically distributed number of contiguous paralogs from either one of a pair of homeologous chromosomes. The data that prompt this model exist because of a functional biological constraint against deleting both copies of a duplicate pair of genes.
The mathematical framework we propose should eventually serve for testing the geometric excision hypothesis against alternatives such as gene-by-gene inactivation, although we have not developed such tests in this paper. In addition, further developments could treat the gene-by-gene inactivation model as the null hypothesis, and the geometric excision model, with mean greater than 1, as the alternative hypothesis.
Simulations of these models indicate the feasibility of estimating the mean μ of the deletion event process and the fractionation bias ϕ from observations of the length of runs of singlecopy genes and the overall proportion of singlecopy genes.
The main question we have explored is the exact derivation of π, the distribution of the number of deletion events contributing to a run of single-copy genes. The simulations are convenient in practice, since they depend only on the parameters μ and ϕ as they evolve over time, but they give little mathematical insight. Our most important advance is a deterministic recurrence for the π(r) as the proportion θ of undeleted genes decreases. This takes into account the appearance of new runs over time, the lengthening of existing runs, and the merger of two existing runs with a new deletion event to form a single, longer one. This calculation fits the process as simulated rather well and seems promising for further development.
In order to validate our fractionation model empirically, we will have to expand it to incorporate the rearrangement events that are pervasive in genome evolution. Our previous work on this problem shows that the effect of rearrangement is to seriously bias the observable, credible instances of fractionation towards smaller runs of deleted genes [4, 8]. Future work on this difficult problem will have either to rely on careful modeling of this ascertainment bias or else find a way to incorporate into the model deleted runs that have been interrupted by rearrangements.
Declarations
Acknowledgements
Research funded in part by a Discovery grant from the Natural Sciences and Engineering Research Council of Canada.
This article has been published as part of BMC Genomics Volume 13 Supplement 1, 2012: Selected articles from the Tenth Asia Pacific Bioinformatics Conference (APBC 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/13?issue=S1.
References
1. Byrne KP, Wolfe KH: The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res. 2005, 15: 1456-1461. doi:10.1101/gr.3672305
2. van Hoek MJ, Hogeweg P: The role of mutational dynamics in genome shrinkage. Mol Biol Evol. 2007, 24: 2485-2494. doi:10.1093/molbev/msm183
3. Byrnes JK, Morris GP, Li WH: Reorganization of adjacent gene relationships in yeast genomes by whole-genome duplication and gene deletion. Mol Biol Evol. 2006, 23(6): 1136-1143. doi:10.1093/molbev/msj121
4. Sankoff D, Zheng C, Zhu Q: The collapse of gene complement following whole genome duplication. BMC Genomics. 2010, 11: 313. doi:10.1186/1471-2164-11-313
5. Langham RJ, Walsh J, Dunn M, Ko C, Goff SA, Freeling M: Genomic duplication, fractionation and the origin of regulatory novelty. Genetics. 2004, 166: 935-945. doi:10.1534/genetics.166.2.935
6. Thomas BC, Pedersen B, Freeling M: Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome Res. 2006, 16: 934-946. doi:10.1101/gr.4708406
7. Edger PP, Pires JC: Gene and genome duplications: the impact of dosage-sensitivity on the fate of nuclear genes. Chromosome Res. 2009, 17: 699-717. doi:10.1007/s10577-009-9055-9
8. Wang B, Zheng C, Sankoff D: Fractionation statistics. BMC Bioinformatics. 2011, 12(Suppl 9): S5. doi:10.1186/1471-2105-12-S9-S5
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.