Simulations to determine π
We carried out simulations on an interval of ℤ of length 100,000. This enabled us to use a discrete time process instead of the continuous time process on ℤ. The "anchors" for the deletion events were chosen at random among the currently undeleted genes. The remaining steps were carried out as described in the previous section and Table 1. Because each simulation run samples thousands of deletions, it sufficed to do 100 runs for each value of the parameters μ and ϕ studied.
The top row of Figure 1 compares π(r) when θ = 0.5 and θ = 1, for μ = 2, 3, 6, and 11, when ϕ = 0.5. We can see that the number of deletion events contributing to a run is somewhat dependent on μ when half of the the sequence has been deleted, but is strongly dependent when 90% has been deleted. In the bottom row, the graph on the left shows that run length l is distributed very differently for μ = 2, 3, 6 and μ = 11, when the proportion of the sequence deleted is exactly the same. This strongly suggests that observing the run length distribution and the overall proportion of deletions should allow us to infer μ. Moreover the shape of these distributions is sensitive to ϕ.
We mention that any edge effects in our simulation are negligible. Whether we work with G and H on an interval of ℤ of length 100,000 or, as previously [8], length 300,000, gives virtually the same results.
Figure 2 shows the relationship, for three values of the fractionation bias ϕ and for a range of values of μ, between the proportion of genes deleted, on one chromosome or the other, and the average run length. This confirms that average run length and overall proportion of deletion θ, both observable, can be used to infer μ rather accurately, and to infer ϕ, perhaps with somewhat less precision. The latter parameter can, however, be inferred from the shape of the run length distribution in Figure 1 (bottom) or estimated directly from the proportion of single-copy genes on each homolog.
A recurrence for π(r)
We are interested in inferring μ from the observed distribution of run lengths and the proportion θ of undeleted terms, i.e., undeleted genes. At the outset θ = 1. As t → ∞, θ → 0. We are not, however, interested in t, since it is not observable and any time-based inference we can make about μ will depend only on run lengths and θ in any case. On the other hand, r, the number of deletion events per run is an interesting variable since we can assume run length is close to rμ on average, at least for small values of θ, and we can model the evolution of r directly We consider the distribution π as a function of μ, ϕ and θ.
As π changes, probability weight is redistributed among several types of run:
-
1.
new runs (r = 1) falling completely within an existing run of undeleted terms, not touching the preceding or following run of deleted terms, type A in Figure 3,
-
2.
runs that touch, overlap or entirely engulf exactly one previous run of deleted terms with r ≥ 1, thus lengthening that run to r + 1 events, types B and C in Figure 3,
-
3.
runs that touch, overlap or engulf, by the skipping process, two previous runs of r1 and r2 events respectively, creating a new run of r1 + r2 + 1 events, and diminishing the total number of runs by 1, including types D and E in Figure 3,
-
4.
runs that touch, overlap or engulf, by the skipping process, k > 2 previous runs of of r1, ⋯, r
k
events respectively, creating a new run of r1 + ⋯ + r
k
+ 1 events, and diminishing the total number of runs by k - 1, not illustrated in Figure 3. Case 3 above may be considered a special case of this for k = 2 and Case 2 for k = 1.
The first process, involving a deletion event of length a requires a run of undeleted terms of at least a + 2. What can we say about runs of undeleted terms? We know that runs of deleted terms alternate with runs of undeleted terms, so that there is one run of the former for each of the latter. The mean lengths and of the deleted runs and the undeleted runs, respectively, should satisfy:
The distribution ρ(l) of lengths of the undeleted runs is assumed to be geometric. Similarly the lengths of successive undeleted runs (indeed all undeleted runs) are assumed to be independent. While we do not have a rigorous proof of these assumptions, they have been confirmed by extensive simulations.
Let ϕ1 and ϕ2 be the proportion of deletion events affecting homeologous chromosomes 1 and 2, respectively, so that ϕ1 + ϕ2 = 1. Let τ(r) be the proportion of runs of single-copy genes with terms in both chromosomes. (τ(1) ≡ 0 and, initially, τ(r) = 0 for r = 2, 3, ⋯.) Note that in such a run, the term(s) at the extreme left were (was) deleted from chromosome i with probability ϕ
i
and the same for the terms at the extreme right.
The proportion of undeleted terms in runs of length l is lρ(l)/E
ρ
, where E
ρ
= ∑
l
>0 lρ(l). As depicted in Figure 3, the probabilities and that a deletion event affects chromosomes 1 or 2, respectively, and falls within a run of undeleted terms of length l without deleting the terms at either end is, for i ∈ {1, 2}
(2)
where j indexes the starting position of the deletion within the run, and a is the number of terms deleted in the event. We define the contribution to mean run length of A events to be
(3)
Events of type A
i
create runs of deleted terms with r = 1 from one chromosome only. Note that the last line of equation (2), and equation (3), involve the collection of terms, reducing the number of nested summations in order to speed up calculation. While these are not lengthy calculations to start with, we display the speed-up as a simple illustration of the important efficiencies implemented for more difficult cases to be treated below.
The probability that a deletion event on chromosome i touches only the run of deletions on chromosome f on the left of the run of undeleted terms is, for i ∈ {1, 2} and f ∈ {1, 2},
(4)
We define the contribution to mean run length of B events to be
(5)
Events of type B
ii
turn a deleted run with r events from one chromosome, into a run with r + 1 events. Events of type B
if
, with i ≠ f, turn a deleted run with r events, into a run with r + 1 events.
The probability that a deletion event, on either chromosome, does not touch the run of deletions on the left, does touch or overlap the run of deletions on the right entirely on the same chromosome (homeolog), but does not extend over the entire run of undeleted terms beyond that is, for i ∈ {1, 2}:
(6)
We define the contribution to mean run length of C
ii
events to be
(7)
which can be calculated using an expansion such as that in (6). Events of type C
ii
turn a deleted run with r events from one chromosome, into a run with r + 1 events.
The probability that a deletion event, on either chromosome, does not touch the run of deletions on the left but does touch the run of deletions on the right, partly or entirely on the other chromosome, is, for i ≠ f ∈ {1, 2}:
(8)
We define the contribution to mean run length of C
if
events to be
(9)
Events of type C
if
, with i ≠ f, turn a deleted run with r events, into a run with r + 1 events. Note that (9) does not contains terms of form aγ(a) as do (3,5,7), since in this event deletion is blocked beyond the existing run of deletions; the probability weight is thus concentrated on deletions of lesser length.
The probability that a deletion event completely overlaps the run of deletions on the right and touches or overlaps the run of deletions beyond that, all on the same chromosome, but does not extend over a further run of undeleted terms:
(10)
in which the reduction of the number of nested summations is key to the computability of the entire calculation.
We define the contribution to mean run length of D
iii
events to be
(11)
which can be calculated using an expansion such as that in (10). Events of type D
iii
turn two deleted runs with r and s events, respectively, both from the same chromosome, into a run with r + s + 1 events.
The probability that a deletion event completely overlaps the run of deletions on the right, on the same chromosome, and touches the run of deletions beyond that, partly or entirely on the other chromosome, is:
(12)
and the contribution to mean run length is
(13)
Events of type D
iif
, with i ≠ f, turn two deleted runs with r and s events, respectively, with the latter containing terms from both chromosomes, into a single run with r + s + 1 events.
The probability that a deletion event touches the run of deletions on the left of the run of undeleted terms and touches or overlaps the run of deletions on the right, all on the same chromosome, but does not extend over the entire run of undeleted terms beyond that is:
(14)
where
(15)
The probability that a deletion event touches the run of deletions on the left of the run of undeleted terms, both from the same chromosome, and touches the run of deletions on the right, partly or entirely on the other chromosome, is:
(16)
and
(17)
The probability that a deletion event touches the run of deletions on the left of the run of undeleted terms and touches or overlaps the run of deletions on the right, all on the same chromosome, but does not extend over the entire run of undeleted terms beyond that is:
(18)
and
(19)
The probability that a deletion event touches the run of deletions on the left of the run of undeleted terms and touches or overlaps the run of deletions on the right, all on the same chromosome, but does not extend over the entire run of undeleted terms beyond that is:
(20)
and
(21)
Events of type E
iii
turn two deleted runs with r and s events, respectively, all from one chromosome, into a single run with r + s + 1 events. Events of type E
iif
, E
ifi
and E
iff
,, with i ≠ f, turn two deleted runs with r and s events, respectively, into a single run with r + s + 1 events.
We reiterate here that the last lines of each of (2),(6) and (10) include the collection of terms, significantly cutting down on computing time when these formulae are implemented, especially in the case of (10).
In this initial model, we neglect the merger of three or more runs of deletions. There is no conceptual difficulty in including three or more mergers, but the proliferation of embedded summations would require excessive computation. Thus we should expect the model to be adequate until θ gets very small, when mergers of several runs at a time become common.
Let , and similarly let each of p
B
, ⋯, p
E
be the sums of their respective subscripted terms (with all combinations of i and f). We define the change δ
π
(r) in the number of runs of deleted terms with r = 1, 2, ....
(22)
(23)
For r > 2,
(24)
In an implementation on a finite interval of ℤ, the number of runs of deleted terms will change from some value R to R', where
(25)
The distribution of number of events per run will also change from π to π', where
(26)
and where the mean of the number of deleted genes per run increases from to , so that
(27)
The mean of the new distribution ρ' of run lengths of undeleted terms satisfies
(28)
The new proportion θ' of undeleted terms is .
In the same interval of ℤ, we define the change δ
τ
(r) in the number of runs containing single copy genes in both chromosomes with r = 1, 2, ....
(30)
For r > 2,
(31)
In the implementation, the number of runs of deleted terms with genes on both chromosomes will change from T(r) to T'(r), where
(32)
The proportions of runs with deletion events from both chromosomes will also change from τ to τ', where
(33)
We implement equations (1) to (33) as a recurrence with a step size parameter Λ to control the number of events using the same p
A
, p
B
, p
C
, p
D
, p
E
, δ
π
(·) and δ
τ
(·) between successive normalizations, and using Λδ
π
(·) and Λδ
τ
(·) instead of δ
π
(·) and δ
τ
(·) in (25)-(33). The choice of Λ determines the trade-off between computing speed and accuracy.
Figure 4 shows the results of our current implementation of our deterministic recurrence for the cases μ = 2 and μ = 11, for unbiased fractionation (ϕ = 0.5) and for extremely biased fractionation (ϕ = 1). The results fit simulations of the stochastic model quite well and reveal a number of tendencies. One is that unbiased fractionation with small deletions leads to the fastest drop in events of type A as θ decreases.
Biased fractionation with large deletion sizes leads to slow initial growth in the proportions of events of types D and E and "other".
There are at least two reasons for the discrepancies between the simulations and the recurrences observed in Figure 4. At the outset, since we used a large step size Λ for the computationally costly recurrence, its trajectory lags behind the simulation, especially with respect to the slower decrease in p
A
and slower increase in p
B
+ p
C
. Later discrepancies are partially due to not accounting for the merger of three or more runs. These can be estimated and are summarized as "other " in the diagram, but the quantities involved are not fed back to the recurrence through (26).
Other possible sources of error might be due to the cutoffs in x used for calculations involving γ(x) and ρ(x). However, extensive testing of various cutoff values has indicated such errors to be negligible in our implementation.