We propose a new method for estimating the evolutionary distance between two genomes with high accuracy under the FBM. A key component of our method is the analytical estimation of \(c_{\ell}\) for various ℓ≥2 in terms of k and n, which we describe in the Theoretical analysis subsection below. From this estimation, in the Estimation algorithm subsection we solve the inverse problem to find the true evolutionary distance k (which is our goal) and the number of fragile regions n (as a by-product). In our analysis, we consider only relatively small ℓ and assume that n and k are sufficiently large (see Theorem 1).
Theoretical analysis
Theorem 1
Let \(P_{n}\) be a genome with n fragile regions and let \(Q_{n}\) be a genome obtained from \(P_{n}\) with \(k=\lfloor \gamma n/2\rfloor\) random DCJs for some γ>0. Then, for any fixed ℓ, the proportion of edges that belong to ℓ-cycles in \(G(P_{n},Q_{n})\) is
$$\frac{\ell c_{\ell}}{n} = e^{-\gamma\ell}\frac{(\gamma\ell)^{\ell-1}}{\ell!} + O_{p}\left(\frac{1}{\sqrt{n}}\right) \quad \text{as } n \to \infty \,, $$
where \(O_{p}\left(\frac{1}{\sqrt{n}}\right)\) denotes a stochastically bounded term.
To prove Theorem 1, we will need some lemmas.
Let us consider a sequence \(\mathcal{D}\) of k random DCJs transforming the graph \(G(P_{n},P_{n})\) into the graph \(G(P_{n},Q_{n})\). Each DCJ in \(\mathcal{D}\) either merges two cycles (merging DCJ), splits one cycle into two (splitting DCJ), or preserves the cycle structure (preserving DCJ). We call a set of ℓ black edges proper for the transformation \(\mathcal{D}\) if in the breakpoint graph \(G(P_{n},Q_{n})\) these edges form an ℓ-cycle resulting entirely from ℓ−1 merging DCJs. We remark that each ℓ-cycle in \(G(P_{n},Q_{n})\) either results from ℓ−1 merging DCJs (we denote the number of such cycles by \(\tilde{c}_{\ell}\)) or is formed with at least one splitting or preserving DCJ. The following lemmas provide an asymptotic for \(\tilde{c}_{\ell}\) and an upper bound for \(c_{\ell} - \tilde{c}_{\ell}\).
Lemma 1
The mean value of \(\tilde {c}_{\ell }\) has the following asymptotic:
$$\mathrm{E}\left(\tilde{c}_{\ell}\right) = \frac{n}{\ell} e^{-\gamma\ell}\frac{(\gamma\ell)^{\ell-1}}{\ell!} + O(1). $$
Proof
Let us fix a set A of ℓ black edges and find the probability that A is proper for \(\mathcal{D}\). We remark that this probability does not depend on the content of A but only on its size |A|=ℓ, and denote it by \(\tilde{p}_{n,k,\ell}\). There are \(2^{\ell-1}(\ell-1)!\) ways to arrange ℓ black edges into an ℓ-cycle, since there exist (ℓ−1)! circular permutations of length ℓ and \(2^{\ell-1}\) ways to assign directions to their elements. For any ℓ-cycle, there are \(\ell^{\ell-2}\) ways to obtain it as a result of ℓ−1 DCJs [8].
We represent \(\mathcal {D}\) as the union \(\mathcal {S} \cup \mathcal {\bar {S}}\), where the subsequence \(\mathcal {S}\) contains ℓ−1 DCJs connecting edges from A into an ℓ-cycle, and the subsequence \(\mathcal {\bar {S}}\) contains the rest of the DCJs. Since \(|\mathcal {D}|=k\) and \(|\mathcal {S}|=\ell -1\), there are \(\binom {k}{\ell -1}\) ways to choose positions for elements of \(\mathcal {S}\) in \(\mathcal {D}\). The k−ℓ+1 DCJs from \(\mathcal {\bar {S}}\) operate on the n−ℓ red edges that are not incident to any black edge from A, and for each pair of red edges there are two possible ways to recombine them with a DCJ. This gives
$$2^{k-\ell+1}\binom{n-\ell}{2}^{k-\ell+1} $$
possible subsequences \(\mathcal {\bar {S}}\). So, there are
$$2^{\ell-1}(\ell-1)! \ell^{\ell-2} \binom{k}{\ell-1} 2^{k-\ell+1} \binom{n-\ell}{2}^{k-\ell+1} $$
transformations \(\mathcal {D}\) such that A is proper for \(\mathcal {D}\). The total number of k-step transformations is equal to \(2^{k} \binom {n}{2}^{k} \,.\) Then the probability \(\tilde {p}_{n,k,\ell }\) can be found as follows
$$ \tilde{p}_{n,k,\ell} = \frac{\ell^{\ell-2}(\ell-1)!\binom{k}{\ell-1}\binom{n-\ell}{2}^{k-\ell+1}}{\binom{n}{2}^{k}} \,. $$
(1)
Since there are \(\binom {n}{\ell }\) ways to choose the set A of black edges, the average number \(\mathrm {E}(\tilde {c}_{\ell })\) of ℓ-cycles obtained by only merging DCJs equals
$$ \begin{aligned} \binom{n}{\ell} \tilde{p}_{n,k,\ell} &= \ell^{\ell-2}(\ell-1)!\binom{n}{\ell}\binom{k}{\ell-1} \cdot\frac{\binom{n-\ell}{2}^{k-\ell+1}}{\binom{n}{2}^{k}} \\ &= \ell^{\ell-2}(\ell-1)! \, \frac{n^{\ell}}{\ell!}\cdot\frac{k^{\ell-1}}{(\ell-1)!}\frac{(n-\ell)^{2(k-\ell+1)}}{2^{k-\ell+1}}\\ &\quad\times\frac{2^{k}}{n^{2k}}+O(1) \\ &= \ell^{\ell-2} k^{\ell-1}\,\cdot \frac{n^{\ell}}{\ell!}\cdot \frac{2^{\ell-1}}{n^{2\ell - 2}} \left(1-\frac{\ell}{n}\right)^{2(k-\ell+1)}\\&\quad+O(1) \\ &= \left(\frac{2k\ell}{n}\right)^{\ell-1} \frac{n}{\ell}\cdot \frac{1}{\ell!}e^{-\gamma \ell}+O(1) \\ &= \frac{n}{\ell} e^{-\gamma\ell}\frac{(\gamma\ell)^{\ell-1}}{\ell!} + O(1) \,. \end{aligned} $$
(2)
□
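The asymptotic in Lemma 1 can be checked numerically. The sketch below is ours (not part of the original analysis); it uses the illustrative parameters n=1000 (the simulation setting used later in the text) and k=500, so that γ=1, and compares the exact expectation \(\binom{n}{\ell}\tilde{p}_{n,k,\ell}\) from (1) with the leading term of (2), working in log space to avoid overflow.

```python
import math

def log_comb(a, b):
    """Logarithm of the binomial coefficient C(a, b), via log-gamma."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)

def exact_mean(n, k, l):
    """E(c~_l) = C(n, l) * p~_{n,k,l}, with p~_{n,k,l} taken from Eq. (1)."""
    log_p = ((l - 2) * math.log(l) + math.lgamma(l)      # l^(l-2) * (l-1)!
             + log_comb(k, l - 1)
             + (k - l + 1) * log_comb(n - l, 2)
             - k * log_comb(n, 2))
    return math.exp(log_comb(n, l) + log_p)

def asymptotic_mean(n, k, l):
    """Leading term of Eq. (2): (n/l) e^{-gamma*l} (gamma*l)^(l-1) / l!, gamma = 2k/n."""
    gamma = 2.0 * k / n
    return (n / l) * math.exp(-gamma * l) * (gamma * l) ** (l - 1) / math.factorial(l)

# Illustrative parameters (not results from the paper): n = 1000, k = 500.
for l in range(2, 7):
    print(l, round(exact_mean(1000, 500, l), 2), round(asymptotic_mean(1000, 500, l), 2))
```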
Lemma 2
The variance of \(\tilde {c}_{\ell }\) is bounded: \(\text {Var} \left (\tilde {c}_{\ell }\right) = O(n)\).
Proof
For a transformation \(\mathcal{D}\) and a set A of black edges, we define a random variable \(X_{A}\):
$$X_{A} = \left\{\begin{array}{ll} 1, & \text{if}\ A\ \text{is proper for}\ \mathcal{D},\\ 0, & \textrm{otherwise.} \end{array}\right. $$
By definition, \(\tilde {c}_{\ell } = \sum _{A} X_{A},\) where the sum is taken over all possible sets of ℓ black edges. The variance of \(\tilde {c}_{\ell }\) is equal to
$$\mathrm{E} \left(\tilde{c}_{\ell}^{2}\right) - \mathrm{E} \left(\tilde{c}_{\ell}\right)^{2} \,. $$
Since
$$\mathrm{E} \left(\tilde{c}_{\ell}^{2}\right) = \sum_{A} \sum_{B} \mathrm{E} \left(X_{A} X_{B}\right) \,, $$
we will find \(\mathrm{E}(X_{A} X_{B})\) for each pair A, B of sets of ℓ black edges:
$$\mathrm{E} \left(X_{A} X_{B}\right) = \left\{\begin{array}{ll} \tilde{p}_{n,k,\ell}, & \textrm{if }A=B \,,\\ \tilde{p}_{n,k,\ell}\cdot\tilde{p}_{n-\ell,k-\ell+1,\ell}, & \textrm{if }A\cap B = \emptyset \,,\\ 0, & \textrm{otherwise,} \end{array}\right. $$
where \(\tilde {p}_{n,k,\ell }\) is defined by (1). This implies
$$\mathrm{E} \left(\tilde{c}_{\ell}^{2}\right) = \binom{n}{\ell} \tilde{p}_{n,k,\ell} +\binom{n}{\ell} \binom{n-\ell}{\ell} \tilde{p}_{n,k,\ell}\cdot \tilde{p}_{n-\ell,k-\ell+1,\ell} \,. $$
Finally, using (2), we obtain
$${\begin{aligned} \text{Var}\left(\tilde{c}_{\ell}\right) &= \binom{n}{\ell} \tilde{p}_{n,k,\ell} +\binom{n}{\ell} \binom{n-\ell}{\ell} \tilde{p}_{n,k,\ell}\cdot \tilde{p}_{n-\ell,k-\ell+1,\ell} - \mathrm{E} \left(\tilde{c}_{\ell}\right)^{2}\\ &=\binom{n}{\ell} \tilde{p}_{n,k,\ell}\left(1+\binom{n-\ell}{\ell}\tilde{p}_{n-\ell,k-\ell+1,\ell}-\binom{n}{\ell} \tilde{p}_{n,k,\ell}\right)\\ &=\left(\frac{n}{\ell} e^{-\gamma\ell}\frac{(\gamma\ell)^{\ell-1}}{\ell!} + O(1)\right)\left(1+\frac{n-\ell}{\ell} e^{-\gamma\ell}\frac{(\gamma\ell)^{\ell-1}}{\ell!}\right.\\ &\left.\quad-\frac{n}{\ell} e^{-\gamma\ell}\frac{(\gamma\ell)^{\ell-1}}{\ell!} + O(1)\right) \\ &= \left(\frac{n}{\ell} e^{-\gamma\ell}\frac{(\gamma\ell)^{\ell-1}}{\ell!} + O(1)\right) \left(1- e^{-\gamma\ell}\frac{(\gamma\ell)^{\ell-1}}{\ell!} + O(1)\right)\\ &= O(n). \end{aligned}} $$
□
For a positive integer M, we call a splitting DCJ M-splitting if it splits some m-cycle into an i-cycle and an (m−i)-cycle for some i≤M. Similarly, we call a preserving DCJ M-preserving if it operates on an m-cycle for some m≤M.
Lemma 3
Let M be any positive integer. Then (i) the number of M-splitting DCJs in \(\mathcal{D}\) is stochastically dominated by a Poisson random variable with parameter \(\gamma M/2\); (ii) the number of M-preserving DCJs in \(\mathcal{D}\) is stochastically dominated by a Poisson random variable with parameter \(\gamma M^{2}/4\).
Proof
To prove (i), we notice that the probability that a DCJ from \(\mathcal {D}\) splits a fixed m-cycle into an i-cycle and an (m−i)-cycle is \(\frac {m}{n(n-1)}\) (if i≠m/2) or \(\frac {m}{2n(n-1)}\) (if i=m/2). The probability that a DCJ splits a cycle into an i-cycle with i≤M and another cycle can be bounded by
$$\sum_{i=1}^{M} \sum_{m>i} \frac{m}{n(n-1)}c_{m} \leq \frac{M}{n-1} \,. $$
This bound implies that the number of M-splitting DCJs in \(\mathcal{D}\) is stochastically dominated by the random variable Y that equals j with probability
$$ \binom{k}{j}\left(\frac{M}{n-1}\right)^{j} \left(1-\frac{M}{n-1}\right)^{k-j}\!= e^{-\gamma M/2} \frac{(\gamma M/2)^{j}}{j!} + o(1)\,. $$
Since \(e^{-\gamma M/2} \frac{(\gamma M/2)^{j}}{j!}\) represents the probability that a Poisson random variable with parameter \(\gamma M/2\) equals j, the proof of (i) is completed.
To prove (ii), we notice that the probability that a DCJ from \(\mathcal {D}\) operates on two red edges from the fixed m-cycle and does not split this cycle equals \(\frac {m(m-1)}{2n(n-1)}\). Summing over all m≤M, we bound the probability that a fixed DCJ is M-preserving as
$$\sum_{m \leq M} \frac{m^{2}}{2n(n-1)}c_{m} \leq \frac{M^{2}}{2(n-1)} $$
and the proof is completed similarly to the proof of (i) above. □
Now we can prove Theorem 1.
Proof
Lemma 3 implies that \(c_{\ell} - \tilde{c}_{\ell}\), the number of ℓ-cycles obtained with at least one splitting or preserving DCJ, is of order O(1). Indeed, each such ℓ-cycle uniquely determines the last splitting or preserving DCJ that participates in the formation of this cycle (i.e., operates on some of the cycle vertices). Clearly, this last splitting (resp. preserving) DCJ is ℓ-splitting (resp. ℓ-preserving). Since each such last DCJ corresponds to at most two cycles, the number \(c_{\ell} - \tilde{c}_{\ell}\) does not exceed twice the number of ℓ-splitting and ℓ-preserving DCJs. Since, by Lemma 3, the number of ℓ-splitting and ℓ-preserving DCJs is stochastically dominated by a Poisson random variable and hence bounded in probability, we have \(c_{\ell} - \tilde{c}_{\ell} = O(1)\).
By Lemmas 1 and 2, the number \(\tilde{c}_{\ell}\) has mean value \(\frac{n}{\ell} e^{-\gamma\ell}\frac{(\gamma\ell)^{\ell-1}}{\ell!} + O(1)\) and standard deviation of order \(O(\sqrt{n})\). The fraction of edges in ℓ-cycles is \(\frac{\ell c_{\ell}}{n}\), so applying Chebyshev’s inequality we obtain
$$\frac{\ell c_{\ell}}{n} = e^{-\gamma\ell}\frac{(\gamma\ell)^{\ell-1}}{\ell!} +O_{p}\left(\frac{1}{\sqrt{n}}\right)\,, $$
which completes the proof. □
We remark that for γ<1, the sequence
$$p_{\ell} = e^{-\gamma\ell}\frac{(\gamma\ell)^{\ell-1}}{\ell!},\qquad \ell=1,2,\dots $$
defines a probability mass function, which characterizes a Borel distribution [9]. So, if γ<1, for a randomly chosen edge, the length of the cycle that contains it follows a Borel distribution with parameter γ. Moreover, the results of Erdös and Rényi [10] imply that
$$1-\sum_{\ell=1}^{\infty}{e^{-\gamma\ell}\frac{(\gamma\ell)^{\ell-1}}{\ell\cdot\ell!}} = \frac{\gamma}{2} \,, $$
which can also be seen empirically in Fig. 3. It follows that for γ<1, d=k(1+o(1)), so the DCJ distance closely approximates the true distance. Since d=k corresponds to the parsimony assumption, we say that the process is in the parsimony phase whenever γ<1.
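This identity can be verified numerically by truncating the series; the sketch below is ours (not from the original text) and simply compares the truncated left-hand side with γ/2 for a few values of γ<1.

```python
import math

def p_ell(gamma, l):
    """Borel probability p_l = e^{-gamma*l} (gamma*l)^(l-1) / l!, computed in log space."""
    return math.exp(-gamma * l + (l - 1) * math.log(gamma * l) - math.lgamma(l + 1))

def one_minus_series(gamma, max_ell=2000):
    """Truncated value of the left-hand side above, i.e. 1 - sum_{l>=1} p_l / l."""
    return 1.0 - sum(p_ell(gamma, l) / l for l in range(1, max_ell + 1))

# For gamma < 1 the value should be close to gamma / 2.
for gamma in (0.25, 0.5, 0.75, 0.95):
    print(gamma, round(one_minus_series(gamma), 4), gamma / 2)
```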
From Theorem 1, we have
Corollary 1
$$\frac{b}{n} = 1 - e^{-\gamma} +o_{p}(1) \,. $$
Proof
Indeed, \(b=n-c_{1}\) and \(\frac{c_{1}}{n} = e^{-\gamma}+o_{p}(1)\). □
Corollary 2
$$\frac{d}{n} = 1-\sum_{\ell=1}^{\infty} \frac{p_{\ell}}{\ell} +o_{p}(1) \,. $$
Proof
By definition,
$$d = b - \sum_{\ell=2}^{\infty} c_{\ell} = n - \sum_{\ell=1}^{\infty} c_{\ell} \,. $$
For any fixed M, the number of m-cycles, where m>M, is bounded by \(\frac{n}{M}\). By Theorem 1, for any fixed ℓ, we have \(\frac{c_{\ell}}{n} = \frac{p_{\ell}}{\ell}+\frac{\xi_{\ell}}{\sqrt{n}}\), where \(\xi_{\ell}\) is some random variable with \(\mathrm{E}(\xi_{\ell})=0\) and \(\text{Var}(\xi_{\ell})<1\). Hence, for any fixed M we have:
$$\frac{1}{n}\sum_{\ell=1}^{\infty} c_{\ell} = \frac{1}{n}\sum_{\ell=1}^{M} c_{\ell} + \frac{1}{n}\sum_{\ell=M+1}^{\infty} c_{\ell} \leq \sum_{\ell=1}^{M}\left(\frac{p_{\ell}}{\ell}+\frac{\xi_{\ell}}{\sqrt{n}}\right)+\frac{1}{M} \,. $$
Let \(\eta = \frac {1}{M}\sum _{\ell =1}^{M} \xi _{\ell }\), then
$$\frac{1}{n}\sum_{\ell=1}^{\infty} c_{\ell} \leq \sum_{\ell=1}^{M}\frac{p_{\ell}}{\ell}+\frac{M\eta}{\sqrt{n}}+\frac{1}{M} \,. $$
From the definition of η it follows that the random variable η has the mean value E(η)=0 and the variance Var(η)<1. Let
$$r(m) = \sum_{\ell=m+1}^{\infty}\frac{p_{\ell}}{\ell} \,, $$
so that \(r(0) = \sum _{\ell =1}^{\infty } \frac {p_{\ell }}{\ell }\). Then
$$\frac{1}{n}\sum_{\ell=1}^{\infty} c_{\ell} - r(0) \leq \frac{1}{M} - r(M) +\frac{M\eta}{\sqrt{n}} \,. $$
The lower bound is obvious:
$$\frac{1}{n}\sum_{\ell=1}^{\infty} c_{\ell} - r(0) \geq - r(M)+\frac{M\eta}{\sqrt{n}} \,. $$
Since the series \(\sum _{\ell =1}^{\infty }\frac {p_{\ell }}{\ell }\) converges, for each ε>0 one can choose M and then choose n in such a way that
$$-\varepsilon \leq \frac{1}{n}\sum_{\ell=1}^{\infty} c_{\ell} - r(0) \leq \varepsilon(1+\eta) \,. $$
□
The estimates given in Corollaries 1 and 2 are very precise, as shown in Figs. 4 and 3. All simulations are performed for n=1000 (see Discussion).
Estimation algorithm
From Corollaries 1 and 2, we obtain the following approximation for the ratio \(\frac {d}{b}\):
$$ \frac{d}{b} \approx \frac{1-\sum_{\ell=1}^{\infty}{e^{-\gamma\ell}\frac{(\gamma\ell)^{\ell-1}}{\ell\cdot\ell!}}}{1 - e^{-\gamma}} \,, $$
(3)
and then estimate γ with the bisection method. The plot of \(\frac {d}{b}\) as a function of γ is shown in Fig. 5. As one can see, this function is increasing, and each value of \(\frac {d}{b}\) in the interval [0.5,1] uniquely determines the value of γ. In particular, γ=1 corresponds to \(\frac {d}{b} \approx 0.79\). That is, if \(\frac {d}{b}<0.79\), then the process is in the parsimony phase and the true distance k is accurately approximated by the DCJ distance d.
Our simulations demonstrate that this estimation method is robust for γ<2. For larger values of γ, the estimator is too sensitive to small random deviations.
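As an illustration of this step, here is a minimal sketch (ours, with hypothetical function names) that evaluates the right-hand side of (3) by truncating the series and inverts it by bisection; the upper end of the search interval is capped just above the robustness limit γ<2 noted above.

```python
import math

def p_ell(gamma, l):
    """p_l = e^{-gamma*l} (gamma*l)^(l-1) / l!, computed in log space."""
    return math.exp(-gamma * l + (l - 1) * math.log(gamma * l) - math.lgamma(l + 1))

def d_over_b(gamma, max_ell=500):
    """Right-hand side of Eq. (3), with the series truncated at max_ell."""
    d_over_n = 1.0 - sum(p_ell(gamma, l) / l for l in range(1, max_ell + 1))
    b_over_n = 1.0 - math.exp(-gamma)
    return d_over_n / b_over_n

def estimate_gamma(db_observed, lo=1e-6, hi=2.5, iters=100):
    """Bisection on the increasing function gamma -> d/b for the observed ratio d/b."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if d_over_b(mid) < db_observed:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: an observed ratio d/b = 0.79 yields an estimate of gamma close to 1.
print(estimate_gamma(0.79))
```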
Alternatively, the value of γ can be estimated from the ratio \(\frac{b}{\sum_{\ell=2}^{m} \ell c_{\ell}}\) for some m. In our simulations, the best results were observed for m=6. Again, the function
$$\begin{array}{*{20}l} f_{m}(\gamma) = \frac{1 - e^{-\gamma}}{\sum_{\ell=2}^{m}e^{-\gamma\ell}\frac{(\gamma\ell)^{\ell-1}}{\ell!}} \end{array} $$
(4)
is increasing (Fig. 6). The applicability limits for this estimator depend on the value of n: if \(\sum_{\ell=2}^{m} \ell c_{\ell}\) is close to zero, the estimator becomes sensitive to small deviations. In our simulations with n=1000 (see Discussion), the estimator is very precise for γ<2.5, with relative error below 10% in 95% of observations.
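A corresponding sketch for this alternative estimator (again ours, with hypothetical names and the same series truncation) evaluates \(f_{m}(\gamma)\) from (4) and inverts it by bisection against the observed ratio \(b/\sum_{\ell=2}^{m} \ell c_{\ell}\).

```python
import math

def borel_term(gamma, l):
    """Summand of Eq. (4): e^{-gamma*l} (gamma*l)^(l-1) / l!, computed in log space."""
    return math.exp(-gamma * l + (l - 1) * math.log(gamma * l) - math.lgamma(l + 1))

def f_m(gamma, m=6):
    """Eq. (4): f_m(gamma) = (1 - e^{-gamma}) / sum_{l=2}^{m} e^{-gamma*l}(gamma*l)^{l-1}/l!."""
    return (1.0 - math.exp(-gamma)) / sum(borel_term(gamma, l) for l in range(2, m + 1))

def estimate_gamma_alt(b, cycle_counts, m=6, lo=1e-6, hi=2.5, iters=100):
    """Invert the increasing function f_m by bisection.

    cycle_counts[l] is the observed number of l-cycles c_l in the breakpoint graph.
    """
    observed = b / sum(l * cycle_counts.get(l, 0) for l in range(2, m + 1))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f_m(mid, m) < observed:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Note that if all \(c_{\ell}\) with 2≤ℓ≤m are zero, the observed ratio is undefined, which is precisely the applicability caveat mentioned above.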
Once we obtain an estimate \(\gamma_{e}\) of γ, it is easy to estimate the values of n and k as follows:
$$ n_{e} = \frac{b}{1-e^{-\gamma_{e}}}\quad\text{and}\quad k_{e} = \frac{\gamma_{e} \cdot n_{e}}2. $$
(5)
We report the value \(k_{e}\) as our estimate of the true evolutionary distance.
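Equation (5) translates directly into code; the following sketch is our illustration (the observed value b=632 is hypothetical) and returns \(n_{e}\) and \(k_{e}\) from b and an estimate \(\gamma_{e}\) obtained by either method above.

```python
import math

def estimate_n_and_k(b, gamma_e):
    """Eq. (5): n_e = b / (1 - e^{-gamma_e}) and k_e = gamma_e * n_e / 2."""
    n_e = b / (1.0 - math.exp(-gamma_e))
    k_e = gamma_e * n_e / 2.0
    return n_e, k_e

# Hypothetical observed values: b = 632 breakpoints and gamma_e = 1.0
# give roughly n_e = 1000 fragile regions and k_e = 500 DCJs.
n_e, k_e = estimate_n_and_k(632, 1.0)
print(round(n_e), round(k_e))
```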