Sets of medians in the non-geodesic pseudometric space of unsigned genomes with breakpoints

Jamshidpey, Arash; Jamshidpey, Aryo; Sankoff, David

doi:10.1186/1471-2164-15-S6-S3

Volume 15 Supplement 6

Proceedings of the Twelfth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics

Research
Open access
Published: 17 October 2014


BMC Genomics volume 15, Article number: S3 (2014)


Abstract

Background

The breakpoint median in the set S_n of permutations on n terms is known to have some unusual behavior, especially if the input genomes are maximally different to each other. The mathematical study of the set of medians is complicated by the facts that breakpoint distance is not a metric but a pseudo-metric, and that it does not define a geodesic space.

Results

We introduce the notion of partial geodesic, or geodesic patch between two permutations, and show that if two permutations are medians, then every permutation on a geodesic patch between them is also a median. We also prove the conjecture that the input permutations themselves are medians.

Background

Among the common measures of gene order difference between two genomes, the edit distances, such as reversal distance or double-cut-and-join distance, contrast with the breakpoint distance in that the former are defined in a geodesic space while the latter is not. Another characteristic of breakpoint distance that it does not share with most other genomic distances is that it is a pseudometric rather than a metric.

A problem in computational comparative genomics that has been extensively studied under many definitions of genomic distance is the gene order median problem [1], the archetypical instance of the gene order small phylogeny problem. The median genome is meant, in the first instance, to embody the information in common among k ≥ 3 given genomes, and second, to estimate the ancestral genome of these k genomes. We have shown that the second goal becomes unattainable as n → ∞, where n is the length of the genomes, if there are more than 0.5n mutational steps changing the gene order [2]. Moreover, we have conjectured, and demonstrated in simulation studies, that where there is little or nothing in common among the k input genomes, the median tends to reflect only one (actually, any one) of them, with no incorporation of information from the other k − 1 [3].

In the present paper, we investigate this conjecture mathematically in the context of a wider study of medians for the breakpoint distance between unsigned linear unichromosomal genomes, although the methods and results are equally valid for genomes with signed and/or circular chromosomes, as well as those with χ >1 chromosomes, where χ is a fixed parameter. Our approach involves first a rigorous treatment of the pseudometric character of the breakpoint distance. Then, given the non-geodesic nature of the space we are able to define a weaker concept of geodesic patch, that we use later, given two or more medians, to locate further medians. We also prove the conjecture that for k genomes containing no gene order information among them, the normalized (divided by n) median score tends to k − 1, with high probability.

Results

From pseudometric to metric

We denote by S_n the set of all permutations of length n. Each permutation represents a unichromosomal linear genome where the numbers all represent different genes. For a permutation π := π₁ ... π_n we define the set of adjacencies of π to be all the unordered pairs $\{π_{i}, π_{i+1}\} = \{π_{i+1}, π_{i}\}$ for i = 1, ..., n − 1. For I ⊆ S_n we denote by $A_{I} : = A_{I}^{(n)}$ the set of all common adjacencies of the elements of I. Then $A_{S_{n}} = \emptyset$ , and we also write $A_{\emptyset}$ for the set of all pairs {i, j}, i ≠ j. For any I, J ⊆ S_n, $A_{I \cup J} = A_{I} \cap A_{J}$ . It will sometimes be convenient to write $A_{I}$ , the set of common adjacencies in I = {x₁, ..., x_k}, as $A_{x_{1}, \dots, x_{k}}$ . For example $A_{x, y, z}$ represents the set of adjacencies common to permutations x, y and z.

For x, y ∈ S_n we define the breakpoint distance (bp distance) between x and y by

d^{(n)} (x, y) : = n - 1 - | A_{x, y} | .

(1)

This distance is not a metric on S_n but rather a pseudometric, because distinct permutations can be at distance zero: d⁽ⁿ⁾(x, y) = 0 but x ≠ y whenever x = π₁ ... π_n and y = π_n ... π₁. In these cases, the permutations x and y are said to be equivalent, denoted by x ~ y. The equivalence class containing π is represented by [π] and contains exactly two permutations, π₁ ... π_n and π_n ... π₁. The number of classes is thus n!/2. For any π, we denote the other element of [π] by $\bar{π}$ . The bp distance, a metric on the set of all equivalence classes of S_n, denoted by $Ŝ_{n} := S_{n} / \sim$ , is defined by

d^{(n)} ([x], [y]) : = d^{(n)} (x, y) .

(2)

Where there is no risk of ambiguity, we can simplify the notation by using x and y instead of [x] and [y], and/or drop the superscript n.
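As a concrete illustration of Equations (1) and (2), the following minimal Python sketch computes adjacency sets and the bp distance. The names `adjacencies`, `bp_distance` and `canonical` are illustrative choices, not notation from the paper. Permutations are assumed to be tuples of distinct gene labels, and the class [π] is represented, by assumption, by the lexicographically smaller of π and its reversal.

```python
def adjacencies(perm):
    """Return the set of unordered pairs {p_i, p_(i+1)} of a permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance d(x, y) = n - 1 - |A_{x,y}|, as in Equation (1)."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


def canonical(perm):
    """Hypothetical class representative: the smaller of perm and its reversal."""
    perm = tuple(perm)
    return min(perm, perm[::-1])


x = (1, 2, 3, 4, 5, 6)
y = (1, 3, 5, 2, 4, 6)
print(bp_distance(x, y))                    # 5: no adjacency in common
print(bp_distance(x, x[::-1]))              # 0: x and its reversal are equivalent
print(canonical(x[::-1]) == canonical(x))   # True: both map to the same class
```

Working with `canonical` representatives is one simple way to pass from the pseudometric on S_n to the metric on the set of classes.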

It is clear that the maximum possible bp distance between two permutation classes is n − 1 when they have no common adjacencies. Bp distance is symmetric on S_n and hence on $Ŝ_{n}$ . By construction, it is reflexive on $Ŝ_{n}$ . To verify the triangle inequality, consider three permutations x, y, z. We have

A_{x, z} \supseteq A_{x, y, z} = A_{x, y} \cap A_{y, z}

(3)

Therefore

d (x, z) = n - 1 - | A_{x, z} | \leq n - 1 - | A_{x, y} | - | A_{y, z} | + | A_{x, y} \cup A_{y, z} | .

(4)

But $| A_{x, y} \cup A_{y, z} | = | A_{y} \cap (A_{x} \cup A_{z}) | \leq n - 1$ , so the right-hand side of (4) is at most 2(n − 1) − |A_x,y| − |A_y,z| = d(x, y) + d(y, z), and hence the triangle inequality holds.
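The pseudometric properties can also be checked exhaustively for a small case. The sketch below does this on S_4; the helper names and the choice n = 4 are assumptions made only to keep the enumeration small.

```python
from itertools import permutations, product


def adjacencies(perm):
    """Set of unordered adjacent pairs of a permutation."""
    return frozenset(frozenset(pair) for pair in zip(perm, perm[1:]))


n = 4
space = list(permutations(range(1, n + 1)))
adj = {p: adjacencies(p) for p in space}  # precomputed adjacency sets


def d(x, y):
    """Breakpoint distance on S_n."""
    return n - 1 - len(adj[x] & adj[y])


symmetric = all(d(x, y) == d(y, x) for x, y in product(space, repeat=2))
triangle = all(d(x, z) <= d(x, y) + d(y, z)
               for x, y, z in product(space, repeat=3))
# Ordered pairs of distinct permutations at distance zero (x and its reversal).
zero_pairs = sum(1 for x, y in product(space, repeat=2)
                 if x != y and d(x, y) == 0)
print(symmetric, triangle, zero_pairs)  # True True 24
```

Each of the 24 permutations has exactly one distinct partner at distance zero, its reversal, which is why d is only a pseudometric on S_n.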

We say a pseudometric (or a metric) $\tilde{ρ}$ is right invariant on a group G if for any $x, y, z \in G, \tilde{ρ} (x, y) = \tilde{ρ} (x z, y z)$ . The definition of the left invariance is similar. A pseudometric (metric) which is both right and left invariant is called invariant. Bp distance is an invariant pseudometric on S_n.

Definition 1 Given a set {x₁, . . . , x_k} ⊆ S and a pseudometric ρ on S, a median for the set is a point µ ∈ S such that $\sum_{i = 1}^{k} ρ (μ, x_{i})$ is minimal.

Defining the geodesic patch

A discrete metric space (S, ρ) is a geodesic space if for any two points x, y ∈ S there exists a finite subset of S containing x, y that is isometric with the discrete line segment [0, 1, ..., ρ(x, y)]. Any subset of S with this property, and there may be several, is called a geodesic between x and y. For example, all connected graphs are geodesic spaces. In a geodesic space the medians of two points x and y consist of all the points located on geodesics between x and y.

What can we say when the space is not a geodesic space? To answer this, we extend the concept of geodesic by introducing the concept of a geodesic patch. A geodesic patch between x and y is a maximal subset of S containing x, y which is isometric to a subsegment (not necessarily contiguous) of the line segment [0, 1, ..., ρ(x, y)]. For any two points x, y in an arbitrary metric space (S, ρ) there exists at least one geodesic patch between them because {x, y} is isometric to {0, ρ(x, y)}. In addition, any geodesic is a geodesic patch. Any point z on a geodesic patch between x, y satisfies:

ρ (x, y) = ρ (x, z) + ρ (z, y) .

(5)

Therefore all the medians of two points x and y must lie on a geodesic patch between them. We denote the set of all permutations lying on geodesic patches connecting x, y ∈ S_n by $\bar{[x, y]}$ , as in Figure 1.

$(Ŝ_{n}, d)$ is not a geodesic space. For example there is no geodesic connecting the identity permutation id and π := 1 2 x₁ x₂ ... x_{n−4} (n − 1) n when x₁ x₂ ... x_{n−4} is a non-identity permutation on {3, ..., n − 2}. The smallest change to id is to cut one of its adjacencies, say {i, i + 1}, and rejoin the two segments in one of the three possible ways: 1 to n, 1 to i + 1 or n to i. Now if we cut the adjacencies {1, 2} or {n − 1, n} in id the distance of the new permutation to both id and π increases. If on the other hand we cut one of the other adjacencies in id all the ways of rejoining, which increase the distance to id, either increase or leave unchanged the distance to π, since {1, n}, {1, i + 1} and {n, i} are not adjacencies in $A_{π}$ . Therefore there is no geodesic connecting id to π.
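The smallest instance of this counterexample can be checked by brute force. The sketch below assumes n = 6 and π = 1 2 4 3 5 6 (so x₁ x₂ = 4 3 on {3, 4}); a geodesic would need some z at distance 1 from id and distance d(id, π) − 1 from π, and the search finds none. The helper names are illustrative.

```python
from itertools import permutations


def adjacencies(perm):
    """Set of unordered adjacent pairs of a permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance d(x, y) = n - 1 - |A_{x,y}|."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


ident = (1, 2, 3, 4, 5, 6)
pi = (1, 2, 4, 3, 5, 6)  # assumed instance of 1 2 x_1 x_2 (n-1) n with n = 6
d_total = bp_distance(ident, pi)

# Candidate first steps of a geodesic from ident to pi.
first_steps = [z for z in permutations(ident)
               if bp_distance(ident, z) == 1
               and bp_distance(z, pi) == d_total - 1]
print(d_total, first_steps)  # 2 []
```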

Although $Ŝ_{n}$ is not a geodesic space there may still exist permutations with a geodesic between them. For example

{id = 123456, 213456, 312456, 421356, 531246, π = 135246}

(6)

is a geodesic between id and π. Note d(id, π) = 5, the maximum possible distance in $Ŝ_{6}$ .
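That the six permutations in (6) form a geodesic can be verified directly: the i-th and j-th elements of the chain must be at bp distance |i − j|. The following sketch performs this check; writing the permutations as strings of digits is a convenience assumed here.

```python
def adjacencies(perm):
    """Set of unordered adjacent pairs of a permutation (string or tuple)."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance d(x, y) = n - 1 - |A_{x,y}|."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


chain = ["123456", "213456", "312456", "421356", "531246", "135246"]

# Isometry with the segment [0, 1, ..., 5]: d(chain[i], chain[j]) == j - i.
is_geodesic = all(bp_distance(chain[i], chain[j]) == j - i
                  for i in range(len(chain))
                  for j in range(i, len(chain)))
print(is_geodesic, bp_distance(chain[0], chain[-1]))  # True 5
```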

The median value and medians of permutations with maximum pairwise distances

In this section we investigate the bp median problem in the case of k permutations with maximum pairwise distances. As we shall see later, this situation is very similar to the case of k uniformly random permutations. Let (S, ρ) be a pseudometric space.

The total distance of a point x ∈ S to a finite subset ∅ ≠ B ⊆ S is defined to be

ρ (x, B) : = \sum_{y \in B} ρ (x, y) .

(7)

The median value of B, $m^{S, ρ} (B)$ , is the infimum of the total distance when the infimum is over all the points x ∈ S, that is

m^{S, ρ} (B) : = inf_{x \in S} ρ (x, B) .

(8)

We can extend this definition to sets with multiplicities. Let ∅ ≠ B ⊆ S. We define a multiplicity function n_B from B to $\mathbb{N}$ and write n_B (x) = n_x. We call A = (B, n_B) a set with multiplicities. We define the total distance of a point x ∈ S to A to be

ρ (x, A) : = \sum_{y \in B} n_{y} ρ (x, y) .

(9)

The definition of median value in Equation (8) can be extended in an analogous way to the median value of a set with multiplicity A. When S is finite then the total distance function takes its minimum on S and "inf" turns into "min" in the above formulation. The points of the space S that minimize the total distance to A are called the median points or medians of A and the set of all these medians is called the median set of A, denoted by M ^S,ρ(A).
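For small n the median value and median set can be found by exhaustive search over S_n. The sketch below is one such brute-force version for the bp distance, with a set with multiplicities encoded (as an assumption of this example) as a dictionary from permutations to multiplicities.

```python
from itertools import permutations


def adjacencies(perm):
    """Set of unordered adjacent pairs of a permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance d(x, y) = n - 1 - |A_{x,y}|."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


def total_distance(x, weighted):
    """Total distance of x to a set with multiplicities, as in Equation (9)."""
    return sum(mult * bp_distance(x, y) for y, mult in weighted.items())


def median_set(weighted, n):
    """Return (median value, list of medians) by enumerating all of S_n."""
    space = list(permutations(range(1, n + 1)))
    totals = {x: total_distance(x, weighted) for x in space}
    best = min(totals.values())
    return best, [x for x in space if totals[x] == best]


# Hypothetical input: 1234 with multiplicity 2 and 1324 with multiplicity 1.
A = {(1, 2, 3, 4): 2, (1, 3, 2, 4): 1}
value, medians = median_set(A, 4)
print(value, medians)  # 2 [(1, 2, 3, 4), (4, 3, 2, 1)]
```

The two medians are the two members of the class [1234], in line with Equation (15) further on: medians in S_n come in whole equivalence classes.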

Let B and A = (B, n_B) be a subset and a subset with multiplicities of S_n. We define [B] to be the set of all permutation classes of S_n that have at least one of their permutations in B. That is

[B] = {[x] \in Ŝ_{n} such that \exists y \in B with x ~ y} .

(10)

Two nonempty subsets B, B′ ⊆ S_n are said to be equivalent, denoted by B ~ B′, if [B] = [B′]. Also we define [n_B] to be a function from [B] to $\mathbb{N}$ with

[n_{B}] ([x]) = n_{[x]} : = \sum_{x ~ y \in B} n_{y} .

(11)

Then the definition of [A] is straightforward:

[A] : = ([B], [n_{B}]),

(12)

and we say two nonempty subsets of S_n with multiplicities, namely A and A′, are equivalent, denoted by A ~ A′, if [A] = [A′]. In fact [A] is the equivalence class containing A. We call [A] a subset of $Ŝ_{n}$ with multiplicities. We use the notations "[ ]" and " ~ " for all the above concepts without restriction.

With these definitions we can readily verify that in the context of bp distance, for A ~ A′ and x ~ x′, we have

d (x, A) = d (x^{'}, A^{'}) = d ([x], [A]) .

(13)

Recall that we use d as both a metric on $Ŝ_{n}$ and a pseudometric on S_n. Therefore we can conclude that

m^{S_{n}, d} (A) = m^{S_{n}, d} (A^{'}) = m^{Ŝ_{n}, d} ([A])

(14)

and similarly

[M^{S_{n}, d} (A)] = [M^{S_{n}, d} (A^{'})] = M^{Ŝ_{n}, d} ([A]) .

(15)

Henceforward, we will simplify by replacing the notation $m^{S_{n}, d} (A)$ and $M^{S_{n}, d} (A)$ by m_n(A) and M_n(A), respectively. Also for a subset [A] of $Ŝ_{n}$ with multiplicities, we will use the notation m_n([A]) and M_n([A]) instead of $m^{Ŝ_{n}, d} ([A])$ and $M^{Ŝ_{n}, d} ([A])$ respectively. Where there is no ambiguity we will suppress the subscript n.

Proposition 1 Suppose $X : = \{x_{1}, \dots, x_{k}\} \subset Ŝ_{n}$ such that d(x_i, x_j) = n − 1 for any i ≠ j, 1 ≤ i, j ≤ k. Then the bp median value of X is (k − 1)(n − 1). Moreover, m^∗ is a median of X, m^∗ ∈ M (X), if and only if $A_{m^{*}} \subset \cup_{i = 1}^{k} A_{x_{i}}$ .

Proof Let $π \in Ŝ_{n}$ be an arbitrary permutation class. Since $A_{π, x_{i}} \subset A_{x_{i}}$ and $A_{π, x_{j}} \subset A_{x_{j}}$ for any 1 ≤ i, j ≤ k, and $A_{x_{i}} \cap A_{x_{j}} = \emptyset$ for i ≠ j, we have $A_{π, x_{i}} \cap A_{π, x_{j}} = \emptyset$ for i ≠ j. Also

\cup_{i = 1}^{k} A_{π, x_{i}} \subset A_{π}

(16)

Therefore

\sum_{i = 1}^{k} | A_{π, x_{i}} | \leq | A_{π} | = n - 1

(17)

Hence

\sum_{i = 1}^{k} d (π, x_{i}) \geq (k - 1) (n - 1)

(18)

The equality holds letting π = x_i for any 1 ≤ i ≤ k. This proves the first part of the proposition. For the second part, m^∗ ∈ M (X) is equivalent to the total distance of m^∗ to X being (k − 1)(n − 1), and this is equivalent to $\sum_{i = 1}^{k} |A_{m^{*}, x_{i}}| = n - 1$ , that is, to $\cup_{i = 1}^{k} A_{m^{*}, x_{i}} = A_{m^{*}}$ . Since the left-hand side can be written as $A_{m^{*}} \cap (\cup_{i = 1}^{k} A_{x_{i}})$ , this holds if and only if $A_{m^{*}} \subset \cup_{i = 1}^{k} A_{x_{i}}$ . This finishes the proof of the equivalence in the proposition.
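Proposition 1 can be checked numerically on a small instance. The sketch below assumes n = 7 and k = 3, with three permutations chosen for this example so that their adjacency sets are pairwise disjoint; it compares the brute-force median set with the set predicted by the adjacency criterion. The variable names are illustrative.

```python
from itertools import combinations, permutations


def adjacencies(perm):
    """Set of unordered adjacent pairs of a permutation."""
    return frozenset(frozenset(pair) for pair in zip(perm, perm[1:]))


def bp_distance(x, y):
    """Breakpoint distance d(x, y) = n - 1 - |A_{x,y}|."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


n, k = 7, 3
# Assumed example: three permutations at pairwise maximum distance n - 1.
X = [(7, 1, 2, 6, 3, 5, 4), (7, 2, 3, 1, 4, 6, 5), (7, 3, 4, 2, 5, 1, 6)]
assert all(bp_distance(a, b) == n - 1 for a, b in combinations(X, 2))

union = frozenset().union(*(adjacencies(x) for x in X))
totals = {p: sum(bp_distance(p, x) for x in X)
          for p in permutations(range(1, n + 1))}
value = min(totals.values())
medians = {p for p, t in totals.items() if t == value}
# Criterion of Proposition 1: every adjacency of the median occurs in some x_i.
predicted = {p for p in totals if adjacencies(p) <= union}

print(value, medians == predicted, all(x in medians for x in X))  # 12 True True
```

Here (k − 1)(n − 1) = 12, and the medians are exactly the permutations that avoid the three pairs not used by any input, namely {4, 7}, {5, 7} and {6, 7}.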

Lemma 1 Let x, y, z be three permutation classes in $Ŝ_{n}$ that are pairwise at a maximum distance n − 1 from each other. Then for any $w \in \bar{[x, y]}$ we have d (w, z) = n − 1.

Proof Since $w \in \bar{[x, y]}$ we have A_w ⊂ A_x ∪ A_y. Also we know that $A_{z} \cap (A_{x} \cup A_{y}) = \emptyset$ . Hence A_w,z = ∅ and d(w, z) = n − 1.

The above lemma simply indicates that for any two points x_i, x_j in the set X in the proposition above $\bar{[x_{i}, x_{j}]} \subset M (X)$ since the total distance of each point in $\bar{[x_{i}, x_{j}]}$ to X is (k − 1)(n − 1).

Corollary 1 Suppose $X : = \{x_{1}, \dots, x_{k}\} \subset Ŝ_{n}$ such that d(x_i, x_j) = n − 1 for any i ≠ j. Then $\cup_{i, j} \bar{[x_{i}, x_{j}]} \subset M (X)$ .

What more can we say about the median positions? The notion of "accessibility" will help us to keep track of some other medians of the set X that are not in $\cup_{i, j} \bar{[x_{i}, x_{j}]}$ . Before defining this concept, we first need more information about the properties of $\bar{[x, y]}$ for $x, y \in Ŝ_{n}$ .

Lemma 2 Let $x, y \in Ŝ_{n}$ . Then $z \in \bar{[x, y]}$ if and only if $A_{x, y} \subset A_{z} \subset A_{x} \cup A_{y}$ .

Proof We know $z \in \bar{[x, y]}$ if and only if d(x, z) + d(z, y) = d(x, y). On the other hand we can write A_z as follows

A_{z} = A_{z, x, y} \cup (A_{z, x} \ A_{y}) \cup (A_{z, y} \ A_{x}) \cup (A_{z} \ (A_{x} \cup A_{y})),

(19)

where the pairwise intersection of the sets in the right hand side is empty. We can also write

d (x, z) = (n - 1) - | A_{z, x, y} | - | A_{z, x} \ A_{y} |

(20)

and

d (z, y) = (n - 1) - | A_{z, x, y} | - | A_{z, y} \ A_{x} | .

(21)

Furthermore

d (x, y) \leq (n - 1) - | A_{z, x, y} |

(22)

and

(n - 1) - | A_{z, x, y} | - | A_{z, x} \ A_{y} | - | A_{z, y} \ A_{x} | = | A_{z} \ (A_{x} \cup A_{y}) | .

(23)

Now for "sufficiency", we have

[(n - 1) - | A_{z, x, y} | - | A_{z, x} \ A_{y} |] + [(n - 1) - | A_{z, x, y} | - | A_{z, y} \ A_{x} |]

(24)

= (n - 1) - | A_{x, y} | \leq (n - 1) - | A_{x, y, z} |

(25)

Therefore by Equation (23) we have

(n - 1) - | A_{z, x, y} | - | A_{z, x} \ A_{y} | - | A_{z, y} \ A_{x} | = | A_{z} \ (A_{x} \cup A_{y}) | \leq 0

(26)

This results in |A_x,y| = |A_x,y,z| and hence in A_x,y ⊂ A_z. Otherwise the inequality in (26) will be strict, which is impossible. On the other hand the inequality in (26) shows $A_{z} \ (A_{x} \cup A_{y}) = \emptyset$ , which yields $A_{z} \subset A_{x} \cup A_{y}$ .

For "necessity", we have

(n - 1) - | A_{z, x, y} | - | A_{z, x} \ A_{y} | - | A_{z, y} \ A_{x} | + (n - 1) - | A_{x, y} | = (n - 1) - | A_{x, y} |

(27)

This is true because of A_z ⊂ A_x ∪ A_y and Equation (23). But since A_x,y ⊂ A_z ⊂ A_x ∪ A_y we have |A_x,y| = |A_x,y,z| and we can replace |A_x,y| by |A_x,y,z| in the left hand side of the last equality. This finishes the "necessity" proof.
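Lemma 2 lends itself to an exhaustive check for small n. The sketch below compares, for every triple in S_4, the distance condition of Equation (5) with the adjacency condition of the lemma; the function names and the choice n = 4 are assumptions of this example.

```python
from itertools import permutations, product


def adjacencies(perm):
    """Set of unordered adjacent pairs of a permutation."""
    return frozenset(frozenset(pair) for pair in zip(perm, perm[1:]))


n = 4
space = list(permutations(range(1, n + 1)))
adj = {p: adjacencies(p) for p in space}


def d(x, y):
    """Breakpoint distance on S_n."""
    return n - 1 - len(adj[x] & adj[y])


def on_patch(x, y, z):
    """z lies on a geodesic patch between x and y, as in Equation (5)."""
    return d(x, z) + d(z, y) == d(x, y)


def lemma2_condition(x, y, z):
    """A_{x,y} is contained in A_z, and A_z in the union of A_x and A_y."""
    return (adj[x] & adj[y]) <= adj[z] <= (adj[x] | adj[y])


agree = all(on_patch(x, y, z) == lemma2_condition(x, y, z)
            for x, y, z in product(space, repeat=3))
print(agree)  # True
```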

Definition 2 Let X := {x₁, ..., x_k} be a subset of $Ŝ_{n}$ . We say a permutation class $z \in Ŝ_{n}$ is 1-accessible from X if there exists an m ∈ $\mathbb{N}$ , a finite sequence y₁, ..., y_m where y_i ∈ X and z₁, ..., z_m, where $z_{i} \in Ŝ_{n}$ , such that z₁ = y₁, z_m = z and $z_{i + 1} \in \bar{[z_{i}, y_{i + 1}]}$ for $i = 1, \dots, m - 1$ . See Figure 2.

We denote the set of all 1-accessible points of X by Z(X). We define Z₀(X) := X. Also for r ∈ $\mathbb{N}$ ∪ {0}, by induction, we define Z_{r+1}(X) to be Z(Z_r(X)) and we call it the set of all (r+1)-accessible permutation classes. That is Z₁(X) = Z(X), Z₂(X) = Z(Z(X)) and so on. It is clear that Z_{r+1}(X) includes Z_r(X) and also $\cup_{x, y \in Z_{r} (X)} \bar{[x, y]}$ . A permutation class z is said to be accessible from X if there exists r ∈ $\mathbb{N}$ such that z ∈ Z_r(X). We denote the set of all accessible points by $\bar{Z} (X) = \cup_{r \in \mathbb{N} \cup \{0\}} Z_{r} (X)$ .

Note that $Z (\bar{Z} (X)) = \bar{Z} (X)$ . This holds because for any permutation class z that is 1-accessible from $\bar{Z} (X)$ , there must exist $m \in \mathbb{N}$ , $r_{0} \in \mathbb{N} \cup \{0\}$ and $y_{1}, \dots, y_{m} \in Z_{r_{0}} (X)$ (the y_i's must be in $\bar{Z} (X)$ , thus there must be such an r₀), and z₁, ..., z_m, where $z_{i} \in Ŝ_{n}$ , such that z₁ = y₁, z_m = z and $z_{i + 1} \in \bar{[z_{i}, y_{i + 1}]}$ . Therefore $z \in Z_{r_{0} + 1} (X) \subset \bar{Z} (X)$ . We can then conclude that $\bar{Z} (\bar{Z} (X)) = \bar{Z} (X)$ .

Proposition 2 Suppose $X : = \{x_{1}, \dots, x_{k}\} \subset Ŝ_{n}$ such that d (x_i, x_j) = n − 1 for any i ≠ j. Then for any permutation class $z \in \bar{Z} (X)$ the total distance d (z, X) between z and X is (k − 1)(n − 1) and hence $\bar{Z} (X) \subset M (X)$ . Furthermore if m₁, m₂ ∈ M (X) then $\bar{[m_{1}, m_{2}]} \subset M (X)$ .

Proof Suppose m₁, m₂ ∈ M (X) and $m^{*} \in \bar{[m_{1}, m_{2}]}$ . By Lemma 2 and Proposition 1 we have $A_{m^{*}} \subset A_{m_{1}} \cup A_{m_{2}} \subset \cup_{i = 1}^{k} A_{x_{i}}$ . Applying Proposition 1 again, we have m^∗ ∈ M (X). Now it suffices to show that for any r ∈ $\mathbb{N}$ ∪ {0}, Z_r (X) ⊂ M (X). We prove this by induction. For r = 0 this follows from Corollary 1. Suppose Z_r (X) ⊂ M (X). By definition we have Z_{r+1}(X) = Z(Z_r(X)). That is, for z ∈ Z_{r+1}(X) there exists an m ∈ $\mathbb{N}$ , y₁, ..., y_m ∈ Z_r (X) and z₁, ..., z_m, where $z_{i} \in Ŝ_{n}$ , such that z₁ = y₁, z_m = z and $z_{i + 1} \in \bar{[z_{i}, y_{i + 1}]}$ . In particular $z_{2} \in \bar{[y_{1}, y_{2}]}$ , and by the fact we proved above z₂ ∈ M (X), since y₁, ..., y_m ∈ Z_r (X) ⊂ M (X). Continuing this we conclude that z₁, z₂, ..., z_m = z ∈ M (X). Hence Z_{r+1}(X) ⊂ M (X). This finishes the proof.
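The set Z(X) of 1-accessible points can be computed for small n as a closure: starting from X, repeatedly add every point of $\bar{[z, y]}$ for z already reached and y ∈ X. The sketch below does this for an assumed example with n = 6 and k = 2, working with permutations rather than classes, and checks the total distance stated in Proposition 2. The function names are illustrative.

```python
from itertools import permutations


def adjacencies(perm):
    """Set of unordered adjacent pairs of a permutation."""
    return frozenset(frozenset(pair) for pair in zip(perm, perm[1:]))


n = 6
space = list(permutations(range(1, n + 1)))
adj = {p: adjacencies(p) for p in space}


def d(x, y):
    """Breakpoint distance on S_n."""
    return n - 1 - len(adj[x] & adj[y])


def patch(x, y):
    """All w with d(x, w) + d(w, y) = d(x, y): the points on patches between x and y."""
    dxy = d(x, y)
    return [w for w in space if d(x, w) + d(w, y) == dxy]


def one_accessible(X):
    """Z(X): closure of X under z -> patch(z, y) with y in X (Definition 2)."""
    reached, frontier = set(X), list(X)
    while frontier:
        z = frontier.pop()
        for y in X:
            for w in patch(z, y):
                if w not in reached:
                    reached.add(w)
                    frontier.append(w)
    return reached


# Assumed input: two permutations at maximum distance n - 1 = 5.
X = [(1, 2, 3, 4, 5, 6), (1, 3, 5, 2, 4, 6)]
Z = one_accessible(X)
totals = {sum(d(z, x) for x in X) for z in Z}
print(len(Z), totals)  # totals should be {5}, i.e. (k - 1)(n - 1)
```

Iterating `one_accessible` on its own output until the set stops growing would give the full accessible set; that step is omitted here because each round scans all of S_n for every reached point.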

Conjecture 1 Every median point of X is accessible from X, that is $M (X) = \bar{Z} (X)$ .

The median value and medians of k random permutations

In this section we study the median value and median points of k independent random permutation classes uniformly chosen from $Ŝ_{n}$ . This is equivalent to studying the same problem for k random permutations sampled from S_n. All the results of this section carry over to permutations without any problem.

We make use of the fact that the bp distance of two independent random permutations tends to be close to its maximum value, n − 1. Xu et al. [4] showed that if we fix a reference linear permutation id and pick a random permutation x uniformly, the expectation and variance of $| A_{id, x}^{(n)} |$ are both very close to 2 for large enough n. Because of the symmetry of the group S_n and the fact that bp distance is an invariant pseudometric the same results hold for two random permutations x and y. We first summarize the results we need from [4].
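For small n, the distribution of the number of adjacencies shared with id can be obtained exactly by enumerating S_n. The sketch below does this for n = 7, a value chosen only so that the enumeration is quick; it is a sanity check and not a substitute for the asymptotic results of [4].

```python
from itertools import permutations


def adjacencies(perm):
    """Set of unordered adjacent pairs of a permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


n = 7
ident = tuple(range(1, n + 1))
adj_id = adjacencies(ident)

# Number of adjacencies shared with the identity, for every x in S_n.
counts = [len(adj_id & adjacencies(x)) for x in permutations(ident)]

mean = sum(counts) / len(counts)
variance = sum((c - mean) ** 2 for c in counts) / len(counts)
print(round(mean, 3), round(variance, 3))
```

Even at n = 7 the mean number of shared adjacencies is already below 2, so the typical distance to id is within a few units of the maximum n − 1.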

Let ${\tilde{ν}}_{n}$ be the uniform measure on S_n. Let $Π : S_{n} \to Ŝ_{n}$ be the natural surjective map sending each permutation onto its corresponding permutation class.

Define

ν_{n} : = Π * {\tilde{ν}}_{n}

(28)

to be the push-forward measure of ${\tilde{ν}}_{n}$ induced by the map Π. It is clear that $ν_{n}$ is the uniform measure on $Ŝ_{n}$ . The following proposition is a reformulation of Theorems 6 and 7 in [4].

Proposition 3 [Xu-Alain-Sankoff] Let x and y be two independent random permutation classes (irpc) chosen uniformly from $Ŝ_{n}$ . Then

E [d (x, y)] = n - 3 - \frac{2}{n} - o (\frac{2}{n})

(29)

Var [d (x, y)] = 2 - \frac{2}{n} - o (\frac{2}{n})

(30)

Define the error function for the distance of x, y by

ε_{n} (x, y) : = (n - 1) - d (x, y) = | A_{x, y} | .

(31)

Corollary 2 Suppose x and y are two irpc's sampled from the uniform measure $ν_{n}$ and $a_{n}$ is an arbitrary sequence of real numbers diverging to +∞. Then $\frac{ε_{n} (x, y)}{a_{n}}$ converges to zero asymptotically $ν_{n}^{* 2}$ -almost surely (a.a.s.), that is

\frac{ε_{n} (x, y)}{a_{n}} \to 0 \text{ in probability} .

(32)

Proof The proof is straightforward from [4] and Chebyshev's inequality.

Now we are ready to study the median value of k irpc's. Let [A] be a subset of $Ŝ_{n}$ with multiplicities and with k elements. Define

e_{n} ([A]) : = (k - 1) (n - 1) - m_{n} ([A]) .

(33)

Theorem 1 Let $X^{(n)} : = \{x_{1}^{(n)}, x_{2}^{(n)}, \dots, x_{k}^{(n)}\}$ be a set of k irpc's in $Ŝ_{n}$ sampled from the measure $ν_{n}^{* k}$ . Then their breakpoint median value $m_{n}^{*} : = m_{n} (X^{(n)})$ tends to be close to its maximum after a convenient rescaling with high probability, that is, for any arbitrary sequence $a_{n} \to \infty$ as $n \to \infty$ , $\frac{e_{n}^{*}}{a_{n}} \to 0$ in $ν_{n}^{* k}$ -probability, where $e_{n}^{*} : = e_{n} (X^{(n)})$ .

Proof Let π be an arbitrary point of S_n. Let $A_{π \ X} = A_{π} \ A_{X}$ . We have

\sum_{i = 1}^{k} | A_{π, x_{i}} | \leq | A_{π \ X} | + \sum_{i = 1}^{k} | A_{π, x_{i}} | \leq (n - 1) + \binom{k}{2} α_{n}

(34)

where $α_{n}$ is max_i,j ε_n(x_i, x_j). On the other hand m_n(X⁽ⁿ⁾) ≤ (k − 1)(n − 1). The reason is the same as has already been discussed in the proof of Proposition 1. Therefore subtracting (k − 1)(n − 1) we have

0 \leq e_{n}^{*} \leq \binom{k}{2} α_{n} .

(35)

Dividing by $a_{n}$ and letting n go to ∞ the result follows from the last corollary.
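The bound in (35) can be observed on a random instance by computing the median value exhaustively. The sketch below assumes n = 7 and k = 3, draws the permutations with a fixed seed (an arbitrary choice), and evaluates e_n and α_n as defined above; the variable names are illustrative.

```python
import random
from itertools import combinations, permutations
from math import comb


def adjacencies(perm):
    """Set of unordered adjacent pairs of a permutation."""
    return frozenset(frozenset(pair) for pair in zip(perm, perm[1:]))


n, k = 7, 3
rng = random.Random(0)  # fixed seed, assumed only for reproducibility
X = [tuple(rng.sample(range(1, n + 1), n)) for _ in range(k)]

space = list(permutations(range(1, n + 1)))
adj = {p: adjacencies(p) for p in space}


def d(x, y):
    """Breakpoint distance on S_n."""
    return n - 1 - len(adj[x] & adj[y])


# Median value m_n(X) by exhaustive search over S_n.
m_n = min(sum(d(p, x) for x in X) for p in space)
e_n = (k - 1) * (n - 1) - m_n
# alpha_n: largest number of adjacencies shared by two of the inputs.
alpha_n = max(n - 1 - d(a, b) for a, b in combinations(X, 2))

bound_holds = 0 <= e_n <= comb(k, 2) * alpha_n
print(e_n, alpha_n, bound_holds)
```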

Theorem 2 Let $X^{(n)} : = \{x_{1}^{(n)}, x_{2}^{(n)}, \dots, x_{k}^{(n)}\}$ be a set of k irpc's in $Ŝ_{n}$ sampled from the measure $ν_{n}^{* k}$ . Then for any permutation class $z^{(n)} \in \bar{Z} (X^{(n)})$ the total distance of z⁽ⁿ⁾ to X⁽ⁿ⁾ is close to (k − 1)(n − 1) with high probability after a convenient rescaling. More explicitly, for any arbitrary sequence of real numbers $a_{n}$ diverging to ∞

\frac{(k - 1) (n - 1) - d^{(n)} (z^{(n)}, X^{(n)})}{a_{n}} \to 0 \text{ in } ν_{n}^{* k} \text{-probability} .

(36)

Therefore

\frac{d^{(n)} (z^{(n)}, X^{(n)}) - m_{n} (X^{(n)})}{a_{n}} \to 0 \text{ in } ν_{n}^{* k} \text{-probability} .

(37)

Furthermore if $m_{1}^{(n)}, m_{2}^{(n)} \in M_{n} (X^{(n)})$ then for any ${\tilde{m}}^{(n)} \in \bar{[m_{1}^{(n)}, m_{2}^{(n)}]}$

\frac{d^{(n)} ({\tilde{m}}^{(n)}, X^{(n)}) - m_{n} (X^{(n)})}{a_{n}} \to 0 \text{ in } ν_{n}^{* k} \text{-probability} .

(38)

Proof The structure of the proof is similar to the proof of Proposition 1. Suppose $o \in Ŝ_{n}$ with $A_{o} \subset \cup_{i = 1}^{k} A_{x_{i}}$ . Let $α_{n}$ be as defined in the proof of Theorem 1. Then by the same discussion we have

n - 1 \leq \sum_{i = 1}^{k} | A_{o, x_{i}} | \leq n - 1 + \binom{k}{2} α_{n} .

(39)

Therefore

(k - 1) (n - 1) \geq d (o, X) \geq (k - 1) (n - 1) - \binom{k}{2} α_{n}

(40)

and

\frac{(k - 1) (n - 1) - d (o, X)}{a_{n}} \to 0 \text{ in probability} .

(41)

From Theorem 1 we have

\frac{(k - 1) (n - 1) - m_{n} (X)}{a_{n}} \to 0 \text{ in probability} .

(42)

Hence

\frac{d (o, X) - m_{n} (X)}{a_{n}} \to 0 \text{ in probability} .

(43)

It suffices to show that $z : = z^{(n)} \in \bar{Z} (X)$ has the same property, that is $A_{z} \subset \cup_{i = 1}^{k} A_{x_{i}}$ . But this is clear by induction. For the second part of the theorem let $m_{1, n}^{*}, m_{2, n}^{*} \in M (X)$ . Suppose $m^{*} \in \bar{[m_{1, n}^{*}, m_{2, n}^{*}]}$ . By Theorem 1 $\frac{| A_{m_{i, n}^{*} \ X} |}{a_{n}} \to 0$ in probability for i = 1, 2. On the other hand we have $A_{m^{*} \ X} \subset A_{m_{1, n}^{*} \ X} \cup A_{m_{2, n}^{*} \ X}$ .

Therefore

\frac{| A_{m^{*} \ X} |}{a_{n}} \to 0 \text{ in probability} .

(44)

Therefore

(k - 1) (n - 1) \leq d (m^{*}, X) \leq (k - 1) (n - 1) + \binom{k}{2} α_{n}

(45)

since

\frac{| A_{m^{*}, x_{i}} \cap A_{m^{*}, x_{j}} |}{a_{n}} \to 0 \text{ in probability} .

(46)

The statement follows from the last inequality.

Conclusions

We have shown that the median value for a set of random permutations tends to be close to its extreme value with high probability. Also it has been shown that every permutation accessible from a set of random permutations can be considered as a median of that set asymptotically almost surely, and conjectured that the converse is true, that every median is accessible from the original set in this way.

Further work is needed to characterize the existence and size of non-trivial geodesic patches, in order to assess how extensive the set of medians is.

References

1. Tannier E, Zheng C, Sankoff D: Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics. 2009, 10: 120-10.1186/1471-2105-10-120.
2. Jamshidpey A, Sankoff D: Phase change for the accuracy of the median value in estimating divergence time. BMC Bioinformatics. 2013, 14: S15:S7-10.1186/1471-2105-14-157.
3. Haghighi M, Sankoff D: Medians seek the corners, and other conjectures. BMC Bioinformatics. 2012, 13: S19:S5-10.1186/1471-2105-13-195.
4. Xu AW, Alain B, Sankoff D: Poisson adjacency distributions in genome comparison: multichromosomal, circular, signed and unsigned cases. Bioinformatics. 2008, 24: i146-i152. 10.1093/bioinformatics/btn295.


Acknowledgements

Research supported in part by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC). DS holds the Canada Research Chair in Mathematical Genomics.

Declarations

The publication charges for this article were funded by the Canada Research Chair in Mathematical Genomics, and by the University of Ottawa.

This article has been published as part of BMC Genomics Volume 15 Supplement 6, 2014: Proceedings of the Twelfth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S6.

Author information

Authors and Affiliations

Department of Mathematics and Statistics, University of Ottawa, 585 King Edward Avenue, Ottawa, Canada, K1N 6N5
Arash Jamshidpey & David Sankoff
Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences, Gava Zang, Zanjan, 45195-1159, Iran
Aryo Jamshidpey


Corresponding author

Correspondence to David Sankoff.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors participated in the research, wrote the paper, read and approved the manuscript.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Jamshidpey, A., Jamshidpey, A. & Sankoff, D. Sets of medians in the non-geodesic pseudometric space of unsigned genomes with breakpoints. BMC Genomics 15 (Suppl 6), S3 (2014). https://doi.org/10.1186/1471-2164-15-S6-S3

