Sets of medians in the non-geodesic pseudometric space of unsigned genomes with breakpoints

Background The breakpoint median in the set S_n of permutations on n terms is known to have some unusual behavior, especially if the input genomes are maximally different from each other. The mathematical study of the set of medians is complicated by the facts that breakpoint distance is not a metric but a pseudometric, and that it does not define a geodesic space. Results We introduce the notion of partial geodesic, or geodesic patch, between two permutations, and show that if two permutations are medians, then every permutation on a geodesic patch between them is also a median. We also prove the conjecture that the input permutations themselves are medians.


Background
Among the common measures of gene order difference between two genomes, the edit distances, such as reversal distance or double-cut-and-join distance, contrast with the breakpoint distance in that the former are defined in a geodesic space while the latter is not. Another characteristic of breakpoint distance that it does not share with most other genomic distances is that it is a pseudometric rather than a metric.
A problem in computational comparative genomics that has been extensively studied under many definitions of genomic distance is the gene order median problem [1], the archetypical instance of the gene order small phylogeny problem. The median genome is meant, in the first instance, to embody the information in common among k ≥ 3 given genomes, and second, to estimate the ancestral genome of these k genomes. We have shown that the second goal becomes unattainable as n → ∞, where n is the length of the genomes, if there are more than 0.5n mutational steps changing the gene order [2]. Moreover, we have conjectured, and demonstrated in simulation studies, that where there is little or nothing in common among the k input genomes, the median tends to reflect only one (actually, any one) of them, with no incorporation of information from the other k − 1 [3].
In the present paper, we investigate this conjecture mathematically in the context of a wider study of medians for the breakpoint distance between unsigned linear unichromosomal genomes, although the methods and results are equally valid for genomes with signed and/or circular chromosomes, as well as those with c > 1 chromosomes, where c is a fixed parameter. Our approach involves first a rigorous treatment of the pseudometric character of the breakpoint distance. Then, given the non-geodesic nature of the space, we define a weaker concept, the geodesic patch, which we use later, given two or more medians, to locate further medians. We also prove the conjecture that for k genomes containing no gene order information among them, the normalized (divided by n) median score tends to k − 1, with high probability.

From pseudometric to metric
We denote by S_n the set of all permutations of length n. Each permutation represents a unichromosomal linear genome in which the numbers represent distinct genes. For a permutation π := π_1 ... π_n we define the set of adjacencies of π to be all the unordered pairs {π_i, π_{i+1}} = {π_{i+1}, π_i} for i = 1, ..., n − 1. For I ⊆ S_n we denote by A_I := A^(n)_I the set of all common adjacencies of the elements of I. Then A_{S_n} = ∅, and we also write A_∅ for the set of all pairs {i, j}, i ≠ j. For any I, J ⊆ S_n, A_{I ∪ J} = A_I ∩ A_J. It will sometimes be convenient to write A_I, the set of common adjacencies in I = {x_1, ..., x_k}, as A_{x_1, ..., x_k}. For example, A_{x,y,z} represents the set of adjacencies common to permutations x, y and z.
For x, y ∈ S_n we define the breakpoint distance (bp distance) between x and y by

d^(n)(x, y) := n − 1 − |A_{x,y}|.

This distance is not a metric on S_n but rather a pseudometric, because distinct permutations can be at distance zero: d^(n)(x, y) = 0 but x ≠ y when x = π_1 ... π_n and y = π_n ... π_1, for any x ∈ S_n. In these cases, the permutations x and y are said to be equivalent, denoted by x ~ y. The equivalence class containing π is represented by [π] and contains exactly two permutations, π_1 ... π_n and π_n ... π_1. The number of classes is thus n!/2. For any π, we denote the other element of [π] by π̄. The bp distance, a metric on the set of all equivalence classes of S_n, denoted by Ŝ_n := S_n/~, is defined by

d^(n)([x], [y]) := d^(n)(x, y).

Where there is no risk of ambiguity, we can simplify the notation by using x and y instead of [x] and [y], and/or drop the superscript n.
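The definitions above can be made concrete with a minimal Python sketch. The function names `adjacencies` and `bp_distance` are illustrative choices, not notation from the paper; permutations are assumed to be tuples of distinct gene labels.

```python
def adjacencies(perm):
    """Return the set of unordered adjacent pairs {pi_i, pi_(i+1)} of a permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance d(x, y) = n - 1 - |A_{x,y}| for unsigned linear genomes."""
    if sorted(x) != sorted(y):
        raise ValueError("x and y must be permutations of the same genes")
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


identity = (1, 2, 3, 4, 5)
pi = (3, 1, 4, 2, 5)
pi_reversed = pi[::-1]

# A permutation and its reversal are distinct but at distance zero,
# which is why d is only a pseudometric on S_n.
print(bp_distance(pi, pi_reversed))  # 0
# identity and pi share no adjacency, so they are at the maximum distance n - 1.
print(bp_distance(identity, pi))     # 4
```

Representing each adjacency as a `frozenset` makes the pair unordered, so a permutation and its reversal automatically have the same adjacency set.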
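It is clear that the maximum possible bp distance between two permutation classes is n − 1, attained when they have no common adjacencies. Bp distance is symmetric on S_n and hence on Ŝ_n. By construction, d(x, y) = 0 on Ŝ_n only when x = y. To verify the triangle inequality, consider three permutations x, y, z. Since A_{x,y} ∩ A_{y,z} = A_{x,y,z} ⊂ A_{x,z}, we have

|A_{x,y}| + |A_{y,z}| = |A_{x,y} ∪ A_{y,z}| + |A_{x,y,z}| ≤ |A_{x,y} ∪ A_{y,z}| + |A_{x,z}|.

Therefore

d(x, y) + d(y, z) = 2(n − 1) − |A_{x,y}| − |A_{y,z}| ≥ 2(n − 1) − |A_{x,y} ∪ A_{y,z}| − |A_{x,z}|.

But |A_{x,y} ∪ A_{y,z}| = |A_y ∩ (A_x ∪ A_z)| ≤ n − 1, so the right-hand side is at least n − 1 − |A_{x,z}| = d(x, z), and hence the triangle inequality holds.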
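As a sanity check on the argument above, the pseudometric properties can be verified exhaustively for small n. The following is a brute-force sketch, not part of the paper; the function name `satisfies_pseudometric_axioms` is hypothetical, and the check is only practical for very small n because it visits all triples of permutations.

```python
from itertools import permutations


def adjacencies(perm):
    """Set of unordered adjacent pairs of a permutation given as a tuple."""
    return frozenset(frozenset(pair) for pair in zip(perm, perm[1:]))


def satisfies_pseudometric_axioms(n):
    """Exhaustively check symmetry, range, zero-distance pairs and the
    triangle inequality for the bp distance on S_n (small n only)."""
    perms = list(permutations(range(1, n + 1)))
    adj = {p: adjacencies(p) for p in perms}

    def d(x, y):
        return n - 1 - len(adj[x] & adj[y])

    for x in perms:
        for y in perms:
            dxy = d(x, y)
            if dxy != d(y, x) or not 0 <= dxy <= n - 1:
                return False
            # Distance zero occurs exactly for a permutation and its reversal.
            if (dxy == 0) != (y == x or y == x[::-1]):
                return False
            for z in perms:
                if d(x, z) > dxy + d(y, z):
                    return False
    return True


print(satisfies_pseudometric_axioms(4))
```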
We say a pseudometric (or a metric) ρ is right invariant on a group G if for any x, y, z ∈ G, ρ(x, y) = ρ(xz, yz). Left invariance is defined similarly. A pseudometric (metric) which is both right and left invariant is called invariant. Bp distance is an invariant pseudometric on S_n.
Definition 1 Given a set {x_1, ..., x_k} ⊆ S and a pseudometric ρ on S, a median for the set is μ ∈ S such that ∑_{i=1}^{k} ρ(μ, x_i) is minimal.

Defining the geodesic patch
A discrete metric space (S, ρ) is a geodesic space if for any two points x, y ∈ S there exists a finite subset of S containing x, y that is isometric with the discrete line segment [0, 1, ..., ρ(x, y)]. Any subset of S with this property, and there may be several, is called a geodesic between x and y. For example, all connected graphs, with the shortest-path metric, are geodesic spaces. In a geodesic space the medians of two points x and y consist of all the points located on geodesics between x and y.
What can we say when the space is not a geodesic space? To answer this, we extend the concept of geodesic by introducing the concept of a geodesic patch. A geodesic patch between x and y is a maximal subset of S containing x, y which is isometric to a subsegment (not necessarily contiguous) of the line segment [0, 1, ..., ρ(x, y)]. For any two points x, y in an arbitrary metric space (S, ρ) there exists at least one geodesic patch between them because {x, y} is isometric to {0, ρ(x, y)}. In addition, any geodesic is a geodesic patch. Any point z on a geodesic patch between x, y satisfies:

ρ(x, z) + ρ(z, y) = ρ(x, y).

By the triangle inequality, ρ(x, y) is the smallest possible total distance to x and y. Therefore all the medians of two points x and y must lie on a geodesic patch between them. We denote the set of all permutations lying on geodesic patches connecting x, y ∈ S_n by [x, y], as in Figure 1.
(Ŝ_n, d) is not a geodesic space. For example, there is no geodesic connecting the identity permutation id and π := 1 2 x_1 x_2 ... x_{n−4} n−1 n when x_1 x_2 ... x_{n−4} is a non-identity permutation of {3, ..., n − 2}. The smallest change to id is to cut one of its adjacencies, say {i, i + 1}, and rejoin the two segments in one of the three possible ways: 1 to n, 1 to i + 1 or n to i. Now if we cut the adjacency {1, 2} or {n − 1, n} in id, the distance of the new permutation to both id and π increases. If, on the other hand, we cut one of the other adjacencies in id, all the ways of rejoining, which increase the distance to id, either increase or leave unchanged the distance to π, since {1, n}, {1, i + 1} and {n, i} are not adjacencies in A_π. Therefore there is no geodesic connecting id to π.
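The smallest instance of this counterexample can be checked by brute force. The sketch below takes n = 6 and π = 1 2 4 3 5 6 (so x_1 x_2 = 4 3), and searches for a point at distance 1 from id and distance d(id, π) − 1 from π; a geodesic would have to contain such a point. The helper name `points_between` is hypothetical.

```python
from itertools import permutations


def adjacencies(perm):
    """Set of unordered adjacent pairs of a permutation given as a tuple."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance n - 1 - |A_{x,y}|."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


def points_between(x, y, i):
    """All permutations z with d(x, z) = i and d(z, y) = d(x, y) - i."""
    total = bp_distance(x, y)
    return [z for z in permutations(sorted(x))
            if bp_distance(x, z) == i and bp_distance(z, y) == total - i]


identity = (1, 2, 3, 4, 5, 6)
pi = (1, 2, 4, 3, 5, 6)

print(bp_distance(identity, pi))       # 2
# No permutation sits one step from id and one step from pi,
# so no geodesic joins them.
print(points_between(identity, pi, 1))  # []
```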
Although Ŝ_n is not a geodesic space, there may still exist permutations with a geodesic between them. For example, in Ŝ_6 there is a geodesic between id and a permutation π with d(id, π) = 5, the maximum possible distance in Ŝ_6.
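To illustrate, here is one geodesic of length 5 in Ŝ_6, checked programmatically. The chain was constructed by hand for this sketch, by replacing one adjacency of id with one adjacency of π = 1 3 5 2 4 6 at each step; it is not necessarily the example displayed in the original article. The function name `is_geodesic` is hypothetical.

```python
def adjacencies(perm):
    """Set of unordered adjacent pairs of a permutation given as a tuple."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance n - 1 - |A_{x,y}|."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


def is_geodesic(chain):
    """True if the chain is isometric to the segment 0, 1, ..., len(chain) - 1,
    i.e. d(chain[i], chain[j]) = |i - j| for all i, j."""
    return all(bp_distance(chain[i], chain[j]) == abs(i - j)
               for i in range(len(chain))
               for j in range(len(chain)))


# Illustrative chain from id to pi = 1 3 5 2 4 6 (an assumption of this sketch).
chain = [
    (1, 2, 3, 4, 5, 6),
    (2, 1, 3, 4, 5, 6),
    (3, 1, 2, 4, 5, 6),
    (6, 5, 3, 1, 2, 4),
    (1, 3, 5, 6, 4, 2),
    (1, 3, 5, 2, 4, 6),
]

print(bp_distance(chain[0], chain[-1]))  # 5
print(is_geodesic(chain))                # True
```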

The median value and medians of permutations with maximum pairwise distances
In this section we investigate the bp median problem in the case of k permutations with maximum pairwise distances. As we shall see later, this situation is very similar to the case of k uniformly random permutations. Let (S, ρ) be a pseudometric space.
The total distance of a point x ∈ S to a finite subset B ⊆ S, which we write as d_T(x, B), is

d_T(x, B) := ∑_{y∈B} ρ(x, y).

The median value of B, m_{S,ρ}(B), is the infimum of the total distance over all the points x ∈ S, that is

m_{S,ρ}(B) := inf_{x∈S} d_T(x, B). (8)

We can extend this definition to sets with multiplicities. Let ∅ ≠ B ⊆ S. We define a multiplicity function n_B from B to ℕ and write n_B(x) = n_x. We call A = (B, n_B) a set with multiplicities. We define the total distance of a point x ∈ S to A to be

d_T(x, A) := ∑_{y∈B} n_y ρ(x, y).

The definition of median value in Equation (8) can be extended in an analogous way to the median value of a set with multiplicities A. When S is finite, the total distance function takes its minimum on S and "inf" turns into "min" in the above formulation. The points of the space S that minimize the total distance to A are called the median points or medians of A, and the set of all these medians is called the median set of A, denoted by M_{S,ρ}(A). Let A = (B, n_B) be a nonempty subset of S_n with multiplicities, and let [B] := {[x] : x ∈ B}. Then the definition of [A] is straightforward: [A] := ([B], n_{[B]}), where the multiplicity of a class [x] is the sum of the multiplicities n_y over all y ∈ B with y ~ x. We say two nonempty subsets of S_n with multiplicities, namely A and A′, are equivalent, denoted by A ~ A′, if [A] = [A′]. In fact [A] is the equivalence class containing A. We call [A] a subset of Ŝ_n with multiplicities. We use the notations "[ ]" and "~" for all the above concepts without restriction.
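With these definitions we can readily verify that, in the context of bp distance, for A ~ A′ and x ~ x′, the total distance of x to A equals the total distance of x′ to A′. Recall that we use d as both a metric on Ŝ_n and a pseudometric on S_n. Therefore we can conclude that m_{S_n,d}(A) = m_{Ŝ_n,d}([A]) and similarly M_{Ŝ_n,d}([A]) = [M_{S_n,d}(A)].
Proposition 1 Let X := {x_1, ..., x_k} be a set of k permutation classes in Ŝ_n that are pairwise at the maximum distance n − 1 from each other. Then the median value of X is (k − 1)(n − 1). Moreover, m* is a median of X, m* ∈ M(X), if and only if A_{m*} ⊂ ∪_{i=1}^{k} A_{x_i}.
Proof Let π ∈ Ŝ_n be an arbitrary permutation class. Since the sets A_{x_i} are pairwise disjoint,

∑_{i=1}^{k} d(π, x_i) = k(n − 1) − ∑_{i=1}^{k} |A_{π,x_i}| = k(n − 1) − |A_π ∩ ∪_{i=1}^{k} A_{x_i}| ≥ (k − 1)(n − 1).

Equality holds for π = x_i, for any 1 ≤ i ≤ k. This proves the first part of the proposition. For the second part, m* ∈ M(X) is equivalent to the total distance of m* to X being (k − 1)(n − 1), and this is equivalent to

|A_{m*} ∩ ∪_{i=1}^{k} A_{x_i}| = n − 1 = |A_{m*}|, that is, A_{m*} ⊂ ∪_{i=1}^{k} A_{x_i}.

This finishes the proof of the equivalence in the proposition.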
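The characterization of medians for inputs at maximum pairwise distance can be checked by brute force on small cases. The sketch below works on permutations rather than classes (the two views agree, since a permutation and its reversal have the same adjacencies) and assumes genes are labeled 1, ..., n; the function names `median_set` and `predicted_medians` and the two example input sets are hypothetical, chosen so that the inputs share no adjacencies.

```python
from itertools import permutations


def adjacencies(perm):
    """Set of unordered adjacent pairs of a permutation given as a tuple."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def median_set(inputs):
    """Brute-force median value and median set under the bp distance."""
    n = len(inputs[0])
    input_adj = [adjacencies(x) for x in inputs]
    totals = {}
    for m in permutations(range(1, n + 1)):
        a = adjacencies(m)
        totals[m] = sum(n - 1 - len(a & b) for b in input_adj)
    value = min(totals.values())
    return value, {m for m, t in totals.items() if t == value}


def predicted_medians(inputs):
    """Permutations whose adjacencies all occur in some input permutation."""
    n = len(inputs[0])
    union = set().union(*(adjacencies(x) for x in inputs))
    return {m for m in permutations(range(1, n + 1)) if adjacencies(m) <= union}


# k = 2, n = 5: two permutations with no common adjacency.
X = [(1, 2, 3, 4, 5), (1, 3, 5, 2, 4)]
value, medians = median_set(X)
print(value)                            # 4 = (k - 1)(n - 1)
print(medians == predicted_medians(X))  # True
```

A striking case is k = 3, n = 6 with the inputs 1 2 6 3 5 4, 2 3 1 4 6 5 and 3 4 2 5 1 6: their adjacency sets partition all 15 possible pairs, so every permutation of length 6 is a median.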
Lemma 1 Let x, y, z be three permutation classes in Ŝ_n that are pairwise at the maximum distance n − 1 from each other. Then for any w ∈ [x, y] we have d(w, z) = n − 1.
Proof Since w ∈ [x, y], we have A_w ⊂ A_x ∪ A_y. We also know that A_z ∩ (A_x ∪ A_y) = ∅. Hence A_{w,z} = ∅ and d(w, z) = n − 1, which proves the result.
The above lemma simply indicates that for any two points x_i, x_j in the set X in the proposition above, [x_i, x_j] ⊂ M(X). What more can we say about the median positions? The notion of "accessibility" will help us to keep track of some other medians of the set X that are not in ∪_{i,j} [x_i, x_j]. Before defining this concept, we first need more information about the properties of [x, y] for x, y ∈ Ŝ_n.
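Proposition 2 Let x, y, z ∈ Ŝ_n. Then z ∈ [x, y] if and only if A_{x,y} ⊂ A_z ⊂ A_x ∪ A_y.
Proof We know z ∈ [x, y] if and only if d(x, z) + d(z, y) = d(x, y). On the other hand we can write A_z as follows

A_z = A_{x,y,z} ∪ (A_{x,z} \ A_y) ∪ (A_{y,z} \ A_x) ∪ (A_z \ (A_x ∪ A_y)), (23)

where the pairwise intersection of the sets on the right-hand side is empty. We can also write

|A_z| = n − 1

and

|A_{x,z}| = |A_{x,y,z}| + |A_{x,z} \ A_y|

and

|A_{y,z}| = |A_{x,y,z}| + |A_{y,z} \ A_x|.

Now for "sufficiency", suppose z ∈ [x, y]. By the definition of d, we have

|A_{x,z}| + |A_{y,z}| = n − 1 + |A_{x,y}|.

Therefore by Equation (23) we have

n − 1 + |A_{x,y}| = |A_{x,z}| + |A_{y,z}| = n − 1 + |A_{x,y,z}| − |A_z \ (A_x ∪ A_y)| ≤ n − 1 + |A_{x,y,z}|. (26)

Since A_{x,y,z} ⊂ A_{x,y}, this results in |A_{x,y}| = |A_{x,y,z}| and hence in A_{x,y} ⊂ A_z. The inequality in (26) must then be an equality, which is possible only if A_z \ (A_x ∪ A_y) = ∅, that is, A_z ⊂ A_x ∪ A_y. For "necessity", we have

|A_{x,z}| + |A_{y,z}| = n − 1 + |A_{x,y,z}|.

This is true because of A_z ⊂ A_x ∪ A_y and Equation (23). But since A_{x,y} ⊂ A_z ⊂ A_x ∪ A_y we have |A_{x,y}| = |A_{x,y,z}|, and we can replace |A_{x,y,z}| by |A_{x,y}| on the right-hand side of the last equality, which gives d(x, z) + d(z, y) = d(x, y). This finishes the "necessity" proof.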
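The equivalence proved here, that z lies on a geodesic patch between x and y exactly when A_{x,y} ⊂ A_z ⊂ A_x ∪ A_y, can be confirmed exhaustively for small n. This is a brute-force sketch with a hypothetical function name, `patch_characterization_holds`; it loops over all triples of permutations and is therefore only practical for very small n.

```python
from itertools import permutations


def adjacencies(perm):
    """Set of unordered adjacent pairs of a permutation given as a tuple."""
    return frozenset(frozenset(pair) for pair in zip(perm, perm[1:]))


def patch_characterization_holds(n):
    """Check, for all x, y, z in S_n, that d(x,z) + d(z,y) = d(x,y)
    holds exactly when A_{x,y} is contained in A_z and A_z in A_x | A_y."""
    perms = list(permutations(range(1, n + 1)))
    adj = {p: adjacencies(p) for p in perms}
    for x in perms:
        for y in perms:
            common = adj[x] & adj[y]
            union = adj[x] | adj[y]
            dxy = n - 1 - len(common)
            for z in perms:
                dxz = n - 1 - len(adj[x] & adj[z])
                dzy = n - 1 - len(adj[z] & adj[y])
                on_patch = dxz + dzy == dxy
                sandwiched = common <= adj[z] <= union
                if on_patch != sandwiched:
                    return False
    return True


print(patch_characterization_holds(4))
```

In practice the set-inclusion test is the cheaper way to enumerate [x, y], since it avoids computing two distances per candidate.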
We denote the set of all 1-accessible points of X by Z(X). We define Z^0(X) := X. Also for r ∈ ℕ ∪ {0}, by induction, we define Z^{r+1}(X) to be Z(Z^r(X)) and we call it the set of all (r+1)-accessible permutation classes. That is, Z^1(X) = Z(X), Z^2(X) = Z(Z(X)) and so on. It is clear that Z^{r+1}(X) includes Z^r(X) and also ∪_{x,y∈Z^r(X)} [x, y]. A permutation class z is said to be accessible from X if there exists r ∈ ℕ such that z ∈ Z^r(X). We denote the set of all accessible points by Z̄(X) = ∪_{r∈ℕ∪{0}} Z^r(X).
Conjecture 1 Every median point of X is accessible from X, that is, M(X) = Z̄(X).
The median value and medians of k random permutations
In this section we study the median value and median points of k independent random permutation classes chosen uniformly from Ŝ_n. This is equivalent to studying the same problem for k random permutations sampled from S_n. All the results of this section carry over directly to permutations.
We make use of the fact that the bp distance of two independent random permutations tends to be close to its maximum value, n − 1. Xu et al. [4] showed that if we fix a reference linear permutation id and pick a random permutation x uniformly, the expectation and variance of |A_{id,x}| are both very close to 2 for large enough n. Because of the symmetry of the group S_n and the fact that bp distance is an invariant pseudometric, the same results hold for two random permutations x and y. We first summarize the results we need from [4].
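For small n, the mean and variance of |A_{id,x}| can be computed exactly by enumerating all permutations. This sketch is an illustration under that small-n restriction and does not reproduce the asymptotic analysis of [4]; the function name `common_adjacency_moments` is hypothetical. The mean equals 2(n − 1)/n, because each of the n − 1 adjacencies of id appears in a uniform random permutation with probability 2/n.

```python
from fractions import Fraction
from itertools import permutations


def common_adjacency_moments(n):
    """Exact mean and variance of |A_{id,x}| for x uniform on S_n,
    obtained by enumerating all n! permutations (small n only)."""
    identity_adj = {frozenset((i, i + 1)) for i in range(1, n)}
    counts = []
    for x in permutations(range(1, n + 1)):
        x_adj = {frozenset(pair) for pair in zip(x, x[1:])}
        counts.append(len(identity_adj & x_adj))
    total = len(counts)
    mean = Fraction(sum(counts), total)
    variance = Fraction(sum(c * c for c in counts), total) - mean ** 2
    return mean, variance


for n in range(4, 8):
    mean, variance = common_adjacency_moments(n)
    print(n, float(mean), float(variance))
```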
Let ν̂_n be the uniform measure on S_n. Let Π : S_n → Ŝ_n be the natural surjective map sending each permutation onto its corresponding permutation class. Define

ν_n := Π_* ν̂_n (28)

to be the push-forward measure of ν̂_n induced by the map Π. It is clear that ν_n is the uniform measure on Ŝ_n. The following proposition is a reformulation of Theorems 6 and 7 in [4].
Proposition 3 [Xu-Alain-Sankoff] Let x and y be two independent random permutation classes (irpc) chosen uniformly from Ŝ_n. Then the expectation and the variance of |A_{x,y}| are both close to 2 for large n [4].
Define the error function for the distance of x, y by ε_n(x, y) := (n − 1) − d(x, y) = |A_{x,y}|.
Corollary 2 Suppose x and y are two irpc's sampled from the uniform measure ν_n, and a_n is an arbitrary sequence of real numbers diverging to +∞. Then ε_n(x, y)/a_n converges to zero asymptotically ν_n^{*2}-almost surely (a.a.s.), that is, ε_n(x, y)/a_n → 0 in probability.
Proof The proof is straightforward from [4] and Chebyshev's inequality.
Now we are ready to study the median value of k irpc's. Let [A] be a subset of Ŝ_n with multiplicities and with k elements. Define (33)

Figure 2 Accessibility. Illustration of how Z̄ is constructed.