Sets of medians in the non-geodesic pseudometric space of unsigned genomes with breakpoints
BMC Genomics volume 15, Article number: S3 (2014)
Abstract
Background
The breakpoint median in the set S n of permutations on n terms is known to have some unusual behavior, especially if the input genomes are maximally different to each other. The mathematical study of the set of medians is complicated by the facts that breakpoint distance is not a metric but a pseudo-metric, and that it does not define a geodesic space.
Results
We introduce the notion of partial geodesic, or geodesic patch between two permutations, and show that if two permutations are medians, then every permutation on a geodesic patch between them is also a median. We also prove the conjecture that the input permutations themselves are medians.
Background
Among the common measures of gene order difference between two genomes, the edit distances, such as reversal distance or double-cut-and-join distance, contrast with the breakpoint distance in that the former are defined in a geodesic space while the latter is not. Another characteristic of breakpoint distance that it does not share with most other genomic distances is that it is a pseudometric rather than a metric.
A problem in computational comparative genomics that has been extensively studied under many definitions of genomic distance is the gene order median problem [1], the archetypical instance of the gene order small phylogeny problem. The median genome is meant, in the first instance, to embody the information in common among k ≥ 3 given genomes, and second, to estimate the ancestral genome of these k genomes. We have shown that the second goal becomes unattainable as n → ∞, where n is the length of the genomes, if there are more than 0.5n mutational steps changing the gene order [2]. Moreover, we have conjectured, and demonstrated in simulation studies, that where there is little or nothing in common among the k input genomes, the median tends to reflect only one (actually, any one) of them, with no incorporation of information from the other k − 1 [3].
In the present paper, we investigate this conjecture mathematically in the context of a wider study of medians for the breakpoint distance between unsigned linear unichromosomal genomes, although the methods and results are equally valid for genomes with signed and/or circular chromosomes, as well as those with χ >1 chromosomes, where χ is a fixed parameter. Our approach involves first a rigorous treatment of the pseudometric character of the breakpoint distance. Then, given the non-geodesic nature of the space we are able to define a weaker concept of geodesic patch, that we use later, given two or more medians, to locate further medians. We also prove the conjecture that for k genomes containing no gene order information among them, the normalized (divided by n) median score tends to k − 1, with high probability.
Results
From pseudometric to metric
We denote by $S_n$ the set of all permutations of length $n$. Each permutation represents a unichromosomal linear genome in which the numbers represent distinct genes. For a permutation $\pi := \pi_1 \ldots \pi_n$ we define the set of adjacencies of $\pi$ to be all the unordered pairs $\{\pi_i, \pi_{i+1}\} = \{\pi_{i+1}, \pi_i\}$ for $i = 1, \ldots, n-1$. For $I \subseteq S_n$ we denote by $A_I$ the set of all common adjacencies of the elements of $I$. Then $A_{S_n} = \emptyset$ for $n \geq 3$, and we also write $A_\emptyset$ for the set of all pairs $\{i, j\}$, $i \neq j$. For any $I, J \subseteq S_n$, $A_{I \cup J} = A_I \cap A_J$. It will sometimes be convenient to write $A_{\{x_1, \ldots, x_k\}}$, the set of common adjacencies in $I = \{x_1, \ldots, x_k\}$, as $A_{x_1, \ldots, x_k}$. For example, $A_{x,y,z}$ represents the set of adjacencies common to permutations $x$, $y$ and $z$.
For $x, y \in S_n$ we define the breakpoint distance (bp distance) between $x$ and $y$ by
$$d^{(n)}(x, y) := n - 1 - |A_{x,y}|.$$
This distance is not a metric on $S_n$ but rather a pseudometric because of nonreflexiveness: there are cases where $d^{(n)}(x, y) = 0$ but $x \neq y$, namely $x = \pi_1 \ldots \pi_n$ and $y = \pi_n \ldots \pi_1$, for any $x \in S_n$. In these cases, the permutations $x$ and $y$ are said to be equivalent, denoted by $x \sim y$. The equivalence class containing $\pi$ is represented by $[\pi]$ and contains exactly two permutations, $\pi_1 \ldots \pi_n$ and its reversal $\pi_n \ldots \pi_1$. The number of classes is thus $n!/2$. The bp distance induces a metric on the set of all equivalence classes of $S_n$, defined by
$$d^{(n)}([x], [y]) := d^{(n)}(x, y).$$
Where there is no risk of ambiguity, we can simplify the notation by using x and y instead of [x] and [y], and/or drop the superscript n.
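The definition of the bp distance can be illustrated with a minimal Python sketch. It assumes the form $d(x, y) = n - 1 - |A_{x,y}|$, which is consistent with the maximum distance $n - 1$ and with a permutation being at distance zero from its reversal; the function names `adjacencies` and `bp_distance` are illustrative and not taken from the article.

```python
def adjacencies(perm):
    """Set of unordered adjacencies {perm[i], perm[i+1]} of a linear permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance n - 1 - |A_{x,y}| (assumed form of the definition)."""
    if sorted(x) != sorted(y):
        raise ValueError("x and y must be permutations of the same genes")
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


x = (3, 1, 4, 2, 5)
x_reversed = x[::-1]

# A permutation and its reversal share all adjacencies: distance 0 although x != x_reversed.
print(bp_distance(x, x_reversed))
# The identity and x share no adjacency: maximum distance n - 1 = 4.
print(bp_distance((1, 2, 3, 4, 5), x))
```

Representing each adjacency as a `frozenset` makes the pair unordered, so reading a genome in either direction yields the same adjacency set.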
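It is clear that the maximum possible bp distance between two permutation classes is $n - 1$, attained when they have no common adjacencies. Bp distance is symmetric on $S_n$ and hence on the set of permutation classes. By construction, it is reflexive on the classes, that is, $d([x], [y]) = 0$ only if $[x] = [y]$. To verify the triangle inequality, consider three permutations $x, y, z$. Since $A_{x,y} \cup A_{y,z} \subseteq A_y$ and $A_{x,y} \cap A_{y,z} = A_{x,y,z}$, we have
$$|A_{x,y}| + |A_{y,z}| - |A_{x,y,z}| = |A_{x,y} \cup A_{y,z}| \leq |A_y| = n - 1.$$
Therefore
$$d(x, y) + d(y, z) = 2(n - 1) - |A_{x,y}| - |A_{y,z}| \geq n - 1 - |A_{x,y,z}|.$$
But $A_{x,y,z} \subseteq A_{x,z}$, so $n - 1 - |A_{x,y,z}| \geq n - 1 - |A_{x,z}| = d(x, z)$, and hence the triangle inequality holds.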
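As a sanity check on the pseudometric properties discussed above, the following sketch verifies them exhaustively on $S_4$. It assumes the distance $d(x, y) = n - 1 - |A_{x,y}|$; the helper `check_pseudometric` is a hypothetical name introduced here for illustration.

```python
from itertools import permutations, product


def adjacencies(perm):
    """Set of unordered adjacencies of a linear permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance n - 1 - |A_{x,y}| (assumed form of the definition)."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


def check_pseudometric(n):
    """Check symmetry and the triangle inequality on S_n by brute force.

    Returns (only_reversals, count): whether every pair of distinct
    permutations at distance 0 is a permutation and its reversal, and
    the number of such ordered pairs.
    """
    perms = list(permutations(range(1, n + 1)))
    for x, y, z in product(perms, repeat=3):
        assert bp_distance(x, y) == bp_distance(y, x)
        assert bp_distance(x, z) <= bp_distance(x, y) + bp_distance(y, z)
    zero_pairs = [(x, y) for x, y in product(perms, repeat=2)
                  if x != y and bp_distance(x, y) == 0]
    return all(y == x[::-1] for x, y in zero_pairs), len(zero_pairs)


only_reversals, count = check_pseudometric(4)
print(only_reversals, count)
```

An exhaustive check of this kind is only feasible for very small $n$, since the loop visits $(n!)^3$ triples.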
We say a pseudometric (or a metric) $\rho$ is right invariant on a group $G$ if $\rho(x, y) = \rho(xz, yz)$ for any $x, y, z \in G$. The definition of left invariance is similar. A pseudometric (metric) which is both right and left invariant is called invariant. Bp distance is an invariant pseudometric on $S_n$.
Definition 1 Given a set $\{x_1, \ldots, x_k\} \subseteq S$ and a pseudometric $\rho$ on $S$, a median for the set is $\mu \in S$ such that $\sum_{i=1}^{k} \rho(\mu, x_i)$ is minimal.
Defining the geodesic patch
A discrete metric space (S, ρ) is a geodesic space if for any two points x, y ∈ S there exists a finite subset of S containing x, y that is isometric with the discrete line segment [0, 1, ..., ρ(x, y)]. Any subset of S with this property, and there may be several, is called a geodesic between x and y. For example, all connected graphs are geodesic spaces. In a geodesic space the medians of two points x and y consist of all the points located on geodesics between x and y.
What can we say when the space is not a geodesic space? To answer this, we extend the concept of geodesic by introducing the concept of a geodesic patch. A geodesic patch between x and y is a maximal subset of S containing x, y which is isometric to a subsegment (not necessarily contiguous) of the line segment [0, 1, ..., ρ(x, y)]. For any two points x, y in an arbitrary metric space (S, ρ) there exists at least one geodesic patch between them because {x, y} is isometric to {0, ρ(x, y)}. In addition, any geodesic is a geodesic patch. Any point z on a geodesic patch between x, y satisfies:
$$\rho(x, z) + \rho(z, y) = \rho(x, y).$$
Therefore all the medians of two points x and y must lie on a geodesic patch between them. We denote the set of all permutations lying on geodesic patches connecting $x, y \in S_n$ by $\overline{[x, y]}$, as in Figure 1.
The set of permutation classes with the bp distance is not a geodesic space. For example there is no geodesic connecting the identity permutation id and $\pi := 1\ 2\ x_1\ x_2 \ldots x_{n-4}\ n-1\ n$ when $x_1 x_2 \ldots x_{n-4}$ is a non-identical permutation on $\{3, \ldots, n - 2\}$. The smallest change to id is to cut one of its adjacencies, say $\{i, i + 1\}$, and rejoin the two segments in one of the three possible ways: 1 to $n$, 1 to $i + 1$ or $n$ to $i$. Now if we cut the adjacencies $\{1, 2\}$ or $\{n - 1, n\}$ in id, the distance of the new permutation to both id and $\pi$ increases. If on the other hand we cut one of the other adjacencies in id, all the ways of rejoining, which increase the distance to id, either increase or leave unchanged the distance to $\pi$, since $\{1, n\}$, $\{1, i + 1\}$ and $\{n, i\}$ are not adjacencies in $\pi$. Therefore there is no geodesic connecting id to $\pi$.
Although this space is not a geodesic space, there may still exist pairs of permutations with a geodesic between them. For example, for $n = 6$ there are permutations $\pi$ with $d(\mathrm{id}, \pi) = 5$, the maximum possible distance for $n = 6$, that are joined to id by a geodesic.
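Both phenomena can be checked by brute force for $n = 6$. The sketch below assumes the distance $d(x, y) = n - 1 - |A_{x,y}|$. The chain it tests was constructed for this illustration and is not necessarily the example displayed in the original article; `is_geodesic` and `points_between` are hypothetical helper names.

```python
from itertools import combinations, permutations


def adjacencies(perm):
    """Set of unordered adjacencies of a linear permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance n - 1 - |A_{x,y}| (assumed form of the definition)."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


def is_geodesic(chain):
    """True if chain[i] and chain[j] are at bp distance |i - j| for all i < j."""
    return all(bp_distance(chain[i], chain[j]) == j - i
               for i, j in combinations(range(len(chain)), 2))


def points_between(x, y):
    """All permutations z with d(x, z) + d(z, y) = d(x, y)."""
    return {z for z in permutations(sorted(x))
            if bp_distance(x, z) + bp_distance(z, y) == bp_distance(x, y)}


# Illustrative chain (an assumption of this sketch) from id to a permutation at distance 5.
chain = [
    (1, 2, 3, 4, 5, 6),
    (2, 1, 3, 4, 5, 6),
    (3, 1, 2, 4, 5, 6),
    (3, 1, 2, 4, 6, 5),
    (3, 1, 5, 6, 4, 2),
    (3, 1, 5, 2, 4, 6),
]
print(is_geodesic(chain))

# A pair of the form 1 2 x_1 ... x_{n-4} n-1 n: only the endpoints and their reversals lie between.
identity = (1, 2, 3, 4, 5, 6)
pi = (1, 2, 4, 3, 5, 6)
between = points_between(identity, pi)
print(len(between))
```

Here `identity` and `pi` are at distance 2, yet no permutation sits at distance 1 from both, which is the obstruction to a geodesic described in the text.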
The median value and medians of permutations with maximum pairwise distances
In this section we investigate the bp median problem in the case of k permutations with maximum pairwise distances. As we shall see later, this situation is very similar to the case of k uniformly random permutations. Let (S, ρ) be a pseudometric space.
The total distance of a point x ∈ S to a finite subset ∅ ≠ B ⊆ S is defined to be
$$\rho(x, B) := \sum_{y \in B} \rho(x, y).$$
The median value of B, $m_{S,\rho}(B)$, is the infimum of the total distance, where the infimum is over all the points x ∈ S, that is
$$m_{S,\rho}(B) := \inf_{x \in S} \rho(x, B). \tag{8}$$
We can extend this definition to sets with multiplicities. Let ∅ ≠ B ⊆ S. We define a multiplicity function $n_B$ from B to $\mathbb{N}$ and write $n_B(x) = n_x$. We call $A = (B, n_B)$ a set with multiplicities. We define the total distance of a point x ∈ S to A to be
$$\rho(x, A) := \sum_{y \in B} n_y\, \rho(x, y).$$
The definition of median value in Equation (8) can be extended in an analogous way to the median value of a set with multiplicity A. When S is finite then the total distance function takes its minimum on S and "inf" turns into "min" in the above formulation. The points of the space S that minimize the total distance to A are called the median points or medians of A and the set of all these medians is called the median set of A, denoted by M S,ρ(A).
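For small $n$ the median value and the median set of a set with multiplicities can be computed by exhaustive search. The sketch below assumes the bp distance $d(x, y) = n - 1 - |A_{x,y}|$ and a total distance weighted by multiplicity; representing a set with multiplicities as a Python `dict` is a choice made here, not the article's.

```python
from itertools import permutations


def adjacencies(perm):
    """Set of unordered adjacencies of a linear permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance n - 1 - |A_{x,y}| (assumed form of the definition)."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


def total_distance(x, weighted):
    """Total distance of x to a set with multiplicities {permutation: multiplicity}."""
    return sum(mult * bp_distance(x, y) for y, mult in weighted.items())


def median_set(weighted):
    """Return (median value, set of medians) by brute force over S_n."""
    genes = sorted(next(iter(weighted)))
    scores = {x: total_distance(x, weighted) for x in permutations(genes)}
    best = min(scores.values())
    return best, {x for x, score in scores.items() if score == best}


# The identity counted twice and one permutation sharing no adjacency with it.
A = {(1, 2, 3, 4): 2, (2, 4, 1, 3): 1}
value, medians = median_set(A)
print(value, sorted(medians))
```

Because the distance is a pseudometric, the median set computed on permutations is closed under reversal, as the treatment of equivalence classes in the text suggests.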
Let $B$ and $A = (B, n_B)$ be a subset and a subset with multiplicities of $S_n$. We define $[B]$ to be the set of all permutation classes of $S_n$ that have at least one of their permutations in $B$. That is
$$[B] := \{[x] : x \in B\}.$$
Two nonempty subsets $B, B' \subseteq S_n$ are said to be equivalent, denoted by $B \sim B'$, if $[B] = [B']$. Also we define $[n_B]$ to be a function from $[B]$ to $\mathbb{N}$ with
$$[n_B]([x]) := \sum_{y \in [x] \cap B} n_y.$$
Then the definition of $[A]$ is straightforward:
$$[A] := ([B], [n_B]),$$
and we say two nonempty subsets of $S_n$ with multiplicities, namely $A$ and $A'$, are equivalent, denoted by $A \sim A'$, if $[A] = [A']$. In fact $[A]$ is the equivalence class containing $A$. We call $[A]$ a set of permutation classes with multiplicities. We use the notations "[ ]" and "~" for all the above concepts without restriction.
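With these definitions we can readily verify that in the context of bp distance, for $A \sim A'$ and $x \sim x'$, we have
$$d(x, A) = d(x', A') = d([x], [A]).$$
Recall that we use $d$ as both a metric on the set of permutation classes and a pseudometric on $S_n$. Therefore we can conclude that the median value of $A$ in $S_n$ equals the median value of $[A]$ among the permutation classes, and similarly that the median set of $[A]$ consists of the classes of the medians of $A$.
Henceforward, we will simplify the notation by writing $m_n(A)$ and $M_n(A)$ for the median value and the median set of $A$ under $d^{(n)}$ on $S_n$. Also for a set $[A]$ of permutation classes with multiplicities, we will use the notation $m_n([A])$ and $M_n([A])$ for its median value and median set. Where there is no ambiguity we will suppress the subscript $n$.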
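Proposition 1 Suppose $X = \{x_1, \ldots, x_k\}$ is a set of permutation classes such that $d(x_i, x_j) = n - 1$ for any $i \neq j$, $1 \leq i, j \leq k$. Then the bp median value of $X$ is $(k - 1)(n - 1)$. Moreover, $m^*$ is a median of $X$, $m^* \in M(X)$, if and only if $A_{m^*} \subset \bigcup_{i=1}^{k} A_{x_i}$.
Proof Let $\pi$ be an arbitrary permutation class. Since $A_{\pi, x_i} \subset A_\pi$ and $A_{\pi, x_i} \cap A_{\pi, x_j} = \emptyset$ for any $1 \leq i \neq j \leq k$, we have $\sum_{i=1}^{k} |A_{\pi, x_i}| \leq n - 1$. Also
$$d(\pi, X) = \sum_{i=1}^{k} d(\pi, x_i) = k(n - 1) - \sum_{i=1}^{k} |A_{\pi, x_i}|.$$
Therefore
$$d(\pi, X) \geq (k - 1)(n - 1).$$
Hence
$$m(X) \geq (k - 1)(n - 1).$$
The equality holds letting $\pi = x_i$ for any $1 \leq i \leq k$. This proves the first part of the proposition. For the second part we know that $m^* \in M(X)$ is equivalent to the fact that the total distance of $m^*$ to $X$ is $(k - 1)(n - 1)$, and this is equivalent to $\sum_{i=1}^{k} |A_{m^*, x_i}| = n - 1$, which can be written as $A_{m^*} \subset \bigcup_{i=1}^{k} A_{x_i}$. This finishes the proof of the equivalence relation in the proposition.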
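Proposition 1 can be checked exhaustively on a small instance. The sketch below uses $n = 7$ and $k = 3$ with three permutations chosen here (an assumption of this example, built from steps of 1, 2 and 3 modulo 7) so that no two share an adjacency. It assumes the distance $d(x, y) = n - 1 - |A_{x,y}|$ and works with permutations rather than classes, which does not change distances.

```python
from itertools import permutations


def adjacencies(perm):
    """Set of unordered adjacencies of a linear permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance n - 1 - |A_{x,y}| (assumed form of the definition)."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


# Three genomes with pairwise disjoint adjacency sets (pairwise distance n - 1 = 6).
X = [
    (1, 2, 3, 4, 5, 6, 7),
    (1, 3, 5, 7, 2, 4, 6),
    (1, 4, 7, 3, 6, 2, 5),
]
n, k = 7, len(X)

# Brute-force total distance of every permutation to X.
scores = {p: sum(bp_distance(p, x) for x in X)
          for p in permutations(range(1, n + 1))}
median_value = min(scores.values())
medians = {p for p, score in scores.items() if score == median_value}

# Characterisation in the proposition: all adjacencies of a median come from the inputs.
union = set().union(*(adjacencies(x) for x in X))
predicted = {p for p in scores if adjacencies(p) <= union}

print(median_value == (k - 1) * (n - 1), medians == predicted)
```

In this instance the inputs cover 18 of the 21 possible adjacencies, so a large share of $S_7$ consists of medians, which illustrates how uninformative the median is when the inputs have nothing in common.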
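Lemma 1 Let $x, y, z$ be three permutation classes that are pairwise at a maximum distance $n - 1$ from each other. Then for any $w$ on a geodesic patch between $x$ and $y$ we have $d(w, z) = n - 1$.
Proof Since $w$ lies on a geodesic patch between $x$ and $y$, we have $A_w \subset A_x \cup A_y$. Also we know that $(A_x \cup A_y) \cap A_z = \emptyset$. This concludes the result.
The above lemma simply indicates that every point on a geodesic patch between any two points $x_i, x_j$ of the set $X$ in the proposition above is a median of $X$, since the total distance of each such point to $X$ is $(k - 1)(n - 1)$.
Corollary 1 Suppose $X = \{x_1, \ldots, x_k\}$ is a set of permutation classes such that $d(x_i, x_j) = n - 1$ for any $i \neq j$. Then every permutation class on a geodesic patch between two elements of $X$ belongs to $M(X)$; in particular $X \subset M(X)$.
What more can we say about the median positions? The notion of "accessibility" will help us to keep track of some other medians of the set $X$ that are not on a geodesic patch between two elements of $X$. Before defining this concept, we first need more information about the properties of the points on geodesic patches between two permutation classes $x$ and $y$.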
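Lemma 2 Let $x, y, z$ be permutation classes. Then $z$ lies on a geodesic patch between $x$ and $y$ if and only if $A_{x,y} \subset A_z \subset A_x \cup A_y$.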
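Proof We know that $z$ lies on a geodesic patch between $x$ and $y$ if and only if $d(x, z) + d(z, y) = d(x, y)$, that is, if and only if
$$|A_{x,z}| + |A_{y,z}| = (n - 1) + |A_{x,y}|.$$
On the other hand we can write $A_z$ as follows
$$A_z = A_{x,y,z} \cup (A_{x,z} \setminus A_y) \cup (A_{y,z} \setminus A_x) \cup (A_z \setminus (A_x \cup A_y)),$$
where the pairwise intersection of the sets on the right hand side is empty. We can also write $|A_{x,z}| = |A_{x,y,z}| + |A_{x,z} \setminus A_y|$ and $|A_{y,z}| = |A_{x,y,z}| + |A_{y,z} \setminus A_x|$, so that taking cardinalities gives
$$n - 1 = |A_z| = |A_{x,z}| + |A_{y,z}| - |A_{x,y,z}| + |A_z \setminus (A_x \cup A_y)|.$$
Now for "sufficiency", suppose $d(x, z) + d(z, y) = d(x, y)$. Substituting the first identity into the last one, we have
$$\left(|A_{x,y}| - |A_{x,y,z}|\right) + |A_z \setminus (A_x \cup A_y)| = 0.$$
Both terms are nonnegative, so both vanish. This results in $|A_{x,y}| = |A_{x,y,z}|$ and hence in $A_{x,y} \subset A_z$, and in $A_z \setminus (A_x \cup A_y) = \emptyset$, which gives $A_z \subset A_x \cup A_y$.
For "necessity", suppose $A_{x,y} \subset A_z \subset A_x \cup A_y$. Then $|A_{x,y}| = |A_{x,y,z}|$ and $|A_z \setminus (A_x \cup A_y)| = 0$, so the cardinality identity for $A_z$ becomes $|A_{x,z}| + |A_{y,z}| = (n - 1) + |A_{x,y}|$, which is equivalent to $d(x, z) + d(z, y) = d(x, y)$. This finishes the "necessity" proof.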
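The characterisation in Lemma 2 can be verified exhaustively for small $n$. The sketch below assumes the distance $d(x, y) = n - 1 - |A_{x,y}|$ and reads the lemma as: $z$ satisfies $d(x, z) + d(z, y) = d(x, y)$ exactly when $A_{x,y} \subset A_z \subset A_x \cup A_y$. The function name `check_lemma2` is hypothetical.

```python
from itertools import permutations, product


def adjacencies(perm):
    """Set of unordered adjacencies of a linear permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def bp_distance(x, y):
    """Breakpoint distance n - 1 - |A_{x,y}| (assumed form of the definition)."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


def check_lemma2(n):
    """Compare the metric condition with the adjacency condition on all triples of S_n."""
    perms = list(permutations(range(1, n + 1)))
    for x, y, z in product(perms, repeat=3):
        on_patch = bp_distance(x, z) + bp_distance(z, y) == bp_distance(x, y)
        a_x, a_y, a_z = adjacencies(x), adjacencies(y), adjacencies(z)
        sandwiched = (a_x & a_y) <= a_z <= (a_x | a_y)
        if on_patch != sandwiched:
            return False
    return True


lemma2_holds = check_lemma2(4)
print(lemma2_holds)
```

The adjacency condition is much cheaper to test than the metric one when many candidate points are screened against a fixed pair, since $A_x \cap A_y$ and $A_x \cup A_y$ can be computed once.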
Definition 2 Let $X := \{x_1, \ldots, x_k\}$ be a set of permutation classes. We say a permutation class $z$ is 1-accessible from $X$ if there exist an $m \in \mathbb{N}$, a finite sequence $y_1, \ldots, y_m$ where $y_i \in X$, and permutation classes $z_1, \ldots, z_m$ such that $z_1 = y_1$, $z_m = z$ and $z_{i+1}$ lies on a geodesic patch between $z_i$ and $y_{i+1}$ for $1 \leq i \leq m - 1$. See Figure 2.
We denote the set of all 1-accessible points of $X$ by $Z(X)$. We define $Z_0(X) := X$. Also for $r \in \mathbb{N} \cup \{0\}$, by induction, we define $Z_{r+1}(X)$ to be $Z(Z_r(X))$ and we call it the set of all $(r+1)$-accessible permutation classes. That is $Z_1(X) = Z(X)$, $Z_2(X) = Z(Z(X))$ and so on. It is clear that $Z_{r+1}(X)$ includes $Z_r(X)$. A permutation class $z$ is said to be accessible from $X$ if there exists $r \in \mathbb{N}$ such that $z \in Z_r(X)$. The set of all accessible points is therefore $\bigcup_{r \geq 0} Z_r(X)$.
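Note that applying $Z$ to the set of all accessible points yields nothing new, that is $Z\big(\bigcup_{r \geq 0} Z_r(X)\big) = \bigcup_{r \geq 0} Z_r(X)$. This holds because for any permutation class $z$ that is 1-accessible from the set of accessible points, there must exist $r_0$ and $y_1, \ldots, y_m \in Z_{r_0}(X)$ (the $y_i$'s must be accessible and the sets $Z_r(X)$ are nested, thus there must be such an $r_0$) and permutation classes $z_1, \ldots, z_m$ such that $z_1 = y_1$, $z_m = z$ and $z_{i+1}$ lies on a geodesic patch between $z_i$ and $y_{i+1}$. Therefore $z \in Z_{r_0 + 1}(X)$. We can then conclude the stated equality.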
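Proposition 2 Suppose $X = \{x_1, \ldots, x_k\}$ is a set of permutation classes such that $d(x_i, x_j) = n - 1$ for any $i \neq j$. Then for any permutation class $z$ accessible from $X$ the total distance $d(z, X)$ between $z$ and $X$ is $(k - 1)(n - 1)$ and hence $\bigcup_{r \geq 0} Z_r(X) \subset M(X)$. Furthermore if $m_1, m_2 \in M(X)$ then every permutation class on a geodesic patch between $m_1$ and $m_2$ belongs to $M(X)$.
Proof Suppose $m_1, m_2 \in M(X)$ and $m^*$ lies on a geodesic patch between $m_1$ and $m_2$. By Lemma 2 and Proposition 1 we have $A_{m^*} \subset A_{m_1} \cup A_{m_2} \subset \bigcup_{i=1}^{k} A_{x_i}$. Applying Proposition 1 again, we have $m^* \in M(X)$. Now it suffices to show that for any $r \in \mathbb{N} \cup \{0\}$, $Z_r(X) \subset M(X)$. We prove this by induction. For $r = 0$ this follows from Corollary 1. Suppose $Z_r(X) \subset M(X)$. By definition we have $Z_{r+1}(X) = Z(Z_r(X))$. That is, for $z \in Z_{r+1}(X)$ there exist an $m \in \mathbb{N}$, $y_1, \ldots, y_m \in Z_r(X)$ and permutation classes $z_1, \ldots, z_m$ such that $z_1 = y_1$, $z_m = z$ and $z_{i+1}$ lies on a geodesic patch between $z_i$ and $y_{i+1}$. Since $y_1, \ldots, y_m \in Z_r(X) \subset M(X)$, we have $z_1 = y_1 \in M(X)$, and by the fact we proved above $z_2 \in M(X)$. Continuing this we conclude that $z_1, z_2, \ldots, z_m = z \in M(X)$. Hence $Z_{r+1}(X) \subset M(X)$. This finishes the proof.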
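The accessibility construction can be explored computationally for small $n$. The sketch below assumes the distance $d(x, y) = n - 1 - |A_{x,y}|$, treats "on a geodesic patch between $x$ and $y$" as the condition $d(x, z) + d(z, y) = d(x, y)$, and reads 1-accessibility as the closure of $X$ under moving from a reached point $z$ to any patch point between $z$ and some $y \in X$. The function names and the two input genomes are choices made for this illustration.

```python
from functools import lru_cache
from itertools import permutations


@lru_cache(maxsize=None)
def adjacencies(perm):
    """Frozen set of unordered adjacencies of a linear permutation (tuple)."""
    return frozenset(frozenset(pair) for pair in zip(perm, perm[1:]))


@lru_cache(maxsize=None)
def bp_distance(x, y):
    """Breakpoint distance n - 1 - |A_{x,y}| (assumed form of the definition)."""
    return len(x) - 1 - len(adjacencies(x) & adjacencies(y))


@lru_cache(maxsize=None)
def patch_points(x, y):
    """All z with d(x, z) + d(z, y) = d(x, y)."""
    return frozenset(z for z in permutations(sorted(x))
                     if bp_distance(x, z) + bp_distance(z, y) == bp_distance(x, y))


def one_accessible(X):
    """Z(X): points reachable from X by repeatedly stepping to patch points towards X."""
    reached, frontier = set(X), list(X)
    while frontier:
        z = frontier.pop()
        for y in X:
            new = patch_points(z, y) - reached
            reached |= new
            frontier.extend(new)
    return reached


def accessible(X):
    """Union of Z_r(X) over r, obtained by iterating Z until nothing new appears."""
    current = set(X)
    while True:
        following = one_accessible(current)
        if following == current:
            return current
        current = following


# Two genomes on five genes with no common adjacency (distance n - 1 = 4).
X = [(1, 2, 3, 4, 5), (2, 4, 1, 3, 5)]
scores = {p: sum(bp_distance(p, x) for x in X) for p in permutations(range(1, 6))}
median_value = min(scores.values())
medians = {p for p, score in scores.items() if score == median_value}
reachable = accessible(X)
print(len(reachable), len(medians), reachable <= medians)
```

The sketch only tests the inclusion stated in Proposition 2; whether the two sets coincide is the subject of the conjecture that follows, and a small computation of this kind is not evidence for the general case.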
Conjecture 1 Every median point of $X$ is accessible from $X$, that is $M(X) \subset \bigcup_{r \geq 0} Z_r(X)$.
The median value and medians of k random permutations
In this section we study the median value and median points of k independent random permutation classes uniformly chosen from the set of permutation classes. This is equivalent to studying the same problem for k random permutations sampled from $S_n$. All the results of this section carry over to permutations without any problem.
We make use of the fact that the bp distance of two independent random permutations tends to be close to its maximum value, $n - 1$. Xu et al. [4] showed that if we fix a reference linear permutation id and pick a random permutation $x$ uniformly, the expectation and the variance of the number of common adjacencies $|A_{\mathrm{id}, x}|$ are both very close to 2 for large enough $n$. Because of the symmetry of the group $S_n$ and the fact that bp distance is an invariant pseudometric, the same results hold for two random permutations $x$ and $y$. We first summarize the results we need from [4].
Let $P_n$ be the uniform measure on $S_n$. Let $\Pi$ be the natural surjective map sending each permutation onto its corresponding permutation class.
Define
$$\hat{P}_n := P_n \circ \Pi^{-1}$$
to be the push-forward measure of $P_n$ induced by the map $\Pi$. It is clear that $\hat{P}_n$ is the uniform measure on the set of permutation classes. The following proposition is a reformulation of Theorems 6 and 7 in [4].
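Proposition 3 [Xu-Alain-Sankoff] Let $x$ and $y$ be two independent random permutation classes (irpc) chosen uniformly from the set of permutation classes. Then, for large $n$, the expectation and the variance of the number of common adjacencies $|A_{x,y}| = n - 1 - d^{(n)}(x, y)$ are both close to 2.
Define the error function for the distance of $x, y$ by
$$\varepsilon_n(x, y) := n - 1 - d^{(n)}(x, y).$$
Corollary 2 Suppose $x$ and $y$ are two irpc's sampled from the uniform measure and $(a_n)$ is an arbitrary sequence of real numbers diverging to $+\infty$. Then $\varepsilon_n(x, y)/a_n$ converges to zero asymptotically almost surely (a.a.s.), that is, for every $\delta > 0$ the probability that $\varepsilon_n(x, y)/a_n > \delta$ tends to 0 as $n \to \infty$.
Proof The proof is straightforward from [4] and Chebyshev's inequality.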
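The behaviour of the number of common adjacencies can be illustrated by simulation. The sketch below draws pairs of uniform random permutations and estimates the mean and variance of $|A_{x,y}|$. The sample sizes, the seed and the function names are arbitrary choices made here. As a side calculation not taken from the article, linearity of expectation gives an exact mean of $2(n - 1)/n$ for unsigned linear permutations, since each of the $n - 1$ adjacencies of $x$ appears in $y$ with probability $2/n$.

```python
import random


def adjacencies(perm):
    """Set of unordered adjacencies of a linear permutation."""
    return {frozenset(pair) for pair in zip(perm, perm[1:])}


def simulate_common_adjacencies(n, trials, seed=0):
    """Estimate mean and variance of |A_{x,y}| for independent uniform x, y in S_n."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        x = list(range(n))
        y = list(range(n))
        rng.shuffle(x)
        rng.shuffle(y)
        counts.append(len(adjacencies(x) & adjacencies(y)))
    mean = sum(counts) / trials
    variance = sum((c - mean) ** 2 for c in counts) / (trials - 1)
    return mean, variance


mean, variance = simulate_common_adjacencies(n=200, trials=2000)
print(round(mean, 2), round(variance, 2))
```

A mean near 2 regardless of $n$ is what makes the error $n - 1 - d(x, y)$ negligible after division by any sequence diverging to infinity.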
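Now we are ready to study the median value of k irpc's. Let $[A]$ be a set of permutation classes with multiplicities and with k elements. For a sequence $(a_n)$ of positive real numbers, define
$$\tilde{m}_n([A]) := \frac{(k - 1)(n - 1) - m_n([A])}{a_n}.$$
Theorem 1 Let $X^{(n)} = \{x_1^{(n)}, \ldots, x_k^{(n)}\}$ be a set of k irpc's sampled from the uniform measure on the set of permutation classes. Then their breakpoint median value tends to be close to its maximum after a convenient rescaling with high probability, that is, for any arbitrary sequence $a_n \to \infty$, $\tilde{m}_n(X^{(n)}) \to 0$ in probability as $n \to \infty$.
Proof Let $\pi$ be an arbitrary point of $S_n$. Let $X^{(n)} = \{x_1, \ldots, x_k\}$. Since $\sum_{i} |A_{\pi, x_i}| \leq |A_\pi| + \sum_{i < j} |A_{x_i, x_j}|$, we have
$$d(\pi, X^{(n)}) = k(n - 1) - \sum_{i=1}^{k} |A_{\pi, x_i}| \geq (k - 1)(n - 1) - \binom{k}{2} \varepsilon_n^*,$$
where $\varepsilon_n^*$ is $\max_{i \neq j} \varepsilon_n(x_i, x_j)$, the largest number of adjacencies shared by two of the inputs. On the other hand $m_n(X^{(n)}) \leq (k - 1)(n - 1)$. The reason is the same as has already been discussed in the proof of Proposition 1: the total distance of any $x_i$ to $X^{(n)}$ is at most $(k - 1)(n - 1)$. Therefore subtracting $(k - 1)(n - 1)$ we have
$$-\binom{k}{2} \varepsilon_n^* \leq m_n(X^{(n)}) - (k - 1)(n - 1) \leq 0.$$
Dividing by $a_n$ and letting $n$ go to $\infty$, the result follows from the last corollary.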
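Theorem 2 Let $X^{(n)}$ be a set of k irpc's sampled from the uniform measure on the set of permutation classes. Then for any permutation class $z^{(n)}$ accessible from $X^{(n)}$ the total distance of $z^{(n)}$ to $X^{(n)}$ is close to $(k - 1)(n - 1)$ with high probability after a convenient rescaling. More explicitly, for any arbitrary sequence of real numbers $(a_n)$ diverging to $\infty$,
$$\frac{d(z^{(n)}, X^{(n)}) - (k - 1)(n - 1)}{a_n} \to 0 \quad \text{in probability.}$$
Therefore
$$\frac{d(z^{(n)}, X^{(n)}) - m_n(X^{(n)})}{a_n} \to 0 \quad \text{in probability.}$$
Furthermore if $m_1^{(n)}, m_2^{(n)} \in M_n(X^{(n)})$ then for any $m^{(n)}$ on a geodesic patch between them the same convergence holds with $z^{(n)}$ replaced by $m^{(n)}$.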
Proof The structure of the proof is similar to the proof of Proposition 1. Suppose with . Let be as defined in the proof of Theorem 1. Then by the same discussion we have
Therefore
and
From Theorem 1 we have
Hence
It suffices to show that has the same property, that is . But this is clear by induction. For the second part of the theorem let . Suppose . By Theorem 1 in probability for i = 1, 2. On the other hand we have .
Therefore
Therefore
since
The statement follows from the last inequality.
Conclusions
We have shown that the median value for a set of random permutations tends to be close to its extreme value with high probability. Also it has been shown that every permutation accessible from a set of random permutations can be considered as a median of that set asymptotically almost surely, and conjectured that the converse is true, that every median is accessible from the original set in this way.
Further work is needed to characterize the existence and size of non-trivial geodesic patches, in order to assess how extensive the set of medians is.
References
1. Tannier E, Zheng C, Sankoff D: Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics. 2009, 10: 120. doi:10.1186/1471-2105-10-120.
2. Jamshidpey A, Sankoff D: Phase change for the accuracy of the median value in estimating divergence time. BMC Bioinformatics. 2013, 14 (Suppl 15): S7. doi:10.1186/1471-2105-14-S15-S7.
3. Haghighi M, Sankoff D: Medians seek the corners, and other conjectures. BMC Bioinformatics. 2012, 13 (Suppl 19): S5. doi:10.1186/1471-2105-13-S19-S5.
4. Xu AW, Alain B, Sankoff D: Poisson adjacency distributions in genome comparison: multichromosomal, circular, signed and unsigned cases. Bioinformatics. 2008, 24: i146-i152. doi:10.1093/bioinformatics/btn295.
Acknowledgements
Research supported in part by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC). DS holds the Canada Research Chair in Mathematical Genomics.
Declarations
The publication charges for this article were funded by the Canada Research Chair in Mathematical Genomics, and by the University of Ottawa.
This article has been published as part of BMC Genomics Volume 15 Supplement 6, 2014: Proceedings of the Twelfth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S6.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
All authors participated in the research, wrote the paper, read and approved the manuscript.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Jamshidpey, A., Jamshidpey, A. & Sankoff, D. Sets of medians in the non-geodesic pseudometric space of unsigned genomes with breakpoints. BMC Genomics 15 (Suppl 6), S3 (2014). https://doi.org/10.1186/1471-2164-15-S6-S3
DOI: https://doi.org/10.1186/1471-2164-15-S6-S3