Near-medians that avoid the corners; a combinatorial probability approach
© Larlee et al.; licensee BioMed Central Ltd. 2014
Published: 17 October 2014
The breakpoint median for a set of k ≥ 3 random genomes tends to approach (any) one of these genomes ("corners") as genome length increases, although there are diminishing proportion of medians equidistant from all k ("medians in the middle"). Algorithms are likely to miss the latter, and this has consequences for the general case where input genomes share some or many gene adjacencies, where the tendency for the median to be closer to one input genome may be an artifact of the corner tendency.
We present a simple sampling procedure for constructing a "near median" that represents a compromise among k random genomes and that has only a slightly greater breakpoint distance to all of them than the median does. We generalize to the realistic case where genomes share varying proportions of gene adjacencies. We present a supplementary sampling scheme that brings the constructed genome even closer to median status.
Our approach is of particular use in the phylogenetic context where medians are repeatedly calculated at ancestral nodes, and where the corner effect prevents different parts of the phylogeny from communicating with each other.
The "steinerization" approach to the small phylogeny problem, dating from 1976 in sequence space  and from 1997 in gene order space , is illustrated in Figure 1c. We decompose T into h overlapping median problems, each problem focusing on one of the h ancestral points and the three or more vertices of T that are adjacent to it. Suppose we have a way of solving the median problem. After initializing the ancestral points in T randomly or in some other way, the ancestral positions can be improved one at a time. For each ancestral node, we simply apply the median solution to the current positions of its immediate neighbors. At each step the new median is retained only if it has a lower score than the current value. The process is iterated until no further improvement is possible. The output, not necessarily unique and not necessarily optimal, where every ancestor vertex is at the median position of its adjacent vertices, is called a "steinerized" solution for the small phylogeny problem, after the well-known Steiner problem in Euclidean 2-space.
The value of the median as a prototype and as a component step for the construction of gene-order phylogenies has been undermined by simulations that show the median for a set of k ≥ 3 random input genomes tends, as genome length n increases, to coincide with any one of these k genomes themselves [3, 5]. The median thus reflects no gene-order information from any of the other k − 1 genomes: the "medians in the corner" effect. There are some medians that are "in the middle", containing information drawn from several, or all, the k genomes, but these become relatively rare as genome length increases . Because these observations come from simulations followed by application of complex optimization algorithm , we can derive no precise analytic solutions about the probabilities of different kinds of median. These observations hold for all kinds of genomes, signed, unsigned, with single or multiple (with a bounded number of chromosomes), circular or linear chromosomes, and for all kinds of genomic distance: e.g., breakpoint distance and double-cut-and-join distance.
How important these findings are, for sets of genomes that have substantial gene order commonalities, is not clear. Even for random genomes, not only do we observe some medians in the middle, but any fixed genome can be shown to be the median, or close to the median, of k ≥ 3 other genomes that are essentially random with respect to each other.
The corner tendency has serious implications for "small phylogeny" analyses using a steinerization strategy, where each ancestral node in a given unrooted tree is the median of its three neighbours. Here, an ancestral genome is determined by iterating a median algorithm over the tree, starting with arbitrary initial genomes. When medians tend to fall at or near corners, the iteration process cannot effectively transfer relevant genomic commonalities between remote branches of the tree. How can we avoid this pathology in a principled way?
In this paper, we propose a simple initial construction for a genome which includes gene-order information from all the k given genomes. These are "near medians" in a well-defined sense, and they approximate true medians as k increases, and as the common gene-order information between subsets of the genomes increase. Because the construction is based on a binomial sampling scheme, we can analytically derive quantitative predictions about all the results. A second sampling step, somewhat more complicated, then substantially improves the initial construction.
The basic construction
Consider three signed genomes, I, II and III, each consisting of one or more circular chromosomes, containing the same n genes and each containing n gene adjacencies. Since the genomes are signed, the genes have polarity, from the − end to the + end, or from + to −. Each adjacency is thus an unordered pair of the 2n gene ends, chosen from among possibilities. The analysis is essentially the same for linear, circular, unichromosomal or multichromosomal genomes; the effect of allowing a bounded number >1 of chromosomes would be o(n) as would be the differences between circular and linear models. We assume that the genomes are randomly ordered, which means, for all intents and purposes, that they share virtually no adjacencies; the expected number of shared adjacencies is actually 0.5, a constant, even for very large n.
Thus we can expect ample adjacencies in genome II from which to pick θn which do not conflict with any of those selected from genome I.
Similarly, having then selected θn pairs from each of genomes I and II, the expected proportion of pairs in genome III with two free ends is (1 − 2θ)2.
so . Setting produces a set of at most compatible adjacencies, drawn equally from genomes I, II and III. Compatibility simply means that no two adjacencies contain the same gene end. The value is attainable by fixing .
We can construct an additional compatible adjacencies to bring up to full genome status (containing linear and/or circular chromosomes), simply by using any (graph-theoretical) matching on the gene ends that do not occur in .
is called its score. With , the score of is ≤ 2.25.
The second step in the construction of may be done in such a way as to decrease its score below 2.25, but this depends on the input genomes as well as our initial sampling of θ adjacencies per genome, so we cannot make statistical predictions as easily as we can for the first step. The second step will be discussed in detail later in this paper.
It is remarkable that by blindly sampling adjacencies, with no view towards optimization, we can construct a genome which has a score of 2.25 or less, when a median would have a score of 2.0, not that much smaller. Moreover, represents each input genome equally, so is close to the "middle", whereas a median found through optimization is more likely to be in a "corner", especially as n increases. In addition, as we detail in the next subsection, we can obtain the probability distribution of properties of analytically. Finally and most important, this basic construction is the key to a number of other developments that we will detail later in this paper.
Statistical properties of the construction
taking into account the continuity correction (±0.5).
Probability of at least θn adjacencies with two free ends in genome III.
θ : 0.10
n = 5
The case of k >3 genomes
The root of this is θ = 0.18858, so that , leading to an having a score, the sum of the normalized distances to the four input genomes, of at most 3.24568, compared to the median score of 3.0 .
Approach of normalized sum of distances to median score, with increasing k.
k - 1
We have hitherto analyzed the case of genomes that are completely random with respect to each other. In practice this is of limited interest, but the techniques and bounds developed for this case are essential to more realistic situations where an input genome shares a non-trivial proportion of its adjacencies with some or all of the other input genomes.
Suppose all three genomes share ψn adjacencies. If we have selected, from each of two genomes containing the same n genes, θn + ψn adjacencies, the number of remaining adjacencies with two free ends in the third genome is (1 − 2θ − ψ)2n.
Approach to true median for ψ = 0.625 as k increases
true median value
Note that in this case, there may be some asymmetry among the three genomes.
Improving the set of compatible adjacencies
Consider our original construction where genomes I, II and III, with no adjacencies in common, each contribute adjacencies to make up . Clearly there must be gene ends, out of a total of 2n, not selected for inclusion in . We earlier suggested that any matching of these genes ends could be added to to make up a full genome. In fact this can be done in such a way as to decrease the score of below 2.25.
We revisit the selection of adjacencies from the three genomes. After selecting these from genome I, a proportion of the gene ends will remain free, namely ends. This means that in each of genomes II and III, of the adjacencies will have had two ends already selected, will have one free and one selected and will have two free ends. After selecting adjacencies from the to contribute to , we are left with with two free ends in genome II. These ends are individually available during the selection of adjacencies from genome III or, eventually, for the supplementary sampling.
At the same time, of the free ends left after the genome I selection, only ends still remain free after the genome II selection. In other words of the originally free genome ends still remain free. Of the adjacencies originally with two free ends in genome I, only still have two free ends. These are available for selection from genome III or, eventually, for the supplementary sampling.
Now consider the adjacencies left in genome III after selecting the from genome I. Of these only will have two free ends, and all of these must be contributed to .
adjacencies are available to be added to to construct with a better score than 2.25.
Some of these adjacencies may conflict, i.e., share an end, so the actual reduction in the score must be found by maximum matching and will differ substantially from one instance to another. Simulations show that the average number of compatible adjacencies is about . The score of , once additional adjacencies not in genomes I, II or III have been formed by matching the remaining free ends, thus completing the set of n adjacencies, has been reduced to around 2.13, which is closer to the median value of 2.0.
We have thus constructed something very close to the median in two stages, the first being the random sampling of compatible adjacencies from each of the three genomes. The second stage is the search, in all of the genomes, for residual adjacencies where neither end has been chosen in the first stage, as in Figure 2b.
In this paper, we started with a simple construction of a set of adjacencies drawing equally on three random genomes, in an effort to avoid degenerate medians consisting of one of the input medians. This construction produced a normalized sum of distances score of 2.25 instead of the median value of 2. (See reference  for a proof of the median value.) We then generalized this to k ≥ 3 input genomes, leading to better and better approximations to the median value of k − 1.
The results for k = 3 could then be generalized to the more realistic situation where the genomes have adjacencies in common.
Finally for the case of k = 3, we found a way of improving the construction to come within 6-7% of a median genome, containing roughly equal contributions from each of the inputs.
It should be noted that the basic construction we start out with is severely non-unique, since it is based on one random sample among possibilities. On the other hand, it does contain information about all three input genomes, rather than just one as in the case of a corner median. In realistic situation where the input genomes have some or many adjacencies in common, these are all captured (for k = 3, at least) and the basic construction only organizes the rest of the near-median.
The sampling we describe here runs in time linear with n, including the supplementary sampling as long as each gene end carries sufficient information. If a maximum matching algorithm is used at the end, this is theoretically O(n3), but in practice only a small number of gene ends are involved at this step. This contrasts with classical median solvers, e.g., , which are severely exponential in running time, especially for genomes that are highly rearranged with respect to each other.
We propose that near-medians may be more useful than corner medians in the phylogenetic context, lacking any technical capacity to detect medians in the middle (also non-unique) for large n without costly multiple runs of median solvers.
Research supported in part by grants from the Natural Sciences and Engineering Research Council of Canada. DS holds the Canada Research Chair in Mathematical Genomics.
The publication charges for this article were funded by the Canada Research Chair in Mathematical Genomics, and by the University of Ottawa.
This article has been published as part of BMC Genomics Volume 15 Supplement 6, 2014: Proceedings of the Twelfth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S6.
- Sankoff D, Cedergren RJ, Lapalme J: Frequency of insertion-deletion, transversion, and transition in the evolution of 5S ribosomal RNA. Journal of Molecular Evolution. 1976, 7: 133-149. 10.1007/BF01732471.PubMedView ArticleGoogle Scholar
- Sankoff D, Blanchette M: Multiple genome rearrangement and breakpoint phylogeny. Journal of Computational Biology. 1998, 5: 555-570. 10.1089/cmb.1998.5.555.PubMedView ArticleGoogle Scholar
- Haghighi M, Sankoff D: Medians seek the corners, and other conjectures. BMC Bioinformatics. 2012, 13: S19:S5-10.1186/1471-2105-13-195.View ArticleGoogle Scholar
- Boyd S, Haghighi M: A fast method for large-scale multichromosomal breakpoint median problems. Journal of Bioinformatics and Computational Biology. 2012, 10: 1240008-10.1142/S0219720012400082.PubMedView ArticleGoogle Scholar
- Jamshidpey Arash, Jamshidpey Aryo, Sankoff D: Sets of medians in the non-geodesic pseudometric space of unsigned genomes with breakpoints. BMC Genomics. 2014, 15 (Suppl 6): S3-10.1186/1471-2164-15-S6-S3.PubMedPubMed CentralView ArticleGoogle Scholar
- Xu AW: A fast and exact algorithm for the median of three problem: A graph decomposition approach. Journal of Computational Biology. 2009, 1369-1381. 16Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.