Inferring gene orders and gene content of ancestral genomes has a wide range of applications. High-level rearrangement events such as inversions, transpositions that change gene order are important because they are "rare genomic events" [1] and can be used to estimate ancestral genomes and infer the number of steps that evolve one genome into another.
The fundamental question in building phylogenies is how far apart two species are from each other. Hannenhalli and Penvzner [2] provided the first polynomial algorithm for computing inversion distance, which can be used as a representation of the evolution distance between species. Then by using a direct optimization approach based on median calculation, which is to optimize each ancestral node in terms of its three or more immediate neighbors, the phylogeny and ancestral genomes could be reconstructed. Current popular methods (such as GRAPPA [3], MGR [4] and their new improved versions) all use this approach and can infer ancestral gene orders with high accuracy. However these methods are extremely slow especially when the rearrangement rate is high. A new method based on the maximum likelihood approach is recently proposed [5]. In this method, the probabilities of all possible ancestral gene order are calculated based on the present species' gene order data and the ancestral gene orders are reconstructed to maximize the overall probability. This method is much faster than the direct optimization methods under high evolution rearrangement event rate, but the accuracy of the reconstructed gene order is lower than that of the direct optimization methods.
In this paper, we propose a mixture method to enhance the inference of ancestral gene orders. We first use the maximum likelihood method to reconstruct an initial ancestral genome. Then we randomly select a number of gene adjacencies from the ancestral genome and fix them to run the median calculation. By analyzing the results of each median calculation, we try to infer the correct gene adjacencies generated by the maximum likelihood method. Finally we fix these adjacencies and perform median calculation to get the ancestral genome. We have conducted extensive experiments on simulated datasets and the results show that this mixture approach is more accurate than the maximum likelihood method. Also our method is much faster than solely using the median calculation when the rearrangement rate is high. Using this hybrid approach, for those datasets that are previously too difficult for existing methods, we will be able to analyze them within a reasonable time frame with very high accuracy.
Genome rearrangements
Given a set S of n genes {1, 2, ⋯, n}, a genome can be represented by an ordering of these genes. A gene with a positive orientation is written as i, otherwise it is written as a -i. A genome can be linear or circular. A linear genome is a permutation on the gene set, while a circular genome can be represented in the same way under the implicit assumption that the permutation closes back on itself. Let G be the genome with signed ordering of 1, 2, ⋯, n. An inversion (also called reversal) between indices i and j (i ≤ j), transforms G to a new genome with ordering
There are some additional events for multiple-chromosome genomes, such as translocation (when the end of one chromosome switches with the end of another chromosome), fission (when one chromosome splits to form two) and fusion (when two chromosomes combine to become one).
We use a breakpoint graph [6] to represent the permutation with respect to the identity permutation. Given a genome with permutation π, let the breakpoint graph that corresponds to it be M (π). The vertex set V of M (π) is the collection of {2i - 1, 2i}, and i is any distinct gene of permutation π. Two genes i and j are said to be adjacent in genome G if i is immediately followed by j and can be represented by an edge (2i, 2j-1). If i and -j are adjacent, then it can be represented by an edge (2i, 2j). The edge set E of M (π) consists of all the adjacencies in π. For example, for a circular unichromosomal genome G2 = 1,4, 2, -3, 5, the vertex set is V = {1, 2, 3, 4, 5, 6, 7, 8, 9,10} and the edge set is E = {(10,1), (2, 7), (3, 8), (4, 6), (5, 9)}, as shown in Figure 1. The breakpoint graph extends naturally to a multiple breakpoint graph (MBG), representing a set of three or more genomes.
Inferring ancestral gene order by median calculation
One popular approach to reconstruct ancestral genome is through direct optimization and the task is to seek the tree with the minimum number of events, which is in spirit similar to the maximum parsimony approach used in sequence phylogeny. The core of these methods is to solve the median problem defined as follows: given a set of N genomes with permutations {π
i
}1≤i ≤ Nand a distance measure d, find a permutation π
M
that minimizes median score defined as .
Widely used methods like GRAPPA [3] and MGR [7] (as well as their new improved versions) both use median calculation to infer phylogenies and ancestral gene orders. For example, GRAPPA examines every possible tree topology and reports the one with the minimum events. For a given tree, it iteratively updates the gene order on each ancestral node in terms of its three neighbors through median calculation until the whole score of the tree reaches minimum. At this point the gene order obtained on each internal node corresponds to the inferred ancestral genome data and the total number of events is simply the summation of pairwise distances along all edges.
The median problem is known to be NP-hard [6] under most rearrangement distances proposed so far. Recently Xu proposed the concept of Adequate Subgraph [8] and applied this theory to create the ASMedian solver, which is the fastest median solver to date. The Multiple Breakpoint Graph (MBG) is used to model the median problem and an Adequate Subgraph is defined as a subgraph that has number of cycles larger or equal to 3m/2 (m is half the number of the vertices of the subgraph). Adequate subgraphs allow a decomposition of an MBG into smaller, more easily solved graphs. The median solution is the combination of the median solutions for the adequate subgraphs and the remaining MBG, so we can shrink the known median edges of the adequate subgraph and reduces the original MBG size.
Figure 2 shows an example of adequate subgraph decomposition. Given three circular genomes {1, 2, 3}, {1, 2, -3} and {1, 3, 2}, the MBG of these three genomes are shown on the left. Vertices (2, 3) composes an adequate subgraph according to the definition of adequate subgraph, so we can shrink (2, 3) from the MBG and get a reduced MBG as shown on the right. The final median solution is also shown in the figure and we can see that the edges of the median solution are within the reduced MBG.
By searching the existing adequate subgraphs up to a certain size and incorporate them into an exhaustive algorithm for the median problem, the ASMedian solver could significantly reduce the computation time of median calculation, though it still runs very slowly with high gene rearrangement rate. Given n the number of genes in the dataset and r the expected number of events per edge, when the ratio of r/n is larger than 25%, all direct optimization methods have great difficulty in finishing the analysis even after months of computation. As ASMedian (as well as other median algorithms) use a branch-and-bound approach, its performance relies on how to quickly prune branches. If we can fix some adjacencies before performing median calculation, the search space will be further decreased and the problem would be solved much faster.
Reconstruct the ancestral genomes based on maximum likelihood
Another new approach to reconstruct the ancestral genome is based on maximum likelihood. Ma proposed a probabilistic framework for inferring ancestral genomic orders with rearrangements including inversion, translocation and fusion [5]. When the phylogeny tree is given, the tree is re-rooted so that the ancestral node to be inferred becomes the root of the new tree. In this way, all the leave nodes can be taken into account for reconstructing the target ancestral node.
Because the gene order of any ancestral node is unknown, the probabilities of all possible gene adjacencies within the genome need to be calculated. Since each internal node of the phylogeny tree has two children and we assume that both children evolved from the parent node, the posterior probability of any gene adjacency within the parent node can be computed when treating the gene orders of the children as observed data. The calculation can be performed recursively from the bottom leaves level to the top level until the root is reached. At this point the probabilities of all the possible gene adjacencies within the target ancestral genome are computed. Finally an approximate algorithm is used to pick up and connect the adjacencies together to maximize the overall probability, and these selected adjacencies form the final inferred ancestral genome.