While phylogenetic studies in the pre-genome era primarily focused on DNA or protein sequence differences among organisms, informative comparisons can in fact be made at various organizational levels. Higher-level evolutionary events of relevance to phylogenetics include inversion, transposition, deletion, insertion and duplication. Phylogenetic analyses of whole genomes that model these types of events are proving to be extremely useful in elucidating the evolutionary relationships among organisms [1]. Since the pioneering papers of Sankoff [2], genome rearrangement data has attracted increasing attention from both biologists and computer scientists as a new type of data for phylogenetic analysis and comparative genomics.

During the past several years, computer scientists have made substantial progress in genome rearrangement research. With solutions for the inversion distance [3] and the inversion median [4], it became possible to estimate phylogenies and ancestral genomes based on inversions. The main software packages for reconstructing the inversion (or breakpoint) phylogeny are GRAPPA [5] and MGR [6]. Their basic optimization tool is an algorithm for computing the inversion (or breakpoint) median of three genomes.

Much of the research on genome rearrangement has focused on organellar genomes, such as mitochondrial [7] and chloroplast genomes [8]. GRAPPA and MGR have been applied successfully to chloroplast genomes in which inversion is the most important event. In other datasets (e.g., mitochondrial genomes), transpositions are viewed as more likely, although their relative preponderance with respect to inversions is unknown.

Existing methods can still be applied when transposition is the dominant event. For example, given the genome 1, 2,⋯, *n*, a transposition acts on three indices *i*, *j*, *k* (*i* ≤ *j* and *k* ∉ [*i*, *j*]) and, for *k* > *j*, results in the genome 1,⋯, (*i* - 1), (*j* + 1),⋯, *k*, *i*, (*i* + 1),⋯, *j*, (*k* + 1),⋯, *n*. The same genome can be obtained with three inversions: one acting on indices *i*, *k*, followed by one acting on indices *i*, *k* - *j* + *i* - 1 and another acting on indices *k* - *j* + *i*, *k*. Based on this observation, it is possible to estimate the transposition distance by inversions and use a distance-based method (such as neighbor-joining) to reconstruct the phylogeny. We can also apply GRAPPA or MGR to obtain the phylogeny, using either the breakpoint median solver or the inversion median solver. However, since the evolutionary model is mismatched, their performance on transposition datasets is questionable, as indicated by our experimental results shown in the next section. In this paper, we introduce a new method to solve the transposition median problem and use it to infer phylogenies and ancestral genomes from datasets where transposition is the only event. The new method (GRAPPA-TP) is an extension of GRAPPA and is freely available from http://phylo.cse.sc.edu/.
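The three-inversion observation can be checked mechanically. The following is a minimal sketch, assuming genomes are stored as Python lists of signed integers and indices are 1-based as in the text; the function names `invert`, `transpose` and `transpose_by_inversions` are illustrative and are not part of GRAPPA or MGR.

```python
def invert(genome, a, b):
    """Signed inversion of the segment at 1-based positions a..b (inclusive)."""
    segment = genome[a - 1:b]
    return genome[:a - 1] + [-g for g in reversed(segment)] + genome[b:]


def transpose(genome, i, j, k):
    """Move the block at positions i..j so that it follows position k (k > j)."""
    return genome[:i - 1] + genome[j:k] + genome[i - 1:j] + genome[k:]


def transpose_by_inversions(genome, i, j, k):
    """Same rearrangement, expressed as the three inversions from the text."""
    genome = invert(genome, i, k)              # flip the whole range i..k
    genome = invert(genome, i, k - j + i - 1)  # restore the block (j+1)..k
    return invert(genome, k - j + i, k)        # restore the block i..j


g = list(range(1, 9))
print(transpose(g, 2, 3, 6))                # [1, 4, 5, 6, 2, 3, 7, 8]
print(transpose_by_inversions(g, 2, 3, 6))  # [1, 4, 5, 6, 2, 3, 7, 8]
```

Each gene in the range is inverted exactly twice, so the signs are restored as well as the order.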

### Genome rearrangements

We represent a genome as a signed ordering of *n* genes, and each gene *i* is given an orientation that is either positive, written *i*, or negative, written -*i*. Genomes can evolve through events such as inversions, transpositions and transversions, as well as other events. When transposition is the only event, the sign of each gene is irrelevant and can be ignored. Let *G* be the genome with signed ordering 1, 2,⋯, *n*. An inversion (also called a reversal in some of the literature) between indices *i* and *j* (*i* ≤ *j*) transforms *G* into a new genome with linear ordering 1, 2,⋯, (*i* - 1), -*j*, -(*j* - 1),⋯, -*i*, (*j* + 1),⋯, *n*.

A *transposition* on genome *G* acts on three indices *i*, *j*, *k*, with *i* ≤ *j* and *k* ∉ [*i*, *j*], picking up the interval *i*, (*i* + 1),⋯, *j* and inserting it immediately after *k*. Thus genome *G* is replaced by (assuming *k* > *j*): 1,⋯, (*i* - 1), (*j* + 1),⋯, *k*, *i*, (*i* + 1),⋯, *j*, (*k* + 1),⋯, *n*.

A *transversion* is a transposition followed by an inversion of the transposed subsequence; it is also called an *inverted transposition*.
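The definitions of transposition and transversion can be sketched as a single function. This is an illustrative sketch only: it assumes a genome is a Python list of signed integers, uses 1-based indices as in the text, and handles both *k* > *j* and *k* < *i* (with *k* = 0 meaning "insert at the front", an assumption not stated in the text). The name `move_block` is hypothetical.

```python
def move_block(genome, i, j, k, inverted=False):
    """Pick up positions i..j and reinsert the block right after position k.

    With inverted=True the block is also reversed and negated, which gives
    a transversion (inverted transposition).
    """
    if not 1 <= i <= j <= len(genome) or not 0 <= k <= len(genome):
        raise ValueError("indices out of range")
    block = genome[i - 1:j]
    if inverted:
        block = [-g for g in reversed(block)]
    if k > j:
        return genome[:i - 1] + genome[j:k] + block + genome[k:]
    if k < i:
        return genome[:k] + block + genome[k:i - 1] + genome[j:]
    raise ValueError("k must lie outside [i, j]")


g = [1, 2, 3, 4, 5, 6]
print(move_block(g, 2, 3, 5))                 # [1, 4, 5, 2, 3, 6]
print(move_block(g, 4, 5, 1))                 # [1, 4, 5, 2, 3, 6]
print(move_block(g, 2, 3, 5, inverted=True))  # [1, 4, 5, -3, -2, 6]
```

The first two calls give the same result because a transposition amounts to exchanging two adjacent blocks, so it can be described by moving either block.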

There are additional events for multiple-chromosome genomes, such as *translocation* (the end of one chromosome is broken and attached to the end of another chromosome), *fission* (one chromosome splits and becomes two) and *fusion* (two chromosomes combine to become one).

### Distance computation

Given two genomes *G*₁ and *G*₂, we define the *edit distance* *d*(*G*₁, *G*₂) as the minimum number of events required to transform one genome into the other.

The *breakpoint distance* [2] is not a direct evolutionary distance measurement. A breakpoint in *G*₁ is defined as an ordered pair of genes (*i*, *j*) such that *i* and *j* are adjacent in *G*₁ but not in *G*₂. The breakpoint distance is simply the number of breakpoints in *G*₁ relative to *G*₂.
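Counting breakpoints is straightforward to sketch in code. The version below assumes linear, signed genomes over the same gene set, stored as Python lists, and treats the adjacency (*a*, *b*) as identical to (-*b*, -*a*), i.e. the same pair read from the opposite strand. It does not add end markers to the genomes; both choices are assumptions made for illustration, and other conventions exist.

```python
def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2 (signed, linear)."""
    adjacencies_g2 = set()
    for a, b in zip(g2, g2[1:]):
        adjacencies_g2.add((a, b))
        adjacencies_g2.add((-b, -a))  # same adjacency read in reverse
    return sum((a, b) not in adjacencies_g2 for a, b in zip(g1, g1[1:]))


identity = [1, 2, 3, 4, 5]
inverted = [1, -4, -3, -2, 5]   # identity after inverting positions 2..4
print(breakpoint_distance(inverted, identity))  # 2
```

A single inversion cuts the genome at two places, which is why the example reports two breakpoints.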

When only inversions are allowed, the edit distance is the *inversion distance*. Hannenhalli and Pevzner [3] developed a mathematical and computational framework for signed gene-orders and provided a polynomial-time algorithm to compute the edit distance between two signed gene-orders under inversions; Bader et al. [9] later showed that this edit distance can be computed in linear time. However, computing the inversion distance is NP-hard in the unsigned case [4].

The *transposition distance* is the minimum number of transpositions needed. Computing the transposition distance is of unknown complexity and after 10 years of research, the best available method is only a 1.375-approximation [10].
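For very small genomes the transposition distance can still be obtained exactly by exhaustive search. The sketch below is a plain breadth-first search over unsigned permutations; it is not the 1.375-approximation of [10], its running time grows factorially with the number of genes, and the function names are illustrative.

```python
from collections import deque


def transposition_neighbors(perm):
    """Yield every permutation reachable from perm by one transposition.

    A transposition exchanges the adjacent blocks perm[i:j] and perm[j:k].
    """
    n = len(perm)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n + 1):
                yield perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


def transposition_distance(source, target):
    """Exact transposition distance by breadth-first search (tiny inputs only)."""
    source, target = tuple(source), tuple(target)
    if sorted(source) != sorted(target):
        raise ValueError("genomes must contain the same genes")
    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return distance[current]
        for neighbor in transposition_neighbors(current):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    raise RuntimeError("unreachable: transpositions generate all permutations")


print(transposition_distance([1, 4, 5, 2, 3, 6], [1, 2, 3, 4, 5, 6]))  # 1
print(transposition_distance([3, 2, 1], [1, 2, 3]))                    # 2
```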

Yancopoulos et al. [11] proposed a "universal" double-cut-and-join (DCJ) operation that accounts for inversions, translocations, fissions and fusions, resulting in a new genomic distance that can be computed in linear time. A DCJ operation makes a pair of cuts and then reglues the cut ends, which can yield an inversion, a fission, a fusion, or a translocation. Combining two DCJ operations can create a block interchange and sometimes a transposition. Although there is no direct biological evidence for DCJ operations, they are very attractive because they provide a unifying model for genome rearrangement [12] and because the DCJ distance is simple to compute.
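To illustrate why the DCJ distance is simple to compute, consider the special case of two circular, single-chromosome genomes over the same *n* genes. In that case the DCJ literature gives the distance as *n* minus the number of cycles in the graph that links gene extremities glued together in either genome. The sketch below covers only this special case (linear chromosomes need additional path counting); the data layout and function names are assumptions made for illustration.

```python
def adjacency_partner(genome):
    """Map each gene extremity to the extremity it is glued to.

    The genome is a circular list of signed genes. Each gene g has a
    tail (g, 't') and a head (g, 'h'); a positive gene reads tail-to-head.
    """
    partner = {}
    n = len(genome)
    for idx in range(n):
        a, b = genome[idx], genome[(idx + 1) % n]
        right_end_of_a = (abs(a), 'h' if a > 0 else 't')
        left_end_of_b = (abs(b), 't' if b > 0 else 'h')
        partner[right_end_of_a] = left_end_of_b
        partner[left_end_of_b] = right_end_of_a
    return partner


def dcj_distance_circular(g1, g2):
    """DCJ distance n - (number of cycles) for two circular genomes."""
    if sorted(map(abs, g1)) != sorted(map(abs, g2)):
        raise ValueError("genomes must contain the same genes")
    p1, p2 = adjacency_partner(g1), adjacency_partner(g2)
    seen, cycles = set(), 0
    for start in p1:
        if start in seen:
            continue
        cycles += 1
        x = start
        while x not in seen:       # alternate g1 and g2 adjacencies
            seen.add(x)
            y = p1[x]
            seen.add(y)
            x = p2[y]
    return len(g1) - cycles


print(dcj_distance_circular([1, 2, 3, 4], [1, -3, -2, 4]))  # 1 (one inversion)
print(dcj_distance_circular([1, 2, 3, 4], [1, 3, 2, 4]))    # 2 (one transposition)
```

The second example agrees with the remark that a transposition takes two DCJ operations.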

### Median problem of three

The median problem on three genomes is to find a single genome that minimizes the sum of pairwise distances between itself and each of the three given genomes. This problem is computationally very hard even for the simplest breakpoint distance [13].
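The definition can be made concrete with a brute-force search that scores every candidate genome. The sketch below assumes unsigned, linear genomes and uses an unsigned breakpoint count as the distance; any other distance function could be passed in. It enumerates all *n*! permutations, so it is only usable for a handful of genes and is not how the solvers cited in this section work.

```python
from itertools import permutations


def unsigned_breakpoint_distance(g1, g2):
    """Adjacencies of g1 (as unordered pairs) that are missing from g2."""
    adjacencies_g2 = {frozenset(pair) for pair in zip(g2, g2[1:])}
    return sum(frozenset(pair) not in adjacencies_g2
               for pair in zip(g1, g1[1:]))


def brute_force_median(g1, g2, g3, distance=unsigned_breakpoint_distance):
    """Return (median, score) minimizing the summed distance to g1, g2, g3."""
    best, best_score = None, None
    for candidate in permutations(sorted(g1)):
        score = sum(distance(candidate, g) for g in (g1, g2, g3))
        if best_score is None or score < best_score:
            best, best_score = list(candidate), score
    return best, best_score


a = [1, 2, 3, 4, 5]
b = [1, 3, 2, 4, 5]
c = [1, 2, 4, 3, 5]
median, score = brute_force_median(a, b, c)
print(median, score)
```

The median need not be unique; the search simply keeps the first candidate that reaches the best score.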

The *breakpoint median* problem can be transformed into a special instance of the well-studied Traveling Salesperson Problem [2], and hence can be solved relatively efficiently. The *inversion median* problem is to find a median genome that minimizes the sum of inversion distances on the three edges. Two exact median solvers have been proposed, both using a branch-and-bound strategy. Caprara's solver [4] is based on an extension of the breakpoint graph, while the one developed by Siepel and Moret [14] runs a direct search. Using the inversion median has dramatically improved the accuracy of genome rearrangement analysis [15]. Two heuristic methods, MGR [6] and rEvoluzer [16], have also been proposed to speed up the inversion median computation, at the cost of accuracy. Zhang et al. later extended Caprara's inversion median solver so that it can handle the DCJ distance [17].

### Phylogenetic reconstruction from genome rearrangements

Reconstructing phylogenies from genome rearrangement data is computationally much harder than from sequence data. For example, finding the minimum number of evolutionary events on a fixed tree can be done in linear time if the leaves are labeled with DNA or protein sequences, whereas the same task for genome rearrangement data is NP-hard even when the tree has only three leaves.
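The linear-time case for sequence data is commonly handled with Fitch's algorithm, which makes one bottom-up pass over the tree per site. The sketch below is a minimal version under several assumptions: the tree is a rooted binary tree written as nested 2-tuples of leaf names, all sequences are aligned and of equal length, and every substitution costs one event. The names are illustrative.

```python
def fitch_site(tree, leaf_state):
    """Return (possible states at this node, minimum changes below it)."""
    if isinstance(tree, str):                 # a leaf: its state is fixed
        return {leaf_state[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, leaf_state)
    right_states, right_cost = fitch_site(right, leaf_state)
    common = left_states & right_states
    if common:                                # children can agree: no change
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, sequences):
    """Sum the per-site Fitch costs over all aligned positions."""
    length = len(next(iter(sequences.values())))
    total = 0
    for site in range(length):
        column = {name: seq[site] for name, seq in sequences.items()}
        total += fitch_site(tree, column)[1]
    return total


tree = (("a", "b"), ("c", "d"))
sequences = {"a": "AA", "b": "AG", "c": "GA", "d": "GG"}
print(parsimony_score(tree, sequences))  # 3
```

No comparably simple pass exists for gene orders, because the state at an internal node is itself a median problem.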

Methods for reconstructing trees based on genome rearrangement data include distance-based methods (for example, neighbor-joining [18]), maximum parsimony methods based on encodings [19, 20], and direct optimization methods. The last approach, pioneered by Sankoff and Blanchette [2] in their package BPAnalysis and improved by GRAPPA [5] and MGR [6], is the most accurate. Besides returning a phylogeny, these three packages can also give an estimate of ancestral gene orders, which is of great utility for biologists interested in the process of genome rearrangement.