 Research
 Open access
 Published:
Gene order alignment on trees with multiOrthoAlign
BMC Genomics volume 15, Article number: S5 (2014)
Abstract
We relate the comparison of gene orders to an alignment problem. Our evolutionary model accounts for both rearrangement and contentmodifying events. We present a heuristic based on dynamic programming for the inference of the median of three genomes and apply it in a phylogenetic framework. multiOrthoAlign is shown accurate on simulated and real datasets, and shown to significantly improve the runningtime of DupLoCut, an "almost" exact algorithm based on linear programming, developed recently for the same problem.
Introduction
A major requirement in comparative genomics is to be able to compare genomes based on their whole content. This is necessary for a myriad of applications such as phylogenetic reconstruction, orthology and paralogy identification, ancestral reconstruction and the study of evolutionary events. Consequently, a large variety of algorithms have been developed for the comparison of wholegenome sequences with partial or no information on gene annotation. Most of them are based on first identifying, in a pairwise alignment dotplot, local alignments (anchors, syntenies) with high similarity score, and then chaining them in a way maximizing an alignment score (cf. e.g. MUMmer [1], BLASTZ [2], LAGAN [3], DAGchainer [4], progressiveMauve [5]). Similarity scores are computed according to the local mutations (nucleotide substitutions and indels) inferred from the alignment. Other approaches compare genomes in terms of building block organization. Although a recently developed method does not require any preliminary information on gene families [6], most of them assume a full or partial annotation of genomes, or a previously established large coverage of genomes in terms of syntenic blocks. Given two genomes represented as ordered sequences of genes (or building blocks), the rearrangement approach consists in finding a sequence of global evolutionary events transforming one gene order to the other. Early work on genome rearrangement focused on sorting permutations (no duplicates) by rearrangements (inversions, translocations, transpositions) [7–9]. More recently, a variety of studies have considered the more difficult case of genomes with duplicates evolving through rearrangements, but also through content modifying operations such as duplications and losses (reviews in [10, 11]). Other modelfree approaches based on conserved synteny, with no assumption on the evolutionary mechanisms, have also been developed [6, 12–16].
In a recent set of papers [17–19] we related the comparison of two gene orders to an alignment problem: find an alignment between the two gene orders that can be interpreted by a minimum number of evolutionary events (rearrangements and contentmodifying operations). Although alignments are a priori simpler to handle than rearrangements, this problem has been shown NPhard for the duplicationloss model of evolution [17, 18, 20]. Exact exponentialtime algorithms based on linear programming [19, 20] and a polynomialtime heuristic based on dynamic programming [17] have been developed for this model. Recently [21], we developed OrthoAlign (alignment of orthologs), a timeefficient heuristic for the gene order alignment problem, that extends the dynamic programming approach to a model including rearrangements (inversions and transpositions).
Sequence and gene order alignments are useful for ancestral inference purposes. As explained in [19], a "labeled" pairwise gene order alignment can be translated into a common ancestor and an evolutionary scenario leading to the two compared gene orders. Such an alignment approach for ancestral inference is relevant if the two gene orders reflect enough conservation so that we can assume that only few events have occurred since the divergence of the lowest common ancestor of the two genomes. For such closely related species, events can be assumed to be nonoverlapping (each gene involved in at most one event) and thus still visible in the alignment. The geneorder alignment approach has been shown useful to decipher the evolutionary mechanisms that have shaped the tRNA gene repertoires of the bacterium Bacillus [19].
Here, we undertake the next step, which is using the alignment approach on a phylogeny: infer ancestral genomes identified with each speciation node of a phylogenetic tree. The alignment on a tree problem introduced by Sankoff et al. in [22], consists in finding assignments of internal nodes in a way minimizing the total branch length of the tree according to a given distance. The result is, not only a set of ancestral genomes, but also a multiple alignment for extant sequences. As trying all possibilities for internal node assignments is intractable, iterative heuristics on subtrees are usually considered, the most popular being the medianbased heuristic [10, 23]: (1) find an initial assignment for internal nodes; (2) in a postorder traverse of the tree, improve the assignment of each internal node u by considering the median of the leafassignments of the 3star tree centered on u, i.e., the tree formed by the three neighbouring nodes of u; (3) repeat until no improvement on the tree distance can be made. In the case of genomes represented as gene orders, applying the exact 2SPP (2Small Phylogeny Problem) algorithm [19] or OrthoAlign [21] to the cherries of the phylogeny can be used for an initial assignment. As for the iterative step, an efficient algorithm for the median problem has to be found. Although NPhard for most versions of the problem [24–26], efficient heuristics have been developed for various nucleotide and rearrangement distances. As for the duplicationloss model of evolution, DupLoCut, an "almost" exact algorithm based on linear programming has been presented in [20].
In this paper, we present multiOrthoAlign for the alignment of a set of gene orders related through a phylogenetic tree. It is based on a dynamic programming approach generalizing OrthoAlign [17, 21] to a 3star tree, under a model involving a wide range of evolutionary events. multiOrthoAlign is compared with DupLoCut [20], the most closely related algorithm. Experiments on simulated and real datasets reveal similar accuracy for both algorithms, but with a significant improvement in running time for multiOrthoAlign.
Method
We consider unichromosomal genomes represented as strings of signed characters from an alphabet Σ, where each character represents a gene family. Each character may appear many times in a genome G, all such positions corresponding to genes belonging to the given gene family. The sign of a gene represents its transcriptional orientation. Let X = x_{1}x_{2} · · · x_{ n } be a string. We call the reverse of X the string −X = −x_{ n } · · · − x_{2} − x_{1}. We denote by X[i, i + k] the substring of X formed by the consecutive genes of the interval [i, i + k].
A phylogeny or species tree \mathcal{S} for a set Γ of genomes is a tree with a onetoone correspondence between the leaves of \mathcal{S} and the species of Γ, reflecting the evolution of the genomes through speciation. Although the method developed in this paper does not require any assumption on the species tree, for ease of presentation, we consider binary and rooted phylogenies. An internal node of \mathcal{S} corresponds to a speciation event and an assignment for that node corresponds to the genome at the moment of speciation. A phylogenetic alignment \mathcal{S} for \mathcal{S} is the tree \mathcal{S} augmented with an assignment of one string for each internal node of \mathcal{S}. When no ambiguity, we will make no difference between a node and its assignment. Two nodes are related if they belong to the same path from a leaf of \mathcal{S} to the root, and unrelated otherwise. For two related nodes A ≠ X, A is an ancestor of X if A is closer to the root of \mathcal{S} than X. For two unrelated nodes X ≠ Y, they are siblings if they share the same parental node. A pair of siblings is called a cherry. Moreover, we call a 3star of \mathcal{S} and we denote by AXY a startree with three leaves A, X, Y such that X and Y are two siblings in \mathcal{S} and A is the immediate ancestor of the parent M of X and Y. M is called the center of AXY.
The evolutionary model
We assume that presentdays genomes have evolved from an ancestral genome through rearrangement and contentmodifying events, each event (operation) acting on a unichromosomal genome X and leading to a new unichromosomal genome Y. An operation is denoted by O(k) = (O^{S}, O^{T}), where O is the operation type, k is the length of the operation, O^{S} is the source, i.e., the substring affected by the event and O^{T} is the target, i.e., the new substring resulting from the event. Characters of O^{S} and O^{T} are said to be covered by the operation. The mostly considered contentmodifying operations are duplications and losses, where:

A Duplication D(k) = (D^{S}= X[i, i + k − 1], D^{T}= Y [j, j + k − 1]), where Y[j, j + k − 1] = X[i, i + k − 1], is an operation that copies the substring X[i, i + k − 1] of size k to a location j outside the interval [i, i + k − 1] (i.e. preceding i or following i + k − 1);

A Loss L(k) = (X[i, i + k − 1], ∅) (∅ for empty string) is an operation that removes the substring X[i, i + k − 1] from genome X.
The mostly considered unichromosomal rearrangements are reversals and transpositions, where:

A Reversal (or inversion) R(k) = (X[i, i + k − 1], Y [i, i + k − 1]), where Y [i, i + k − 1] = −X[i, i + k − 1], is an operation that transforms the substring X[i, i + k − 1] into its reverse;

A Transposition T(k) = (X[i, i + k − 1], Y [j, j + k − 1]), where Y[j, j + k − 1] = X[i, i + k − 1], is an operation that moves the substring X[i, i + k − 1] to another position j outside the interval [i, i + k − 1].
Denote by \mathcal{O} the set of operation types. We will describe our approach for \mathcal{O}=\left\{D,L,R,T\right\}. Including other events, such as inverted duplications or inverted transpositions with target being the reverse of the source, insertions which are the counterparts of losses, or substitutions replacing a string with another of the same size, do not add any complexity to the problem. Notice however that the more operations we include to the model, the more challenging is the problem of assigning appropriate operations costs.
Let \mathcal{S} be a phylogeny and X, Y be two nodes of \mathcal{S}. If X and Y are related, say X is an ancestor of Y, then a history O_{X→Y}for X and Y is a sequence of events (possibly of length 0) transforming X into Y. Otherwise, if X and Y are unrelated, then a history for X and Y is a triplet (A, O_{A→X}, O_{A→Y}), where A is an assignment of the lowestcommon ancestral node of X and Y. We call a visible history for X and Y a history where the source and target of each operation is a substring of X or Y.
Finally, let AXY be a 3star of \mathcal{S}. A history for AXY is a quadruplet (M, O_{A→M}, O_{ M } _{→X}, O_{ M } _{→Y}) where M is an assignment of the center of the 3star. A visible history for AXY is a history where the source and target of each operation is a substring of A, X or Y.
Notice that a duplication with source and target in two different genomes can be interpreted as a duplication followed by the loss of the source (a relaxation of visibility), or alternatively as a transposition, or even as a horizontal gene transfer between the two considered genomes. We will take this general view of a duplication, which implicitly integrates transpositions.
Genome alignment
We begin by recalling the classical notion of an alignment of strings (genomes) Γ = {X_{ k } : 1 ≤ k ≤ γ}. Let Σ^{−} = Σ ∪ {−} be the alphabet Σ augmented with an additional character '' called a gap. Then an alignment for Γ is a set \overline{\text{\Gamma}}=\left\{\overline{{X}_{k}}:1\le k\le \gamma \right\} of strings obtained by filling X_{ k } with gaps, such that the resulting aligned genomes have equal length λ, and for each position i, 1 ≤ i ≤ λ, the column i is not empty in the sense that at least one of \overline{{X}_{k}}\left[i\right], for 1 ≤ k ≤ γ, is not a gap. The induced alignment for a subset Γ' ⊂ Γ is the alignment Γ' obtained by removing from \overline{\text{\Gamma}} all genomes that are not in Γ' and all empty columns. Given a pair \left(\overline{{X}_{l}}\left[i\right],\overline{{X}_{m}}\left[i\right]\right) of aligned characters, it is a match if \overline{{X}_{l}}\left[i\right]=\overline{{X}_{m}}\left[i\right]\in \text{\Sigma}, a mismatch if \overline{{X}_{l}}\left[i\right]\ne \overline{{X}_{m}}\left[i\right] both being in Σ and a gap if \overline{{X}_{l}}\left[i\right]\in \text{\Sigma} and \overline{{X}_{m}}\left[i\right]{=}^{\prime}{}^{\prime}.
A multiple alignment is expected to reflect the evolutionary events that have led to the presentday genomes. The notion of an alignment labeling has been introduced in [19] for a pairwise alignment. It relates each column of the alignment to a given operation. Generalization to an arbitrary number of genomes is given bellow. We will make use of this definition later in the context of a 3star history.
Definition 1 Let \overline{\text{\Gamma}}=\left\{\overline{{X}_{k}}:1\le k\le \gamma \right\}be an alignment of length λ. A labeling \mathcal{L}\left(\overline{\text{\Gamma}}\right) for \overline{\text{\Gamma}}is a set of operations covering the characters of the given sequences. For any l and m in [1, γ] with l ≠ m and any i, 1 ≤ i ≤ λ, such that Xl\left[i\right]{\ne}^{\prime}{}^{\prime},
(X_{ l }[i], X_{ m }[i]) is covered by at most one operation of \overline{\text{\Gamma}} as follows:

if a match, then it is covered by no operation;

if a mismatch, then it is covered by a reversal;

if a gap, then it is covered by one of the other operations of \mathcal{O}.
with the restriction that, if the two genomes are related, say X _{ l } is an ancestor of X _{ m } , then the source of the operation should be in X _{ l } and the target should be in X _{ m } .
A labeled alignment is an alignment \overline{\text{\Gamma}} accompanied with a labeling \mathcal{L}\left(\overline{\text{\Gamma}}\right). We simply refer to a labeled alignment by its labeling \mathcal{L}\left(\overline{\text{\Gamma}}\right). The cost of a labeled alignment is the sum of costs of all its labeling events.
The above definition does not ensure a valid interpretation of a labeled alignment in terms of an evolutionary history (A, O_{A→X}, O_{A→Y}) for two genomes X and Y. We showed in [19] that a pairwise labeled alignment is valid if and only if it is free from cycles, where cycles are defined as follows.
Definition 2 Let \mathcal{O}be a set of operations. It induces a cycle if there is a permutation O_{1}, O_{2}, · · · O_{ h } of \mathcal{O} events such that the substrings {O}_{p}^{T} and {O}_{p+1}^{S} overlap (a suffix of {O}_{p}^{T} is a prefix of {O}_{p+1}^{S}), for each 1 ≤ p ≤ h − 1, and the substrings {O}_{h}^{T} and {O}_{1}^{S} overlap.
A feasible labeled alignment is a labeled alignment with no cycles. We showed in [19] the onetoone correspondence between feasible labeled alignments and visible histories for two genomes X and Y in case of an evolution through duplications and losses.
Phylogenetic alignment
Let \mathcal{S} be a species tree for a genome set Γ. Call a feasible labeled phylogenetic alignment for \mathcal{S} a phylogenetic alignment \overline{\mathcal{S}} accompanied with a feasible labeled alignment for each cherry (X, Y) of \overline{\mathcal{S}}, in other words a visible history (A, O_{A→X}, O_{A→Y}) for each (X, Y). Such a feasible labeled phylogenetic alignment leads to a multiple alignment for Γ: traverse \overline{\mathcal{S}} in postorder and iteratively incorporate alignments of cherries in a current multiple alignment which is initially empty.
Let A and X be two genomes of \mathcal{S} with A being an ancestor of X and let O_{A→X}= {O_{1}(k_{1}), · · · O_{ m }(k_{ m })} be a history for A and X. The cost of O_{A→X}is defined as C\left({O}_{A\to X}\right)={\sum}_{i=1}^{m}c\left({O}_{i}\left({k}_{i}\right)\right), where c(O_{ i }(k_{ i })) is the cost of the operation O_{ i }(k_{ i }). Let {\mathcal{O}}_{A\to X} be the set of all possible histories transforming A into X. We define C\left(A\to X\right)={\displaystyle {\mathsf{\text{min}}}_{{O}_{A\to X}\in {\mathcal{O}}_{A\to X}}}C\left({O}_{A\to X}\right). Now, the phylogenetic alignment problem, is to infer a feasible labeled phylogenetic alignment for \mathcal{S} minimizing the sum of costs of all branches of \mathcal{S}.
The relaxed phylogenetic alignment problem with no restriction on visibility, i.e. the problem of assigning ancestral configurations leading to a minimum cost for the tree, has been shown to be NPhard for most formulations in terms of type of genomes and different distances. A classical heuristic strategy is known as the steinerization approach [23]. It begins with an initial assignment for the internal nodes of \mathcal{S}, and in a postorder traversal it improves each internal node assignment by solving a 3star problem defined as follows.
3star Problem:
INPUT: A 3star phylogeny AXY.
OUTPUT: A visible history (M, O_{A→M}, O_{M→X}, O_{M→Y}) for AXY minimizing the cost:
In the case of symmetrical operations, such as nucleotide substitutions or indels, or gene order rearrangements, the direction of evolution can be ignored, which leads to the median problem: find M minimizing C(M, A) + C(M, X) + C(M, Y). However, this is not the case for contentmodifying operations, as for example a duplication from A to X is rather a loss from X to A, and therefore the evolutionary direction cannot be ignored in this case.
For the evolutionary model of interest, the restriction of the phylogenetic alignment problem to a cherry has been considered in [17, 19]. The developed algorithm can be used for the initialization step: traverse the tree in a depthfirst manner and compute successive ancestors of pairs of nodes. Here, we extend our study to a 3star phylogeny, which allows for the application of the aforementioned steinerization approach. Notice that the phylogenetic alignment problem has been shown NPcomplete for the duplicationloss model of evolution, already for two species [20, 17, 18].
The 3star Problem
We first show that the 3star problem for a 3star AXY reduces to finding a feasible labeled alignment for {A, X, Y} of minimum cost. It is easy to see that any visible history for AXY leads to a unique feasible labeled alignment for {A, X, Y}. Conversely, let \mathcal{L}\left(\u0100,\overline{X},\u0232\right) be a feasible labeled alignment for a 3star AXY. A corresponding visible history for AXY can be obtained as follows (see Figure 1 for an example):

Define (M, O_{M→X}, O_{M→Y}) as the visible history corresponding to the induced feasible labeled alignment for X and Y.

Consider the alignment \left(\u0100,\overline{M}\right), where \overline{M} is the aligned genome M corresponding to the above history.

Define \mathcal{L}\left(\u0100,\overline{M}\right) as follows. For each i such that \left(\u0100\left[i\right],\overline{M}\left[i\right]\right) is not a match:

If \overline{X}\left[i\right]=\u0232\left[i\right] then include in \mathcal{L}\left(\u0100,\overline{M}\right) the operation of \mathcal{L}\left(\u0100,\overline{X},\u0232\right) covering the column \left(\u0100\left[i\right],\overline{X}\left[i\right]\right) (or alternatively \left(\u0100\left[i\right],\u0232\left[i\right]\right)).

Otherwise \overline{M}\left[i\right] should be equal to \overline{X}\left[i\right] or \u0232\left[i\right]. Assume w.l.o.g. that \overline{M}\left[i\right]=\overline{X}\left[i\right]. Then include in \mathcal{L}\left(\u0100,\overline{M}\right) the operation of \mathcal{L}\left(\u0100,\overline{X},\u0232\right) covering the column \left(\u0100\left[i\right],\overline{X}\left[i\right]\right).

Therefore, given a 3star AXY, we focus here on the problem of finding a feasible labeled alignment for {A, X, Y} of minimum cost.
Let C(i, j, k) (C^{f}(i, j, k) respectively) be the minimum cost of a labeled (feasible labeled respectively) alignment of three prefixes A[1, i], X[1, j] and Y [1, k] of A, X and Y, for all 1 ≤ i ≤ A, 1 ≤ j ≤ X and 1 ≤ k ≤ Y . Step 1 described bellow gives a heuristic for computing C(i, j, k) and Step 2 a heuristic for computing

STEP 1. FINDING A LABELED ALIGNMENT BY A DYNAMIC PROGRAMMING APPROACH.
As explained previously, transpositions are implicitly considered by allowing the source and target of a duplication to belong to two different genomes. Therefore,
we will restrict our presentation to the model \mathcal{O}=\left\{D,L,R\right\}.
To compute C(i, j, k), we consider all the possibilities for the last column of an alignment of the three prefixes A[1, i], X[1, j] and Y[1, k] and interpret it by the minimum number of operations. In the following, a column is represented as a triplet of characters from Σ^{−}, were different letters denote different characters of Σ. Clearly, each column can be interpreted by no more than 2 operations. If two operations are required to interpret a given column, then we assume them to be of the same size. This eliminates the case of a column of the form [a, x, y], as this would require two reversals of different sizes.
C(i, j, k) is the minimum over all the computed costs.
1 [a, a, a]: All matches.
2 [a, x, x]: Reversal in both X and Y (i.e. in M).
where E is the set {e_{1}, e_{2}, . . . , e_{ l }} of maximum cardinality such that A[i−e_{ p }, i] is the reverse of both X[j − e_{ p }, j] and Y [k − e_{ p }, k] for all 1 ≤ p ≤ l.
3 [a, x, a]: Reversal in X. (The case [a, a, y] is treated similarly)
where E is the set {e_{1}, e_{2}, . . . , e_{ l }} of maximum cardinality such that A[i − e_{ p }, i] = Y [k − e_{ p }, k] and A[i−e_{ p }, i] is the reverse of X[j − e_{ p }, j] for all 1 ≤ p ≤ l.
4 [−, x, x]: Duplication in both X and Y (i.e. in M)
where l is the largest value such that X[j − l, j] = Y [k − l, k] and X[j − l, j] has an occurrence in A.
5 [a, x, −]: Reversal in both X and Y, and loss in Y. (The case [a, −, y] is treated similarly)
where E is the set {e_{1}, e_{2}, . . . , e_{ l }} of maximum cardinality such that A[i−e_{ p }, i] is the reverse of X[j − e_{ p }, j] for all 1 ≤ p ≤ l.
6 [−, x, y]: Duplication in both X and Y, and reversal in Y.
where E is the set {e_{1}, e_{2}, . . . , e_{ l }} of maximum cardinality such that X[j−e_{ p }, j] is the reverse of Y [k − e_{ p }, k] for all 1 ≤ p ≤ l and X[j − e_{ p }, j] has an occurrence in A.
(similar formulae for DR_{ Y } _{ /X } (i, j, k))
7 [a, −, a]: Loss in X. (The case [a, a, −] is treated similarly)
where A[i − l, i] is the longest suffixe of A[1, i] such that A[i − l, i] = Y [k − l, k].
8 [a, −, −]: Loss in both X and Y.
L_{ XY }(i, j, k) = min_{0≤m≤i−1}(C[m, j, k] + c(L(i − m)))
9 [−, x, −]: Duplication in X. (The case [−, −, y] is treated similarly)
where l is the largest value such that X[j − l, j] has an occurrence in A, X or Y.
After computing all the values leading to C(A, X, Y ), the labeled alignment \mathcal{L}\left(\u0100,\overline{X},\u0232\right) obtained by a backtracking approach is not necessarily a feasible alignment as it may contain cycles. Notice that, since A is an ancestor of both X and Y, the target of an event cannot belong to A. Therefore only events with source and target in X or Y may belong to a cycle.

STEP 2. RESOLVING CYCLES.
Let {\mathcal{O}}_{c}=\left\{{O}_{1},{O}_{2},\dots ,{O}_{h}\right\} be a cycle of a labeled alignment \mathcal{L}\left(\u0100,\overline{X},\u0232\right) output by the above algorithm.
Lemma 1 Any event of {\mathcal{O}}_{c} is a duplication event.
Proof: Suppose the contrary and let O_{ p } be an event which is not a duplication. Then, by definition, the target {O}_{p}^{T} of O_{ p } overlaps the source of {O}_{p+1}^{S} of O_{p+1}. Clearly, O_{ p } cannot be a loss as otherwise {O}_{p}^{T} is empty and cannot have a nonempty intersection with {O}_{p+1}^{S}. Therefore O_{ p } should be a reversal. Assume w.l.o.g. that {O}_{p}^{T} is in Y and let Y[q] be an element of both {O}_{p}^{T} and {O}_{p+1}^{S}. Let X[r] be the character of X aligned with Y[r] in \mathcal{L}\left(\u0100,\overline{X},\u0232\right). Then X[r] should be in the source of O_{ p } and in the target of O_{p+1}. But this leads to an interpretation of the corresponding column of \mathcal{L}\left(\u0100,\overline{X},\u0232\right) with two events instead of one, which is in contradiction with the recurrences leading to a minimum number of events for each column. □
We resolve cycles as follows. Let \mathcal{Z} be the set of all overlapping strings {Z_{1}, Z_{2}, . . . , Z_{ h }} of {\mathcal{O}}_{c}. Let {\epsilon}_{i}=\left\{{z}_{{i}_{1}},{z}_{{i}_{2}},\cdots {z}_{{i}_{l}}\right\} be a set of substrings of Z_{i(1≤i≤h}) of minimum cardinality such that {z}_{{i}_{1}}{z}_{{i}_{1}}\cdots {z}_{{i}_{l}}={Z}_{i} and {z}_{{i}_{k}\left(1\le k\le l\right)} has an occurrence in A. Let Z_{ t } be the string for which \left{\mathcal{E}}_{t}\right=min\left(\left{\mathcal{E}}_{1}\right,\left{\mathcal{E}}_{2}\right,\dots \left{\mathcal{E}}_{h}\right\right). Assume w.l.o.g. that Z_{ t } is a substring of X. Then Z_{ t } in \mathcal{L}\left(\overline{X},\u0232\right) is covered by a loss in Y, and each substring of Z_{ t } in \mathcal{L}\left(\u0100,\overline{X}\right) is covered by a duplication in X (source in A) (see Figure 2 for details).
Complexity: For simplicity, assume that A = X = Y = n. From the recurrences detailed above, each C(i, j, k) can be computed in linear time, leading to an O(n^{4}) worsttime complexity for Step 1. Now, the complexity of Step 2 depends on the complexity for finding all cycles and resolving them. As cycles can only involve strings from X and Y, the problem reduces to the case of cycleresolution for a pairwise alignment, which has been shown quadratic (submitted journal version of [17]). This leads to a worsttime complexity of O(n^{4}) for the whole algorithm.
Experimental results
We call multiOrthoAlign our algorithm for the phylogenetic alignment problem based on the steinerization approach described in Section and using our 3star algorithm for the iteration step.
In this section, we compare multiOrthoAlign with DupLoCut [20], on simulated and realworld instances. DupLoCut is an "almost" exact heuristic based on linear programming. For the sake of comparison with DupLoCut [20], we consider a model restricted to duplications and single gene losses. Indeed, DupLoCut is restricted to this evolutionary model. Moreover, we consider the default cost of one for each event.
Simulations
We generate phylogenetic trees with 3 extant genomes. The genome at the root is generated in 2 steps. First, a random sequence R of length n on an alphabet of size σ is generated. Then, l moves (duplications and single gene losses) are applied to R where duplication length follows the geometric distribution of parameter 0.5. All other genomes along the tree are generated by applying l moves to their direct ancestor.
Execution time: We compare the runningtime of our 3star algorithm with that used in DupLoCut for the reoptimization steps. Running times were recorded using a 8core Intel(R) 3.6 GHZ processor, with 16 GiB of memory. Table 1 gives average running times after one round (iteration) of reoptimization for simulations generated with three choices of parameters n, σ and l. Although multiOrthoAlign's running time increases slightly with increasing values of n, σ and l, it is still within a few minutes for n = 250. In comparison, the same data took more than 6 hours to be processed by DupLoCut.
Accuracy: In order to test the performance of multiOrthoAlign in terms of accuracy, we used two measures: Error=\frac{InfOpt}{Inf} where Inf is the number of events inferred by multiOrthoAlign and Opt is the "almost optimal" number of events obtained by running DupLoCut; Accuracy=\frac{NbOpt}{Total}, where N bOpt is the number of simulations among Total (number of all simulations) for which multiOrthoAlign returns the same number of events as DupLoCut.
The same algorithm (2SPP [19]) was used for the initialization step of both multiOrthoAlign and DupLoCut. Figure 3 gives results for different choices of the parameter l. With ratios σ/n = 1/2 and l/n = 1/20, multiOrthoAlign returns the same cost as DupLoCut for more than 96% of the simulations. This accuracy rate remains stable for decreasing alphabet size (results not shown), i.e., for increasing number of gene copies, but decreases quickly as the number l of moves increases (left diagram of Figure 3). However, the average Error remains lower than 0.008 (right diagram of Figure 3).
In order to test the algorithm on larger trees, we generated a phylogenetic tree with 100 extant genomes. The genomes along the tree were generated as described above for triplet phylogenies, with parameters n = 100, σ = 50 and l = 5. Figure 4 illustrates the total cost of the tree (number of duplication/single gene loss events) obtained after each iteration of multiOrthoAlign (blue line) and DupLoCut (red line). After the initialization step (iteration 0), the total cost obtained by multiOrthoAlign is 1632. After 6 rounds of reoptimization, the two programs converge to a local minimum (no improvement can be made), with a total cost of 1100 for multiOrthoAlign and of 1124 for DupLoCut. Our cost is always slightly better in this case. Notice that, although DupLoCut is "almost" exact for the median problem, the whole steinerization procedure does not guarantee any optimality result.
Using OrthoAlign instead of the 2SPP algorithm for the initialization step would be something natural to do for reducing the running time of the whole procedure. However, as illustrated in Figure 4 (green line), the initial assignment obtained with OrthoAlign in this case leads to a cost of 1930 which is far from the best solution found. Notice that 2SPP is an exact algorithm for pairwise alignment and OrthoAlign is a heuristic which does not guarantee the optimal result. multiOrthoAlign converge to a local minimum of 1401 events after 4 rounds of reoptimization.
Real data
We also compared the two approaches on the set of realworld instances used in [20]. The set contains the stable RNA genes of 12 Bacillus strains of four species (amyloliquefaciens, subtilis, thuringiensis, and cereus). The phylogeny shown in Figure 5 is taken from the webpage (http://ccb.jhu.edu/software/duplocut).
Using 2SPP for the initialization step, multiOrthoAlign leads to a cost of 136 after the initialization step, and converges to a local minimum of 123 events after 2 rounds of reoptimization. As for DupLoCut, it converges to a local minimum of 120 events after 5 rounds of reoptimization. However, using OrthoAlign instead of 2SPP for the initialization step, multiOrthoAlign leads to a cost of 131 after the initialization step, which is not refined by subsequent iterations. It therefore appears that 2SPP is a more appropriate initialization procedure than OrthoAlign.
Conclusion
We have developed multiOrthoAlign, a phylogenetic alignment algorithm for a genomewide evolutionary model involving duplications, losses and rearrangements. It uses a generalization of OrthoAlign [21], a recently developed pairwise alignment algorithm, to the median of three genomes. Our algorithm for the median problem is a heuristic that does not guarantee any optimality result. Compared with DupLoCut, the most closely related existing algorithm, multiOrthoAlign exhibits similar results but is much faster. The method can be easily extended to other contentmodifying and rearrangement operations such a substitutions, insertions, tandem duplications or inverted duplications. However, the more operations we add, the more challenging is the problem of finding appropriate costs for operations, and appropriate criteria to deal with the nonuniqueness of solutions.
References
Delcher A, Kasif S, Fleischmann R, Peterson J, White O, Salzberg S: Alignment of Whole Genomes. Nucleic Acids Research. 1999, 27 (11): 23692376. 10.1093/nar/27.11.2369.
Schwartz S, Kent W, Smit A, Zhang Z, Baertsch R, Hardison R, Haussler D, Miller W: Humanmouse alignments with BLASTZ. Genome research. 2003, 13: 103107. 10.1101/gr.809403.
Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S: LAGAN and MultiLAGAN: efficient tools for largescale multiple alignment of genomic DNA. Genome research. 2003, 13 (4):
Haas B, Delcher A, Wortman J, Salzberg S: DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics. 2004, 20 (18): 36433646. 10.1093/bioinformatics/bth397.
Darling A, Mau B, Perna N: progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement. PLoS ONE. 2010, 5 (6):
Braga M, Chauve C, Doerr D, Jahn K, Stoye J, Thévenin A, Wittler R: Models and algorithms for genome evolution. Springer 2013 chap. The potential of familyfree genome comparison
Hannenhalli S, Pevzner PA: Transforming men into mice (polynomial algorithm for genomic distance problem). Proceedings of the IEEE 36th Annual Symposium on Foundations of Computer Science. 1995, 581592.
Hannenhalli S, Pevzner PA: Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). Journal of the ACM. 1999, 48: 127.
Kaplan H, Shamir R, Tarjan RE: A faster and simpler algorithm for sorting signed permutations by reversals. SIAM Journal on Computing. 2000, 29: 880892. 10.1137/S0097539798334207.
ElMabrouk N, Sankoff D: Analysis of Gene Order Evolution beyond SingleCopy Genes. Springer (Humana), 397429. Volume 855 of Methods in Mol. Biol. 2012 chap. Part II, Evolutionary Genomics: statistical and computational methods
Fertin G, Labarre A, Rusu I, Tannier E, Vialette S: Combinatorics of genome rearrangements. 2009, Cambridge, Massachusetts and London, England: The MIT Press
Bergeron A, Chauve C, Gingras Y: Formal models of gene clusters. Bioinformatics algorithms: techniques and applications. Edited by: Mandoiu I, Zelikovsky A. 2008, Wiley
Durand D, Sankoff D: Testing for gene clusters. Journal of Computational Biology. 2003, 10: 453482. 10.1089/10665270360688129.
Bergeron A, Stoye J: On the similarity of sets of permutations and its applications to genome comparison. Journal of Computational Biology. 2003, 13: 13401354.
Heber S, Stoye J: Finding all common intervals of k permutations. Combinatorial Pattern Matching 12th Annual Symposium, Volume 2089 of Lecture Notes in Computer Science. Edited by: Amir A, Landau GM, Springer. 2001, 207218.
Yang Z, Sankoff D: Natural Parameter Values for Generalized Gene Adjacency. Journal of Computational Biology. 2010, 17: 11131128. 10.1089/cmb.2010.0099.
Benzaid B, Dondi R, ElMabrouk N: DuplicationLoss Genome Alignment: Complexity and Algorithm. LNCS, Volume 7810 of Language and Automata Theory and Applications, (LATA). 2013, 116127.
Dondi R, ElMabrouk N: Aligning and Labeling Genomes Under the DuplicationLoss Model. LNCS, Volume 7921 of Computability in Europe (CiE 2013). 2013, 97107.
Holloway P, Swenson K, Ardell D, ElMabrouk N: Ancestral Genome Organization: an Alignment Approach. Journal of Computational Biology. 2013, 20 (4): 280295. 10.1089/cmb.2012.0292.
Andreotti S, Reinert K, Canzar S: The DuplicationLoss Small Phylogeny Problem: From Cherries to Trees. Journal of Computational Biology. 2013, 20 (9):
TremblaySavard O, Benzaid B, Lang B, ElMabrouk N: Evolution of tRNA genes in Bacillus inferred with OrthoAlign. 2014, [Unpublished]
Sankoff D, Cedergren R, Lapalme G: Frequency of insertiondeletion, transversion, and transition in the evolution of 5S ribosomal RNA.
Kovac J, Brejova B, Vinar T: A practical algorithm for ancestral rearrangement reconstruction. LNBI, Volume 6833 of WABI. 2011, 163174.
Elias I: Settling the intractability of multiple alignment. Journal of Computational Biology. 2006, 13 (7):
Pe'er I, Shamir R: The median problems for breakpoints are NPcomplete. Journal of Computational Biology, Volume 5 of Electronic colloquium on computational complexity. 1998
Tannier E, Zheng C, Sankoff D: Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics. 2009, 10:
Acknowledgements
We thank the authors of [20] for providing us with their software and genome datasets.
Declarations
Publication charges for this work was funded by The Natural Sciences and Engineering Research Council of Canada (NSERC).
This article has been published as part of BMC Genomics Volume 15 Supplement 6, 2014: Proceedings of the Twelfth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S6.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Benzaid, B., ElMabrouk, N. Gene order alignment on trees with multiOrthoAlign. BMC Genomics 15 (Suppl 6), S5 (2014). https://doi.org/10.1186/1471216415S6S5
Published:
DOI: https://doi.org/10.1186/1471216415S6S5