Gene order alignment on trees with multiOrthoAlign

We relate the comparison of gene orders to an alignment problem. Our evolutionary model accounts for both rearrangement and content-modifying events. We present a heuristic based on dynamic programming for the inference of the median of three genomes and apply it in a phylogenetic framework. multiOrthoAlign is shown accurate on simulated and real datasets, and shown to significantly improve the running-time of DupLoCut, an "almost" exact algorithm based on linear programming, developed recently for the same problem.


Introduction
A major requirement in comparative genomics is to be able to compare genomes based on their whole content. This is necessary for a myriad of applications such as phylogenetic reconstruction, orthology and paralogy identification, ancestral reconstruction and the study of evolutionary events. Consequently, a large variety of algorithms have been developed for the comparison of wholegenome sequences with partial or no information on gene annotation. Most of them are based on first identifying, in a pair-wise alignment dotplot, local alignments (anchors, syntenies) with high similarity score, and then chaining them in a way maximizing an alignment score (cf. e.g. MUMmer [1], BLASTZ [2], LAGAN [3], DAGchainer [4], progressiveMauve [5]). Similarity scores are computed according to the local mutations (nucleotide substitutions and indels) inferred from the alignment. Other approaches compare genomes in terms of building block organization. Although a recently developed method does not require any preliminary information on gene families [6], most of them assume a full or partial annotation of genomes, or a previously established large coverage of genomes in terms of syntenic blocks. Given two genomes represented as ordered sequences of genes (or building blocks), the rearrangement approach consists in finding a sequence of global evolutionary events transforming one gene order to the other. Early work on genome rearrangement focused on sorting permutations (no duplicates) by rearrangements (inversions, translocations, transpositions) [7][8][9]. More recently, a variety of studies have considered the more difficult case of genomes with duplicates evolving through rearrangements, but also through content modifying operations such as duplications and losses (reviews in [10,11]). Other model-free approaches based on conserved synteny, with no assumption on the evolutionary mechanisms, have also been developed [6,[12][13][14][15][16].
In a recent set of papers [17][18][19] we related the comparison of two gene orders to an alignment problem: find an alignment between the two gene orders that can be interpreted by a minimum number of evolutionary events (rearrangements and content-modifying operations). Although alignments are a priori simpler to handle than rearrangements, this problem has been shown NP-hard for the duplication-loss model of evolution [17,18,20]. Exact exponential-time algorithms based on linear programming [19,20] and a polynomial-time heuristic based on dynamic programming [17] have been developed for this model. Recently [21], we developed OrthoAlign (alignment of orthologs), a time-efficient heuristic for the gene order alignment problem, that extends the dynamic programming approach to a model including rearrangements (inversions and transpositions).
Sequence and gene order alignments are useful for ancestral inference purposes. As explained in [19], a "labeled" pair-wise gene order alignment can be translated into a common ancestor and an evolutionary scenario leading to the two compared gene orders. Such an alignment approach for ancestral inference is relevant if the two gene orders reflect enough conservation so that we can assume that only few events have occurred since the divergence of the lowest common ancestor of the two genomes. For such closely related species, events can be assumed to be nonoverlapping (each gene involved in at most one event) and thus still visible in the alignment. The gene-order alignment approach has been shown useful to decipher the evolutionary mechanisms that have shaped the tRNA gene repertoires of the bacterium Bacillus [19].
Here, we undertake the next step, which is using the alignment approach on a phylogeny: infer ancestral genomes identified with each speciation node of a phylogenetic tree. The alignment on a tree problem introduced by Sankoff et al. in [22], consists in finding assignments of internal nodes in a way minimizing the total branch length of the tree according to a given distance. The result is, not only a set of ancestral genomes, but also a multiple alignment for extant sequences. As trying all possibilities for internal node assignments is intractable, iterative heuristics on subtrees are usually considered, the most popular being the median-based heuristic [10,23]: (1) find an initial assignment for internal nodes; (2) in a post-order traverse of the tree, improve the assignment of each internal node u by considering the median of the leaf-assignments of the 3-star tree centered on u, i.e., the tree formed by the three neighbouring nodes of u; (3) repeat until no improvement on the tree distance can be made. In the case of genomes represented as gene orders, applying the exact 2-SPP (2-Small Phylogeny Problem) algorithm [19] or OrthoAlign [21] to the cherries of the phylogeny can be used for an initial assignment. As for the iterative step, an efficient algorithm for the median problem has to be found. Although NP-hard for most versions of the problem [24][25][26], efficient heuristics have been developed for various nucleotide and rearrangement distances. As for the duplication-loss model of evolution, DupLoCut, an "almost" exact algorithm based on linear programming has been presented in [20].
In this paper, we present multiOrthoAlign for the alignment of a set of gene orders related through a phylogenetic tree. It is based on a dynamic programming approach generalizing OrthoAlign [17,21] to a 3-star tree, under a model involving a wide range of evolutionary events. multiOrthoAlign is compared with DupLoCut [20], the most closely related algorithm. Experiments on simulated and real datasets reveal similar accuracy for both algorithms, but with a significant improvement in running time for multiOrthoAlign.

Method
We consider uni-chromosomal genomes represented as strings of signed characters from an alphabet Σ, where each character represents a gene family. Each character may appear many times in a genome G, all such positions corresponding to genes belonging to the given gene family. The sign of a gene represents its transcriptional orientation. Let X = x 1 x 2 · · · x n be a string. We call the reverse of X the string −X = −x n · · · − x 2 − x 1 . We denote by X[i, i + k] the substring of X formed by the consecutive genes of the interval [i, i + k].
A phylogeny or species tree S for a set Γ of genomes is a tree with a one-to-one correspondence between the leaves of S and the species of Γ, reflecting the evolution of the genomes through speciation. Although the method developed in this paper does not require any assumption on the species tree, for ease of presentation, we consider binary and rooted phylogenies. An internal node of S corresponds to a speciation event and an assignment for that node corresponds to the genome at the moment of speciation. A phylogenetic alignment S for S is the tree S augmented with an assignment of one string for each internal node of S. When no ambiguity, we will make no difference between a node and its assignment. Two nodes are related if they belong to the same path from a leaf of S to the root, and unrelated otherwise. For two related nodes A ≠ X, A is an ancestor of X if A is closer to the root of S than X. For two unrelated nodes X ≠ Y, they are siblings if they share the same parental node. A pair of siblings is called a cherry. Moreover, we call a 3-star of S and we denote by A|XY a star-tree with three leaves A, X, Y such that X and Y are two siblings in S and A is the immediate ancestor of the parent M of X and Y. M is called the center of A|XY.

The evolutionary model
We assume that present-days genomes have evolved from an ancestral genome through rearrangement and content-modifying events, each event (operation) acting on a uni-chromosomal genome X and leading to a new uni-chromosomal genome Y. An operation is denoted by where O is the operation type, k is the length of the operation, O S is the source, i.e., the substring affected by the event and O T is the target, i.e., the new substring resulting from the event. Characters of O S and O T are said to be covered by the operation. The mostly considered content-modifying operations are duplications and losses, where: , is an operation that copies the substring X[i, i + k − 1] of size k to a location j outside the interval [i, i + k − 1] (i.e. preceding i or following i + k − 1);

for empty string) is an operation that removes the substring
The mostly considered uni-chromosomal rearrangements are reversals and transpositions, where: Denote by O the set of operation types. We will describe our approach for O = {D, L, R, T}. Including other events, such as inverted duplications or inverted transpositions with target being the reverse of the source, insertions which are the counterparts of losses, or substitutions replacing a string with another of the same size, do not add any complexity to the problem. Notice however that the more operations we include to the model, the more challenging is the problem of assigning appropriate operations costs.
Let S be a phylogeny and X, Y be two nodes of S. If X and Y are related, say X is an ancestor of Y, then a history O X Y for X and Y is a sequence of events (possibly of length 0) transforming X into Y. Otherwise, if X and Y are unrelated, then a history for X and Y is a triplet where A is an assignment of the lowest-common ancestral node of X and Y. We call a visible history for X and Y a history where the source and target of each operation is a substring of X or Y.
Finally, let A|XY be a 3-star of S. A history for A|XY is a quadruplet (M, where M is an assignment of the center of the 3-star. A visible history for A|XY is a history where the source and target of each operation is a substring of A, X or Y. Notice that a duplication with source and target in two different genomes can be interpreted as a duplication followed by the loss of the source (a relaxation of visibility), or alternatively as a transposition, or even as a horizontal gene transfer between the two considered genomes. We will take this general view of a duplication, which implicitly integrates transpositions.

Genome alignment
We begin by recalling the classical notion of an alignment of strings (genomes) Γ = {X k : 1 ≤ k ≤ g}. Let Σ − = Σ ∪ {−} be the alphabet Σ augmented with an additional character '-' called a gap. Then an alignment for Γ is a set¯ = {X k : 1 ≤ k ≤ γ } of strings obtained by filling X k with gaps, such that the resulting aligned genomes have equal length l, and for each position i, 1 ≤ i ≤ l, the column i is not empty in the sense that at least one of X k [i], for 1 ≤ k ≤ g, is not a gap. The induced alignment for a subset Γ' ⊂ Γ is the alignment Γ' obtained by removing from¯ all genomes that are not in Γ' and all empty columns. Given A multiple alignment is expected to reflect the evolutionary events that have led to the present-day genomes. The notion of an alignment labeling has been introduced in [19] for a pair-wise alignment. It relates each column of the alignment to a given operation. Generalization to an arbitrary number of genomes is given bellow. We will make use of this definition later in the context of a 3-star history.
for¯ is a set of operations covering the characters of the given sequences. For any l and m in [1, g] with l ≠ m and any i, is covered by at most one operation of¯ as follows: • if a match, then it is covered by no operation; • if a mismatch, then it is covered by a reversal; • if a gap, then it is covered by one of the other operations of O.
with the restriction that, if the two genomes are related, say X l is an ancestor of X m , then the source of the operation should be in X l and the target should be in X m .
A labeled alignment is an alignment¯ accompanied with a labeling L(¯ ). We simply refer to a labeled alignment by its labeling L(¯ ). The cost of a labeled alignment is the sum of costs of all its labeling events.
The above definition does not ensure a valid interpretation of a labeled alignment in terms of an evolutionary history (A, O A X , O A Y ) for two genomes X and Y. We showed in [19] [19] the one-to-one correspondence between feasible labeled alignments and visible histories for two genomes X and Y in case of an evolution through duplications and losses.

Phylogenetic alignment
Let S be a species tree for a genome set Γ. Call a feasible labeled phylogenetic alignment for S a phylogenetic alignment S accompanied with a feasible labeled alignment for each cherry (X, Y) of S , in other words a visible history (A, O A X , O A Y ) for each (X, Y). Such a feasible labeled phylogenetic alignment leads to a multiple alignment for Γ: traverse S in post-order and iteratively incorporate alignments of cherries in a current multiple alignment which is initially empty.
Let A and X be two genomes of S with A being an ancestor of X and let O A X = {O 1 (k 1 ), · · · O m (k m )} be a history for A and X. The cost of O A X is defined as . Now, the phylogenetic alignment problem, is to infer a feasible labeled phylogenetic alignment for S minimizing the sum of costs of all branches of S .
The relaxed phylogenetic alignment problem with no restriction on visibility, i.e. the problem of assigning ancestral configurations leading to a minimum cost for the tree, has been shown to be NP-hard for most formulations in terms of type of genomes and different distances. A classical heuristic strategy is known as the steinerization approach [23]. It begins with an initial assignment for the internal nodes of S , and in a postorder traversal it improves each internal node assignment by solving a 3-star problem defined as follows. 3 In the case of symmetrical operations, such as nucleotide substitutions or indels, or gene order rearrangements, the direction of evolution can be ignored, which leads to the median problem: find M minimizing C(M, A) + C(M, X) + C(M, Y). However, this is not the case for content-modifying operations, as for example a duplication from A to X is rather a loss from X to A, and therefore the evolutionary direction cannot be ignored in this case.
For the evolutionary model of interest, the restriction of the phylogenetic alignment problem to a cherry has been considered in [17,19]. The developed algorithm can be used for the initialization step: traverse the tree in a depth-first manner and compute successive ancestors of pairs of nodes. Here, we extend our study to a 3-star phylogeny, which allows for the application of the aforementioned steinerization approach. Notice that the phylogenetic alignment problem has been shown NPcomplete for the duplication-loss model of evolution, already for two species [20,17,18].
The 3-star Problem We first show that the 3-star problem for a 3-star A| XY reduces to finding a feasible labeled alignment for {A, X, Y} of minimum cost. It is easy to see that any visible history for A|XY leads to a unique feasible labeled alignment for {A, X, Y}. Conversely, let L(Ā,X,Ȳ) be a feasible labeled alignment for a 3-star A| XY. A corresponding visible history for A|XY can be obtained as follows (see Figure 1 for an example): Therefore, given a 3-star A|XY, we focus here on the problem of finding a feasible labeled alignment for {A, X, Y} of minimum cost.
Let C(i, j, k) (C f (i, j, k) respectively) be the minimum cost of a labeled (feasible labeled respectively) alignment of three prefixes A[1, i], X[1, j] and Y [1, k] of A, X and Y, for all 1 ≤ i ≤ |A|, 1 ≤ j ≤ |X| and 1 ≤ k ≤ |Y |.

• STEP 1. FINDING A LABELED ALIGNMENT BY A DYNAMIC PROGRAMMING APPROACH.
As explained previously, transpositions are implicitly considered by allowing the source and target of a duplication to belong to two different genomes. Therefore, we will restrict our presentation to the model O = {D, L, R}.
To compute C(i, j, k), we consider all the possibilities for the last column of an alignment of the three prefixes A[1, i], X [1, j] and Y [1, k] and interpret it by the minimum number of operations. In the following, a column is represented as a triplet of characters from Σ − , were different letters denote different characters of Σ. Clearly, each column can be interpreted by no more than 2 operations. If two operations are required to interpret a given column, then we assume them to be of the same size. This eliminates the case of a column of the form [a, x, y], as this would require two reversals of different sizes.
C(i, j, k) is the minimum over all the computed costs.
x, x]: Reversal in both X and Y (i.e. in M).
where l is the largest value such that X[j − l, j] has an occurrence in A, X or Y.
After computing all the values leading to C(|A|, |X|, |Y|), the labeled alignment L(Ā,X,Ȳ) obtained by a backtracking approach is not necessarily a feasible alignment as it may contain cycles. Notice that, since A is an ancestor of both X and Y, the target of an event cannot belong to A. Therefore only events with source and target in X or Y may belong to a cycle.  Then X[r] should be in the source of O p and in the target of O p+1 . But this leads to an interpretation of the corresponding column of L(Ā,X,Ȳ) with two events instead of one, which is in contradiction with the recurrences leading to a minimum number of events for each column. □ We resolve cycles as follows. Let Z be the set of all overlapping strings {Z 1 , Z 2 , . . . , Z h } of O c . Let ε i = {z i 1 , z i 2 , · · · z i l } be a set of substrings of Z i(1≤i≤h ) of minimum cardinality such that z i 1 z i 1 · · · z i l = Z i and z i k (1≤k≤l) has an occurrence in A. Let Z t be the string for which |E t | = min(|E 1 |, |E 2 |, . . . |E h |). Assume w.l.o.g. that Z t is a substring of X. Then Z t in L(X,Ȳ) is covered by a loss in Y, and each substring of Z t in L(Ā,X) is covered by a duplication in X (source in A) (see Figure 2 for details).
Complexity: For simplicity, assume that |A| = |X| = |Y| = n. From the recurrences detailed above, each C(i, j, k) can be computed in linear time, leading to an O(n 4 ) worst-time complexity for Step 1. Now, the complexity of Step 2 depends on the complexity for finding all cycles and resolving them. As cycles can only involve strings from X and Y, the problem reduces to the case of cycle-resolution for a pair-wise alignment, which has been shown quadratic (submitted journal version of [17]). This leads to a worst-time complexity of O(n 4 ) for the whole algorithm.

Experimental results
We call multiOrthoAlign our algorithm for the phylogenetic alignment problem based on the steinerization approach described in Section and using our 3-star algorithm for the iteration step.
In this section, we compare multiOrthoAlign with DupLoCut [20], on simulated and real-world instances.
DupLoCut is an "almost" exact heuristic based on linear programming. For the sake of comparison with DupLoCut [20], we consider a model restricted to duplications and single gene losses. Indeed, DupLoCut is restricted to this evolutionary model. Moreover, we consider the default cost of one for each event.

Simulations
We generate phylogenetic trees with 3 extant genomes. The genome at the root is generated in 2 steps. First, a random sequence R of length n on an alphabet of size s is generated. Then, l moves (duplications and single gene losses) are applied to R where duplication length follows the geometric distribution of parameter 0.5. All other genomes along the tree are generated by applying l moves to their direct ancestor.
Execution time: We compare the running-time of our 3-star algorithm with that used in DupLoCut for the reoptimization steps. Running times were recorded using a 8-core Intel(R) 3.6 GHZ processor, with 16 GiB of memory. Table 1 gives average running times after one round (iteration) of reoptimization for simulations generated with three choices of parameters n, s and l. Although multiOrthoAlign's running time increases slightly with increasing values of n, s and l, it is still within a few minutes for n = 250. In comparison, the same data took more than 6 hours to be processed by DupLoCut.
Accuracy: In order to test the performance of multi-OrthoAlign in terms of accuracy, we used two mea- where Inf is the number of events inferred by multiOrthoAlign and Opt is the "almost optimal" number of events obtained by running DupLoCut; Accuracy = NbOpt Total , where N bOpt is the number of simulations among Total (number of all simulations) for which multiOrthoAlign returns the same number of events as DupLoCut.
The same algorithm (2-SPP [19]) was used for the initialization step of both multiOrthoAlign and DupLo-Cut. Figure 3 gives results for different choices of the parameter l. With ratios s/n = 1/2 and l/n = 1/20, mul-tiOrthoAlign returns the same cost as DupLoCut for Figure 2 Two different labeling for the alignment of strings A="abcde", X="acdeabdeabde" and Y="acde". Losses are denoted by "L" and duplications by arrows from source (indicated by bracket) to target. In the left labeling, "a 2 b 1 d 2 e 2 " is interpreted as the target of a duplication. In the right one, it is interpreted in L(X,Ȳ) as a loss, and in L(X,Ȳ) as 2 duplications. more than 96% of the simulations. This accuracy rate remains stable for decreasing alphabet size (results not shown), i.e., for increasing number of gene copies, but decreases quickly as the number l of moves increases (left diagram of Figure 3). However, the average Error remains lower than 0.008 (right diagram of Figure 3).
In order to test the algorithm on larger trees, we generated a phylogenetic tree with 100 extant genomes. The genomes along the tree were generated as described above for triplet phylogenies, with parameters n = 100, s = 50 and l = 5. Figure 4 illustrates the total cost of the tree (number of duplication/single gene loss events) obtained after each iteration of multiOrthoAlign (blue line) and DupLoCut (red line). After the initialization step (iteration 0), the total cost obtained by multi-OrthoAlign is 1632. After 6 rounds of reoptimization, the two programs converge to a local minimum (no improvement can be made), with a total cost of 1100 for multiOrthoAlign and of 1124 for DupLoCut. Our cost is always slightly better in this case. Notice that, although DupLoCut is "almost" exact for the median problem, the whole steinerization procedure does not guarantee any optimality result.
Using OrthoAlign instead of the 2-SPP algorithm for the initialization step would be something natural to do for reducing the running time of the whole procedure. However, as illustrated in Figure 4 (green line), the initial assignment obtained with OrthoAlign in this case leads to a cost of 1930 which is far from the best solution found. Notice that 2-SPP is an exact algorithm for pair-wise alignment and OrthoAlign is a heuristic which does not guarantee the optimal result. multiOrthoAlign converge to a local minimum of 1401 events after 4 rounds of reoptimization.

Real data
We also compared the two approaches on the set of realworld instances used in [20]. The set contains the stable RNA genes of 12 Bacillus strains of four species (amyloliquefaciens, subtilis, thuringiensis, and cereus). The phylogeny shown in Figure 5 is taken from the webpage (http:// ccb.jhu.edu/software/duplocut).
Using 2-SPP for the initialization step, multiOrthoAlign leads to a cost of 136 after the initialization step, and converges to a local minimum of 123 events after 2 rounds of Running times comparison between multiOrthoAlign and DupLoCut on simulated triplet phylogenies after one round of reoptimization. Times are averaged over 50 simulations for the first choice, and 10 simulations for the second and the last one. Average running times are reported in minutes.

Figure 3
Performance of multiOrthoAlign in terms of accuracy. The genome length is fixed to n = 100 and the alphabet size to s = n/2; diagrams are obtained by varying the number of moves l (x-axis is l/n); results are averaged over 50 simulations. Left: Accuracy of multiOrthoAlign compared with DupLoCut. Right: the average Error. Figure 4 Total cost obtained by multiOrthoAlign versus the one obtained by DupLoCut. Blue refers to the cost obtained by multiOrthoAlign when we used the 2-SPP algorithm for the initialization step, green to the cost obtained by multiOrthoAlign when we used OrthoAlign for the initialization step, and red to the cost obtained by DupLoCut.
reoptimization. As for DupLoCut, it converges to a local minimum of 120 events after 5 rounds of reoptimization. However, using OrthoAlign instead of 2-SPP for the initialization step, multiOrthoAlign leads to a cost of 131 after the initialization step, which is not refined by subsequent iterations. It therefore appears that 2-SPP is a more appropriate initialization procedure than OrthoAlign.

Conclusion
We have developed multiOrthoAlign, a phylogenetic alignment algorithm for a genome-wide evolutionary model involving duplications, losses and rearrangements. It uses a generalization of OrthoAlign [21], a recently developed pair-wise alignment algorithm, to the median of three genomes. Our algorithm for the median problem is a heuristic that does not guarantee any optimality result. Compared with DupLoCut, the most closely related existing algorithm, multiOrthoAlign exhibits similar results but is much faster. The method can be easily extended to other contentmodifying and rearrangement operations such a substitutions, insertions, tandem duplications or inverted duplications. However, the more operations we add, the more challenging is the problem of finding appropriate costs for operations, and appropriate criteria to deal with the non-uniqueness of solutions.