Genomic duplication problems for unrooted gene trees
BMC Genomics volume 17, Article number: 15 (2016)
Abstract
Background
Discovering the location of gene duplications and multiple gene duplication episodes is a fundamental issue in evolutionary molecular biology. The problem introduced by Guigó et al. in 1996 is to map gene duplication events from a collection of rooted, binary gene family trees onto their corresponding rooted binary species tree in such a way that the total number of multiple gene duplication episodes is minimized. There are several models in the literature that specify how gene duplications from gene families can be interpreted as one duplication episode. However, in all duplication episode problems studied so far the gene trees are rooted. This restriction limits applicability, since phylogenetic methods frequently infer unrooted gene family trees.
Results
In this article we present the first solution to the open problem of episode clustering where the input gene family trees are unrooted. In particular, by using theoretical properties of unrooted reconciliation, we give an efficient algorithm that reduces this problem to the episode clustering problems defined for rooted trees. We prove theoretical properties of the reduction algorithm and evaluate it on empirical datasets.
Conclusions
We provided algorithms and tools that were successfully applied to several empirical datasets. In particular, our comparative study shows that we can improve known results on genomic duplication inference from real datasets.
Background
Genomic duplication plays an important role in the evolution of life on Earth. This phenomenon has been extensively studied in recent decades for plant, bacterial and many other genomes [1–7]. Duplication events can involve individual genes, genomic segments or whole genomes. While the reconstruction of the evolutionary history of individual genes is generally well established [8–13], still little is known about the inference of large genomic duplications that can span thousands of gene families.
In our approach we use the model of reconciliation, in which a gene tree is reconciled with its species tree. The concept of reconciliation was introduced by Goodman [14] and formalized by Page [8] in the context of reconciling potential incongruence between a rooted gene family tree and its species tree. In this model, differences between gene and species trees are explained in terms of evolutionary events such as gene duplication, gene loss and speciation. Reconciliation can be interpreted as the embedding of a gene tree into a species tree where these evolutionary events, located in the species tree, induce a biologically consistent scenario [15]. Tree reconciliation has been extensively studied in recent decades in many theoretical and practical contexts including supertree inference, error correction and HGT detection [16–24]. In the process of reconciliation, which is relatively simple from a computational point of view, each gene from a single gene family is mapped into the species tree and classified either as a single gene duplication or as related to a speciation. However, the problem becomes much more complex when a gene duplication is part of a large genomic duplication, called a multiple gene duplication episode, in which parts of a genome are duplicated. In fact, it is known that a large duplication event is usually followed by many gene losses and gene rearrangements. In consequence, the reconstruction of large gene duplication events may be difficult.
The first approach to detect multiple gene duplication episodes from a collection of rooted gene trees was proposed by Guigó et al. [10]. For a given collection of rooted gene trees and a rooted species tree, the authors proposed a heuristic to aggregate single gene duplication events into a large gene duplication. This approach was formalized and refined by Page and Cotton [25]. They formally defined the problem of episode clustering (EC) as the problem of finding the minimal number of locations in the species tree where all duplications from the input gene trees can be placed. This model was applied in the context of the supertree problem by Fellows [26]. Burleigh et al. [27] and Bansal and Eulenstein [28] proposed the first polynomial time solutions for two types of multiple gene duplication problems: episode clustering (EC) and a more general variant of clustering called minimum episodes (ME). Finally, Luo et al. [29] proposed linear time and space algorithms for these problems.
While the classical reconciliation model is applicable to rooted trees only, most standard phylogenetic inference methods, like maximum likelihood, maximum parsimony or neighbour joining, infer unrooted gene family trees, and it is often difficult to identify credible rootings. For example, outgroup rooting can result in incorrect rootings when evolutionary events cause heterogeneity in the gene trees, and rooting gene trees under the molecular clock assumption, or similarly by using midpoint rooting, can also result in error when there is molecular rate variation throughout the tree [30, 31]. Tree reconciliation has been successfully extended to reconcile an unrooted gene tree with a rooted species tree by seeking a rooting of the unrooted gene tree that invokes the minimum number of evolutionary events, such as gene duplications (D) or gene duplications and losses (DL), in the context of a given species tree [32, 33]. It is known that the rooting edges with minimal D or DL cost induce a full subtree, called a plateau, in the unrooted gene tree [34].
In this article we present the first solution to the open problem [27] of unrooted episode clustering, that is, the problem of episode clustering where the input consists of unrooted gene trees. We show that for a given set of unrooted gene trees and a species tree we can solve unrooted episode clustering by reducing it to the rooted episode clustering problem, which has linear time complexity. Our solution requires linear time preprocessing and the creation of at most 1+2^{k} collections of rooted gene trees, that is, instances of the rooted EC problem, where k is the number of input gene trees having a special topology located in the plateau of the duplication cost (formally, the condition requires two stars S2 [32]). Usually k represents a small fraction of the whole input; thus, this condition significantly reduces the complexity. In other words, we show that the problem of unrooted episode clustering is fixed-parameter tractable. Finally, in a number of empirical computational experiments we show that, despite the exponential worst-case complexity, our algorithm is able to resolve instances of the problem after the verification of at most two rooted datasets. In consequence, our solution can be efficiently applied to locate duplication clusters in collections of unrooted gene trees.
Results
Basic notation
A species tree is a rooted binary tree with leaves uniquely labeled by the names of species. Throughout this work, the species tree is fixed, therefore, we use S to denote it. A rooted gene tree is a rooted binary tree with leaves labeled by the names of species. The set of species present in T is denoted by \(\mathcal {L}(T)\). The rooted tree (T _{1},T _{2}) has two subtrees T _{1} and T _{2} whose roots are children of the tree root. Additionally, for nodes a and b, a≼b means that a and b are on the same path from the root, with b being closer to the root than a. We write a≺b if a≼b and a≠b. The root of a tree T we denote by root(T).
Let T=〈V _{ T },E _{ T }〉 be a rooted gene tree such that \(\mathcal {L}(T) \subseteq \mathcal {L}(S)\). The least common ancestor (lca) mapping, M _{ T }:V _{ T }→V _{ S }, is defined as follows. If v is a leaf in T then M _{ T }(v) is the leaf in S labeled by the label of v. When v is an internal node in T having two children a and b, then M _{ T }(v) is the least common ancestor of M _{ T }(a) and M _{ T }(b) in S. An internal node g∈V _{ T } is called a duplication if M _{ T }(g)=M _{ T }(a) for a child a of g. The duplication cost, denoted by D(T,S), is the total number of duplications in T. Every non-duplication node of T is called a speciation. The total number of gene losses required to reconcile T and S can be defined by: \(\mathsf {L}(T,S)=2\mathsf {D}(T,S)+\sum _{g\ \text {is internal},\ a,b\ \text {children of}\ g} (\|\mathsf {M}_{T}(a),\mathsf {M}_{T}(b)\|-2)\), where ∥a,b∥ is the number of edges on the path connecting a and b in S. Finally, we define the duplication-loss cost of reconciling a rooted gene tree T and a species tree S as follows: DL(T,S)=D(T,S)+L(T,S) [34]. Examples of the reconciliation are depicted in Fig. 1.
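The definitions above translate into a short bottom-up computation. The following Python sketch is illustrative only and is not the authors' implementation. It assumes that trees are given as nested pairs of species names and that each node of S is identified by the set of species below it; the helper names `species_depths`, `lca` and `reconcile` are hypothetical.

```python
def species_depths(S):
    """Map every node of S (identified by its set of species) to its depth."""
    depth = {}
    def walk(t, d):
        if isinstance(t, str):
            c = frozenset([t])
        else:
            c = walk(t[0], d + 1) | walk(t[1], d + 1)
        depth[c] = d
        return c
    walk(S, 0)
    return depth

def lca(depth, x, y):
    """Least common ancestor in S: the deepest node whose species set covers x and y."""
    return max((c for c in depth if x | y <= c), key=depth.get)

def reconcile(T, depth):
    """Return (M(root of T), D(T,S), L(T,S)) under the lca-mapping."""
    if isinstance(T, str):
        return frozenset([T]), 0, 0
    ma, da, la = reconcile(T[0], depth)
    mb, db, lb = reconcile(T[1], depth)
    m = lca(depth, ma, mb)
    dup = int(m in (ma, mb))                     # duplication: mapped like one of its children
    path = depth[ma] + depth[mb] - 2 * depth[m]  # edges between M(a) and M(b) in S
    return m, da + db + dup, la + lb + 2 * dup + path - 2

S = (("a", "b"), ("c", "d"))   # toy species tree (assumption, not from the paper)
T = (("a", "c"), ("b", "d"))   # toy rooted gene tree
_, D, L = reconcile(T, species_depths(S))
print(D, L, D + L)             # duplication, loss and duplication-loss costs
```

For this toy input the sketch reports one duplication (at the root of T) and four losses. The quadratic-time `lca` is chosen for brevity; it is not the linear-time machinery referred to in the text.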
Unrooted reconciliation
An unrooted gene tree is an undirected acyclic connected graph in which each node has degree 1 (leaves) or 3 (internal nodes), and the leaves are labeled by the names of species. For an unrooted gene tree G=〈V _{ G },E _{ G }〉 and an edge e∈E _{ G }, we denote by G _{ e } the rooting of G obtained by placing the root on e. Such a rooting induces the duplication cost D(G _{ e },S). We call a rooting, or the corresponding edge, D-minimal if it has the minimal duplication cost. It follows from the theory of unrooted reconciliation [32, 34] that the set of D-minimal edges, called the D-plateau, is a full subtree of G. The same property holds for the DL-plateau, that is, the set of edges with the minimal duplication-loss cost. We use a similar notation for DL-minimal edges, rootings and so on. The most important property of these plateaus is stated below.
Theorem 1 (From [34]).
The DL-plateau is a subgraph of the D-plateau.
Without loss of generality we assume that every root of a gene tree is mapped into the root of S, denoted by ⊤, and both trees are non-trivial. An edge e=〈v,w〉 of G is empty if the root of G _{ e } is a speciation, i.e., \(\mathsf {M}_{G_{e}}(v) \neq \top \neq \mathsf {M}_{G_{e}}(w)\). We call e double if \(\mathsf {M}_{G_{e}}(v)=\top =\mathsf {M}_{G_{e}}(w)\). Otherwise, e is called single. A single edge e is called v-incoming or w-outgoing if \(\mathsf {M}_{G_{e}}(v) \neq \top = \mathsf {M}_{G_{e}}(w)\).
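The notions of rooting, D-plateau and edge type can be made concrete with a brute-force sketch that roots an unrooted gene tree on every edge. This is an illustration under stated assumptions, not the linear-time method cited in the text: the unrooted tree is given as an edge list plus a hypothetical `label` dictionary from leaf identifiers to species, species-tree nodes are identified by their species sets, and every rooting is assumed to map its root to ⊤.

```python
def species_clusters(S):
    """All nodes of the species tree S, each identified by its set of species."""
    out = set()
    def walk(t):
        c = frozenset([t]) if isinstance(t, str) else walk(t[0]) | walk(t[1])
        out.add(c)
        return c
    walk(S)
    return out

def root_map_and_dups(T, clusters):
    """Return (lca-mapping of the root of T, number of duplications in T)."""
    if isinstance(T, str):
        return frozenset([T]), 0
    a, da = root_map_and_dups(T[0], clusters)
    b, db = root_map_and_dups(T[1], clusters)
    m = min((c for c in clusters if a | b <= c), key=len)  # lca in S
    return m, da + db + int(m in (a, b))

def analyse_edges(edges, label, S):
    """For every edge e: (duplication cost of the rooting G_e, type of e)."""
    clusters = species_clusters(S)
    top = max(clusters, key=len)                 # the root of S
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    def side(node, parent):
        # rooted subtree hanging from `node` when looking away from `parent`
        kids = [n for n in adj[node] if n != parent]
        return label[node] if not kids else (side(kids[0], node), side(kids[1], node))
    result = {}
    for v, w in edges:
        tv, tw = side(v, w), side(w, v)
        mv = root_map_and_dups(tv, clusters)[0]
        mw = root_map_and_dups(tw, clusters)[0]
        cost = root_map_and_dups((tv, tw), clusters)[1]
        kind = ("empty", "single", "double")[(mv == top) + (mw == top)]
        result[(v, w)] = (cost, kind)
    return result

S = (("a", "b"), ("c", "d"))                     # toy data (assumption)
edges = [("x", "a1"), ("x", "b1"), ("x", "y"), ("y", "c1"), ("y", "d1")]
label = {"a1": "a", "b1": "b", "c1": "c", "d1": "d"}
info = analyse_edges(edges, label, S)
best = min(cost for cost, _ in info.values())
d_plateau = [e for e, (cost, _) in info.items() if cost == best]
print(info, d_plateau)
```

In this toy quartet the internal edge is the only empty edge and forms the whole D-plateau, while the four leaf edges are single.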
Let v be an internal node of G. A star with center v consists of three edges, denoted by e _{ a }, e _{ b } and e _{ c }, sharing v and incident to nodes a, b and c, respectively (see Fig. 2). There are several types of star topologies based on the above classification of edges: the S1 star has one v-incoming edge and two v-outgoing edges, the S2 star has exactly two v-outgoing edges and one empty edge, the S3 star has two v-outgoing edges and one double edge, in the S4 star all three edges are double, and the S5 star has one v-outgoing edge and two double edges. The star topologies are depicted in Fig. 2.
Theorem 2 (Adapted from [32]).
For a given unrooted gene tree G, we have

either G has exactly one empty edge or G has at least one double edge,

if the DL-plateau of G consists of exactly one edge, then this edge is either empty or double, and all other edges are single,

if the DL-plateau of G has more than one edge, then it contains all edges present in stars S4 and S5, and all other edges are single.
Note that if a gene tree has an empty edge, then it has at most two stars S2 (see examples in Fig. 3).
Episode clustering problems
To model gene duplication episodes we allow a gene duplication to be relocated from its lca-mapping location to one of its ancestors. In other words, we introduce mappings representing evolutionary scenarios that can differ from the scenario defined by the lca-mapping. Additionally, we require that the total number of gene duplications is minimal. To ensure biological correctness of such mappings, we introduce several conditions, e.g., time order preservation.
A mapping F _{ G }:V _{ G }→V _{ S } is called valid if the following conditions are satisfied:

F _{ G }(a)≼F _{ G }(b) if a≼b (time consistency),

F _{ G }(a)=M _{ G }(a) for any speciation node a (fixed speciations),

F _{ G }(a)≽M _{ G }(a) for any duplication node a (duplication can be raised),

F _{ G }(a)≺M _{ G }(b) for any speciation node b such that a≺b (fixed number of gene duplications).
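The four validity conditions can be checked mechanically. The sketch below is a hedged illustration, not the authors' code: gene-tree nodes are identified by their path from the root (a tuple of 0/1 choices), species-tree nodes by their species sets, so that ≼ in S becomes set inclusion; the names `lca_mapping` and `is_valid` are hypothetical, and F is assumed to take values among the nodes of S.

```python
def species_clusters(S):
    """All nodes of S, each identified by its set of species."""
    out = set()
    def walk(t):
        c = frozenset([t]) if isinstance(t, str) else walk(t[0]) | walk(t[1])
        out.add(c)
        return c
    walk(S)
    return out

def lca_mapping(T, clusters):
    """Return (M, dups): the lca-mapping per node path and the duplication nodes."""
    M, dups = {}, set()
    def walk(t, path):
        if isinstance(t, str):
            M[path] = frozenset([t])
        else:
            a = walk(t[0], path + (0,))
            b = walk(t[1], path + (1,))
            M[path] = min((c for c in clusters if a | b <= c), key=len)
            if M[path] in (a, b):
                dups.add(path)
        return M[path]
    walk(T, ())
    return M, dups

def is_valid(F, M, dups):
    """Check the four conditions of a valid mapping F against the lca-mapping M."""
    for a in M:
        if a in dups:
            if not M[a] <= F[a]:                 # duplication can only be raised
                return False
        elif F[a] != M[a]:                       # fixed speciations (and leaves)
            return False
        for i in range(len(a)):                  # every proper ancestor b of a
            b = a[:i]
            if not F[a] <= F[b]:                 # time consistency
                return False
            if b not in dups and not F[a] < M[b]:  # stay strictly below speciations
                return False
    return True

S = (("a", "b"), "c")                            # toy data (assumption)
T = ((("a", "a"), ("a", "b")), "c")
M, dups = lca_mapping(T, species_clusters(S))
raised = dict(M)
raised[(0, 0)] = frozenset("ab")                 # raise the (a,a) duplication by one level
print(is_valid(M, M, dups), is_valid(raised, M, dups))
```

In this example the lca-mapping itself is valid and places the two duplications at two different nodes of S, whereas the raised mapping is also valid and places both at the same node.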
It can be shown that every valid mapping uniquely defines an evolutionary scenario represented by a DLS-tree [15]. Additionally, every DLS-tree obtained from a valid mapping can be transformed into the optimal evolutionary scenario (i.e., the lca-based scenario) by a sequence of TMOVE (i.e., lowering duplication) transformations. Please refer to [15] for more details on formal modeling of evolutionary scenarios. Observe that the above model is more general than the model from [28].
We denote by Dup(T) the set of all duplication nodes in T. Let G _{1},G _{2},…,G _{ n } be a collection of rooted gene trees. Assume that, for every i∈{1,2,…,n}, F _{ i } is a valid mapping between G _{ i } and the species tree S. Every element \(s \in \bigcup _{i} \mathsf {F}_{i}(\mathsf {Dup}(G_{i}))\) denotes the location of multiple gene duplication events in S. Such locations will be called duplication episodes. A duplication cluster for s is the set of all gene duplications present in the trees G _{ i } that are mapped to s. By the ⊤-cluster we denote the duplication cluster whose elements are mapped to ⊤.
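As a small illustration of these definitions, the following sketch groups duplications by the species node they are mapped to. It assumes, hypothetically, that each valid mapping F _{ i } is already available as a dictionary restricted to the duplication nodes of G _{ i }; node and species names are arbitrary labels.

```python
def duplication_clusters(mappings):
    """mappings[i]: dict from duplication node of gene tree i to a node of S.

    Returns a dict: species node -> set of (tree index, duplication node)."""
    clusters = {}
    for i, F in enumerate(mappings):
        for g, s in F.items():
            clusters.setdefault(s, set()).add((i, g))
    return clusters

# Toy input (assumption): two gene trees with three duplications in total.
F1 = {"g1": "top", "g2": "ab"}
F2 = {"h1": "top"}
clusters = duplication_clusters([F1, F2])
print(len(clusters))        # number of duplication episodes for these mappings
print(clusters["top"])      # the cluster of duplications mapped to the root of S
```

The number of keys is the number of duplication episodes induced by the given mappings; the EC problem asks for valid mappings that minimize this number.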
Problem 1 (Rooted Episode Clustering (EC)).
Given a collection of rooted gene trees G _{1},G _{2},…,G _{ n } and a species tree S. Compute the minimal number of duplication episodes, denoted by EC(G _{1},G _{2},…,G _{ n },S), in the set of all valid mappings F _{1},F _{2},…,F _{ n } such that \(\mathsf {F}_{i} \colon V_{G_{i}} \rightarrow V_{S}\).
This problem can be solved in linear time and space [29]. In this article we solve the following problem.
Problem 2 (Unrooted Episode Clustering (UEC)).
Given a collection of unrooted gene trees G _{1},G _{2},…,G _{ n } and a species tree S. Compute the minimal EC(T _{1},T _{2},…,T _{ n },S) in the set of rooted gene trees {T _{1},T _{2},…,T _{ n }} such that T _{ i } is a rooting obtained from G _{ i } by placing the root on an edge from the D-plateau.
Observe that we allow rootings only in the D-plateau. Otherwise, the total number of gene duplications is not minimal. By single-UEC we denote the problem UEC for a single unrooted gene tree, i.e., when n=1. Every edge in an unrooted gene tree that induces the optimal solution for single-UEC will be called optimal (for single-UEC). For convenience, we use EC(T _{1},T _{2},…,T _{ n }) instead of EC(T _{1},T _{2},…,T _{ n },S).
Episodes in a gene tree with an empty edge
In this section we solve the single-UEC problem for the case when the input gene tree has one empty edge.
Let v be a center of the star that contains the only DL-plateau edge in a gene tree G. This star induces three rooted subtrees T _{ a }, T _{ b } and T _{ c } rooted at neighbours a, b and c, respectively, as indicated in Fig. 2. Let \(\mathbb {1}[p]\) be the indicator function, that is, \(\mathbb {1}[p]\) is 1 if p is satisfied and 0 otherwise.
Lemma 1.
Let a _{0},a _{1},a _{2},…,a _{ n+1} (for n≥0) be the path of D-plateau nodes connecting v=a _{0} and a _{ n+1}∈T _{ a } in G. Let G _{ n } be the D-minimal rooting induced by the edge 〈a _{ n },a _{ n+1}〉. If e _{∗}=〈v,c〉 is empty then
where T _{1},T _{2},…,T _{ n+1} are subtrees of T _{ a } such that T _{ a }=(T _{1},(T _{2},…,(T _{ n },T _{ n+1})…)) and the root of T _{ n+1} is a _{ n+1} (see Figs. 2 and 4).
Proof.
First we show that v is a speciation node in G _{ n }. It follows from the fact that v is a center of an S2 star and 〈v, b〉 is single. Thus, M _{ n }(v)=⊤, M _{ n }(c)≺⊤ and M _{ n }(b)≺⊤, where M _{ n } is the lca-mapping for G _{ n }. From the fact that M _{ n }(v)=⊤ we conclude that all nodes on the path connecting the parent of v with the root in G _{ n } are mapped to ⊤; therefore, they are duplications.
Let us consider the number of duplication clusters in G _{ n }. We have the ⊤-cluster composed of the duplication nodes a _{1},a _{2},…,a _{ n },root(G _{ n }) mapped to ⊤. Both T _{ c } and T _{ b } in G _{ n } are under the speciation node v, so their clusters are disjoint from the ⊤-cluster. Finally, if the root of some T _{ i } is a duplication then its cluster can be merged with the ⊤-cluster. Therefore, the ⊤-cluster contributes to EC(G _{ n }) only if the root of T _{ i } is a speciation for every i. Now, it is easy to conclude the final formula.
Lemma 2.
Under the assumptions from the previous lemma, we have
where G _{∗} is the rooting induced by the empty edge e _{∗}=〈v,c〉 (see Fig. 4).
Proof.
Both rootings G _{ n } and G _{∗} are D-minimal. Hence, D(G _{∗},S)=D(G _{ n },S) and, in consequence, the numbers of duplication nodes in A={a _{1},a _{2},…,a _{ n },v,root(G _{∗})} in G _{∗} and in B={a _{1},a _{2},…,a _{ n },v,root(G _{ n })} in G _{ n } are equal. It follows from the properties of star S2 that in G _{ n } node v is a speciation mapped to ⊤. Hence, all predecessors of v are duplications in G _{ n }. Thus, we have exactly n+1 duplications in B. On the other hand, by star S2, root(G _{∗}) is a speciation, therefore all remaining nodes in A are duplications.
We conclude that G _{ n } has the ⊤-cluster containing the duplications from B, and G _{∗} has a cluster (mapped below ⊤) containing the duplications from A. We call these two clusters high clusters. If the root of one of the subtrees T _{ i } is a duplication, then it can be merged with the high cluster in both rootings. Otherwise, if every root of these subtrees is a speciation, then the high cluster is disjoint from the clusters of T _{1},T _{2},…,T _{ n+1}. Moreover, if b is a duplication then the high cluster contains b in G _{∗}. However, in G _{ n } the cluster of b will be disjoint from the ⊤-cluster due to the speciation node v. Combining the above observations we obtain our formula.
Lemma 1 and Lemma 2 complete the case of gene trees with an empty edge: for single-UEC, rooting on the empty edge yields the minimal EC.
Episodes in a gene tree with a double edge
We start with two technical lemmas on the properties of the plateaus.
Lemma 3.
If the DL-plateau consists of exactly one double edge then the D-plateau and the DL-plateau are equal.
Proof.
Let 〈v,a〉 be the DL-plateau edge (see Fig. 2). It follows from the property of star S3 that both v and a are mapped to ⊤ in the DL-minimal rooting and their children (if present) are mapped below ⊤. Hence, the root is a duplication, while v and a are speciation nodes. Now, it is easy to show that rooting on edge 〈v,b〉 (or 〈v,c〉) induces one additional gene duplication at v. We conclude that the only edge with the minimal duplication cost is 〈v,a〉.
We say that a node g of an unrooted gene tree G is a super-duplication if g is a duplication in every rooting of G. Recall that the plateau is a subtree of a gene tree; thus a leaf of the D-plateau may refer to an internal node of a gene tree. For example, in Fig. 3, the D-plateau of G _{1} has four leaves: one is an internal node of G _{1} and the others, labeled a, c, e, are leaves of G _{1}.
Lemma 4.
If the DL-plateau has a double edge then

every leaf of the D-plateau is a speciation in every rooting from the D-plateau,

and every internal node of the D-plateau is a super-duplication.
Proof.
For the first part of the proof, let us assume that v is a leaf of the D-plateau. By using the notation from Fig. 2, let v be a center of a star such that 〈v,a〉 belongs to the D-plateau. Assume that v is a duplication in every D-minimal rooting. Then, the D-minimal rooting G _{〈v,a〉} has one duplication in v. The edge 〈v,b〉 does not belong to the D-plateau, therefore the rooting G _{〈v,b〉} has at least one more duplication than G _{〈v,a〉}. Hence, G _{〈v,b〉} has two duplications, in v and in the root. Moreover, the root of G _{〈v,a〉} is not a duplication. However, this is possible only when T _{ a } and T _{ v } are mapped below ⊤, thus 〈v,a〉 is an empty edge, which contradicts Theorem 2. This completes the first part of the proof.
Next, if the DL-plateau consists of exactly one double edge, then, by Lemma 3, the property holds trivially. Now, we assume that the DL-plateau has more than one edge. We show that every internal node v of the DL-plateau is a super-duplication. From Theorem 2 we know that v is incident to at least two double edges. Hence, in any rooting at least one of its children is mapped to ⊤. We conclude that v is a duplication mapped to ⊤.
Let us consider a path p=v _{1},v _{2},…,v _{ n } (n>1) connecting an internal node v _{1} from the DL-plateau with a leaf v _{ n } from the D-plateau. We show that the first n−1 nodes on p are duplications for every rooting placed on this path. It follows from the first part of this proof that v _{1} is a super-duplication mapped to ⊤. Hence, when rooting at 〈v _{ n−1},v _{ n }〉, we have n gene duplications: for v _{1},v _{2},…,v _{ n−1} and one for the root. All edges from p are elements of the D-plateau, thus moving the root to other edges on p will preserve the total number of gene duplications.
It should be clear that the same holds when choosing other root positions. We omit the details.
In the next lemma we show that rootings at edges of the D-plateau induce the same EC cost.
Lemma 5.
If an unrooted gene tree G has no empty edge then for any D-minimal rooting of G denoted by G _{∗}
\(\mathsf {EC}(G_{*}) = 1 + \mathsf {EC}(T_{1},T_{2},\ldots ,T_{n})\),
where T _{1},T _{2},…,T _{ n } are the rooted subtrees of G obtained from G by removing all internal nodes of the D-plateau.
Proof.
It follows from Lemma 4 and its proof that all internal nodes of the D-plateau are present in the ⊤-cluster in the clustering with the minimal number of clusters. This cluster is separated from other duplication clusters by speciation nodes located on the border of the D-plateau. Thus, the clusters induced by an optimal solution to EC for G _{∗} are the clusters induced by an optimal solution to EC for T _{1},T _{2},…,T _{ n } plus the ⊤-cluster.
Solutions
Now we present solutions to our unrooted episode clustering problem.
Theorem 3 (Solution to single-UEC).
For any gene tree G, an edge e is optimal for single-UEC if either e is empty or e is in the D-plateau and G has a double edge.
Proof.
The first part of the proof follows immediately from Lemma 2 and the second part from Lemma 5.
Theorem 4.
For a collection of unrooted gene trees G _{1},G _{2},…,G _{ n }, if every gene tree has a double edge then rooting every gene tree on an edge from the D-plateau yields the optimal solution for UEC.
Proof.
Assume that n=2 and let \(G^{\prime }_{1}\) and \(G^{\prime }_{2}\) be two D-plateau rootings of G _{1} and G _{2}, respectively. It should be clear that \(\mathsf {EC}(G^{\prime }_{1},G^{\prime }_{2})=\mathsf {EC}(T)\), where \(T=(G^{\prime }_{1},G^{\prime }_{2})\). Next, by Lemma 5, EC(T) is independent of the choice of rooting of G _{1} and G _{2}, as long as the rootings are in the D-plateau. Therefore, we conclude that EC(T) is the solution to the UEC problem for G _{1} and G _{2}. This observation can be easily generalized by induction to any n.
Note that we cannot generalize the property stated in Theorem 4 to gene trees with empty edges. An example is shown in Fig. 3. Consider the dataset {G _{1},G _{2}}. G _{1} has five D-minimal rootings, while G _{2} has exactly one. In G _{2,∗} we have one ⊤-cluster, therefore G _{2,∗} with G _{1,∗}, i.e., the empty edge rooting of G _{1}, have two duplication clusters. However, the best clusterings for {G _{1},G _{2}}, having exactly one cluster, are obtained for G _{1,1}, G _{1,2} or G _{1,3}. On the other hand, the best clusterings can also be obtained for empty edge rootings, e.g. {G _{1,∗},G _{4,∗}} with cost 2 for the input {G _{1},G _{4}}. From these examples, we see that empty edges have different properties than double edges in the context of UEC, and we cannot generalize Theorem 4 to empty edges.
Theorem 5 (Candidate rootings for UEC).
For a collection of unrooted gene trees \(\mathcal G\), the solution to UEC is induced by a rooting edge e of \(G \in \mathcal G\) satisfying:

(U1) if G has a double edge, then e is any D-minimal edge in G,

(U2) if G has an empty edge, then e is an element of star S2.
Proof.
If some \(G \in \mathcal G\) has a double edge then the property follows from Theorem 4 and Lemma 5. For gene trees with an empty edge e _{∗} we show that any D-minimal rooting on an edge that is not adjacent to e _{∗} can be equivalently replaced by a rooting adjacent to e _{∗}. By using the notation from Fig. 2, let \(T_{a}=(T_{a'},T_{a^{\prime \prime }})\) such that a ^{′} and a ^{′′} are the roots of \(T_{a^{\prime }}\) and \(T_{a^{\prime \prime }}\), respectively. We show that the rooting G _{〈v,a〉} denoted by G _{ a } (see Fig. 5) has the same duplication episodes as the rooting \(G_{a^{\prime }}\) obtained for the edge 〈a,a ^{′}〉. In both rootings v is a speciation, therefore the structure of clusters present in T _{ b } and T _{ c } is the same in both rootings. The edge 〈v,a〉 is a-incoming, thus the roots are duplications mapped to ⊤. From the fact that 〈a,a ^{′}〉 is in the D-plateau we have that a is a duplication. Thus, every root and a induce the ⊤-cluster. Finally, if a ^{′′} is a duplication node, then in both rootings it will be a member of the ⊤-cluster. We proved that these two adjacent rootings have the same structure of clusters. Therefore, it is sufficient to choose the rooting G _{ a } instead of \(G_{a^{\prime }}\). This proof can be naturally extended by induction to any edge from the D-plateau.
We conclude that for a gene tree G we have at most 5 candidates for rootings. For instance, G _{4} has two stars S2 in the D-plateau, therefore we have 5 candidate rootings: the empty edge rooting G _{4,∗} and the rootings of adjacent edges G _{4,1}, G _{4,4}, G _{4,7} and G _{4,10}. Note that the clusters from G _{4,1} are equivalent to the clusters from G _{4,2} and G _{4,3}. A similar property holds for the other candidates.
Next, we show that the condition (U2) can be improved.
Lemma 6.
Under the assumptions of Theorem 5, let the set of clusters induced by the solution to UEC contain the ⊤-cluster. Then, the condition (U2) from Theorem 5 can be refined as follows:

(U2’) if e _{∗} is the empty edge in G, then e is one among at most two non-adjacent edges such that e=〈x,y〉 is adjacent to e _{∗} and M _{∗}(x)=M _{∗}(y), where M _{∗} is the lca-mapping for G _{∗}.
Proof.
Let G be a gene tree with an empty edge. Let e _{ a } be the edge from (U2’). By using the notation from Fig. 5, we compare the rootings G _{∗} and G _{〈v,a〉}, denoted here by G _{ a }. We have the following clusters in G _{∗}: the cluster C that contains c (if c is a duplication) and the cluster X that contains v (it follows from the proof of Lemma 2 that v is a duplication node). Thus, X={v}∪A∪B where A and B denote duplications from T _{ a } and T _{ b }, respectively. Note that C has the same contribution to EC in both rootings, which follows from the property that valid mappings of C are the same in both rootings. In G _{ a }, A is a subset of the ⊤-cluster whose contribution to EC is already incorporated (by the assumption). The node v is a duplication in G _{∗}. Hence, without loss of generality we assume that M _{∗}(a)=M _{∗}(v), i.e., the rooting edge 〈v,a〉 satisfies the condition from (U2’).
We have two cases depending on whether B is empty. If B is empty then G _{ a } has a “better” composition of clusters than G _{∗}, i.e., one cluster fewer than in G _{∗}, while the other clusters have the same valid mappings. Otherwise, both rootings are equivalent if M _{∗}(b)=M _{∗}(v) (B in G _{ a } has the same valid mappings as X in G _{∗}), or again G _{ a } has a better structure of clusters than G _{∗} if M _{∗}(b)≺M _{∗}(v) (valid mappings of X in G _{∗} are included in valid mappings of B in G _{ a }). Similarly, we show that G _{ a } is also better than G _{〈v,b〉} (see also rootings of G _{4} in Fig. 3).
We proved that among the three rootings from the star S2 we can choose one candidate. The second edge is obtained from the second star S2 (sharing the empty edge) if it is present in the gene tree (see Theorem 2).
From the last lemma we have at most two candidates for any gene tree from the input collection. For example, the candidate rooting G _{4,1} has more flexible valid mappings than G _{4,4}, e.g. the duplication cluster of ((c,b),a) in G _{4,1} has a larger range of possible mappings than the duplication cluster of ((d,b),a) in G _{4,4}, while the remaining two clusters have the same locations in the species tree. Hence, for the dataset {G _{3},G _{4}}, if the ⊤-cluster is present in the solution to UEC, we have two candidates G _{4,1} and G _{4,7} (which is more flexible than G _{4,10}). Note that the clustering cost 3 is obtained by rootings G _{3,∗} and G _{4,1} (or G _{4,2}, G _{4,3}).
Algorithms
Algorithm 1 presents the solution to the UEC problem. The correctness of this algorithm follows from Theorem 5 and Lemma 6. Algorithm 1 has two phases. In the first phase, for every gene tree a set of candidate rootings is prepared with respect to the conditions (U1) and (U2’). To find optimal rootings we use a linear time algorithm (procedure FindOptEdge) based on a greedy descent method that searches for a double or an empty edge in a gene tree [32]. Based on condition (U2’), we divide possible solutions into two categories depending on the presence of the ⊤-cluster in an optimal clustering. If the ⊤-cluster is not present then every gene tree has an empty edge (in line 10). Otherwise, we check every possible variant of rooting candidates. Note that from Lemma 6, a gene tree has two candidates if and only if the gene tree has two stars S2 that are included in the D-plateau. Thus, the overall time complexity depends on the presence of such trees in the input. From this observation we conclude the following result.
Theorem 6.
The time complexity of Algorithm 1 is \(O(2^{k}(\sum _{i} |G_{i}| + |S|))\), where k is the number of input gene trees having two stars S2 that are included in the D-plateau.
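The source of the factor 2^{k} can be illustrated by a sketch of the exhaustive phase only. This is not the paper's Algorithm 1: the candidate rootings are assumed to be precomputed, the rooted EC solver is passed in as a hypothetical callable `rooted_ec`, and the toy solver used in the example is a stand-in with no biological meaning.

```python
from itertools import product

def solve_uec_exhaustive(candidates, rooted_ec):
    """candidates[i]: list of candidate rootings of the i-th gene tree (1 or 2 entries).

    rooted_ec: caller-supplied solver for the rooted EC problem (hypothetical).
    Returns (best cost, chosen rootings) over all combinations of candidates."""
    best = None
    for choice in product(*candidates):          # 2^k combinations if k trees have 2 candidates
        cost = rooted_ec(list(choice))
        if best is None or cost < best[0]:
            best = (cost, choice)
    return best

# Toy stand-in for a rooted EC solver: counts distinct rooting labels.
toy_ec = lambda rootings: len(set(rootings))
candidates = [["r1"], ["r1", "r2"], ["r2", "r3"]]    # k = 2 trees with two candidates
print(solve_uec_exhaustive(candidates, toy_ec))
```

Each call to `rooted_ec` corresponds to one rooted EC instance, so with a linear-time rooted solver the total work is proportional to the number of combinations times the input size.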
Thus, from a theoretical point of view UEC is fixed-parameter tractable. Later we show that k usually represents a small fraction (up to 5 %) of the whole input. For the cases when 2^{k} is still too large for efficient computation, we propose Algorithm 2, in which we first solve the instance of UEC for the collection of gene trees that have a unique candidate. Clearly, if there are rootings of the whole input that have the same cost, then this cost is optimal. The overall complexity of Algorithm 2 is the same as that of Algorithm 1; however, for large datasets this strategy proved successful after checking just one additional candidate set (in lines 2–4).
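The strategy described for Algorithm 2 can be sketched as follows. This is only an interpretation of the description in the text, not a reproduction of the paper's Algorithm 2: `rooted_ec` is again a hypothetical caller-supplied rooted EC solver, and the sketch relies on the observation stated above that the cost of the unique-candidate subset is a lower bound for the whole input.

```python
from itertools import product

def solve_uec_incremental(candidates, rooted_ec):
    """First cluster the trees with a unique candidate, then try to match that cost."""
    unique = [c[0] for c in candidates if len(c) == 1]
    hard = [c for c in candidates if len(c) > 1]
    lower_bound = rooted_ec(unique)              # cost of the easy part of the input
    best = None
    for choice in product(*hard):
        rootings = unique + list(choice)
        cost = rooted_ec(rootings)
        if best is None or cost < best[0]:
            best = (cost, rootings)
        if cost == lower_bound:                  # cannot be improved, stop early
            break
    return best

# Toy stand-in for a rooted EC solver: counts distinct rooting labels.
toy_ec = lambda rootings: len(set(rootings))
print(solve_uec_incremental([["r1"], ["r1", "r2"]], toy_ec))
```

In the worst case the loop still visits every combination, which matches the remark that the overall complexity is unchanged; the gain comes from stopping as soon as the lower bound is reached.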
Experiments
We performed several computational experiments on three empirical datasets.
The Guigó dataset consists of 53 rooted gene trees from 16 Eukaryotes from [10]. This dataset was evaluated with 71 species trees from [35], known to have the total minimal duplication cost. Génolevures is a dataset of 4144 gene trees [33] from nine yeast genomes [36] and two species trees: one from [37] and the second one having the lowest duplication-loss cost computed by Fasturec [38]. The third dataset, TreeFam, spanning 25 mostly animal species, consists of 1274 curated gene family trees from TreeFam v7.0 [39]. The species tree for TreeFam is based on NCBI taxonomy.
We implemented our algorithms and the algorithms for the rooted variant of EC Problem (based on [29]). In our experiments the rooting candidates were used to compare the results for UEC with the model of mappings (for rooted gene trees) proposed in [28].
We performed two series of 74 computational experiments, one for our model and one with the model described in [28]. The total running time of our program was about 7 minutes on a standard PC workstation. For every dataset we were able to find solutions to UEC by testing at most two rooted instances of input gene trees (see Algorithm 2). The summary of experiments is depicted in Table 1.
For the Guigó dataset we found four duplication clusters, while for the rooted model from [28] we located five clusters. The difference can be explained by the properties of our model that is more flexible: the input trees are unrooted and the model of valid mappings is more generic. Observe that this dataset has unique rooting candidates (k=0).
Génolevures is the most complex dataset due to its size and potentially large parameter k. Despite these properties, Algorithm 2 located 17 clusters for the filtered input with all unique rooting candidates. Since a rooted binary species tree on nine species has 2·9−1 = 17 nodes, this means that in the filtered dataset a duplication cluster is present in every node of the species tree. Consequently, the whole input dataset has the same property. The same holds for the model from [28].
In TreeFam we located 45 clusters for the filtered dataset with unique rooting candidates. Then, Algorithm 2 found the solution having the same cost for the whole dataset (see Fig. 6). The same result was obtained for the model from [28] (see Table 1).
Conclusions
In this article we presented the first solution to the open problem of duplication episode clustering for the case when the input collection is composed of unrooted gene trees. By using theoretical properties of unrooted reconciliation, we proved that the problem has useful mathematical and computational properties. From a practical point of view, we were able to provide efficient algorithms and tools that were successfully applied to locate duplication clusters in real datasets.
From the computational point of view, the complexity of our algorithms depends on the parameter k, i.e., in the worst case the EC Problem has to be solved 2^{k} times in order to find a solution to UEC. Even if k usually represents a small fraction of the whole input, it can still be large, e.g. k>100 for the yeast dataset, which may prohibit computation of all possible variants. Here we proposed a solution based on the observation that the clustering induced from the input gene trees having unique candidates (that is, without the k gene trees with non-unique variants) usually represents an optimal solution for the whole input. Thus, the strategy that we applied in Algorithm 2, i.e., first cluster the easy part and then try to incorporate the hard one by using already identified clusters, proved successful even for potentially complex datasets.
Our computational experiments show that the duplication clusters usually cover large parts of the species tree, especially when the input dataset consists of thousands of gene trees. To provide more detailed information on the duplication clusters, we plan to study the minimal episode problem (ME), which is a natural extension of the episode clustering problem. In the future we also plan to extend the episode clustering problem by using other types of valid mappings.
Our software for solving the unrooted episode clustering problem is publicly available at http://www.mimuw.edu.pl/jpaszek/uec.php.
Abbreviations
D: gene duplication
DL: gene duplication and loss
EC: episode clustering for rooted gene trees
lca: least common ancestor
UEC: episode clustering for unrooted gene trees
References
Kellis M, Birren BW, Lander ES. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004; 428:617–24.
Guyot R, Keller B. Ancestral genome duplication in rice. Genome. 2004; 47(3):610–4.
Vision TJ, Brown DG, Tanksley SD. The origins of genomic duplications in Arabidopsis. Science. 2000; 290(5499):2114–7.
Costantino L, Sotiriou SK, Rantala JK, Magin S, Mladenov E, Helleday T, et al. Break-induced replication repair of damaged forks induces genomic duplications in human cells. Science. 2014; 343(6166):88–91.
Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, et al. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature. 2006; 444(7116):171–8.
Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, et al. Widespread genome duplications throughout the history of flowering plants. Genome Res. 2006; 16(6):738–49.
Van de Peer Y, Maere S, Meyer A. The evolutionary significance of ancient genome duplications. Nat Rev Genet. 2009; 10(10):725–32.
Page RDM. Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst Biol. 1994; 43(1):58–77.
Mirkin B, Muchnik I, Smith TF. A biologically consistent model for comparing molecular phylogenies. J Comput Biol. 1995; 2(4):493–507.
Guigó R, Muchnik IB, Smith TF. Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol. 1996; 6(2):189–213.
Page RDM. Extracting species trees from complex gene trees: reconciled trees and vertebrate phylogeny. Mol Phylogenet Evol. 2000; 14:89–106.
Arvestad L, Berglund AC, Lagergren J, Sennblad B. Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics. 2003; 19(Suppl 1):i7–15.
Bonizzoni P, Della Vedova G, Dondi R. Reconciling a gene tree to a species tree under the duplication cost model. Theor Comput Sci. 2005; 347:36–53.
Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool. 1979; 28(2):132–63.
Górecki P, Tiuryn J. DLS-trees: a model of evolutionary scenarios. Theor Comput Sci. 2006; 359:378–99.
Arvestad L, Lagergren J, Sennblad B. The gene evolution model and computing its associated probabilities. J ACM. 2009; 56(2):1–44.
Doyon JP, Chauve C, Hamel S. Space of gene/species tree reconciliations and parsimonious models. J Comput Biol. 2009; 16:1399–1418.
Durand D, Halldórsson BV, Vernot B. A hybrid micro-macroevolutionary approach to gene tree reconstruction. J Comput Biol. 2006; 13(2):320–35.
Hallett MT, Lagergren J. Efficient Algorithms for Lateral Gene Transfer Problems. In: Proceedings of the Fifth Annual International Conference on Computational Biology. RECOMB ’01. New York, NY, USA: ACM: 2001. p. 149–156.
In: Bourque G, El-Mabrouk N, editors. Comparative Genomics, RECOMB 2006 International Workshop, RCG 2006, Montreal, Canada, September 24–26, 2006, Proceedings. vol. 4205 of Lect Notes Comput Sc. Berlin, Germany: Springer; 2006.
Ma B, Li M, Zhang L. From gene trees to species trees. SIAM J Comput. 2000; 30(3):729–52.
Sjostrand J, Tofigh A, Daubin V, Arvestad L, Sennblad B, Lagergren J. A Bayesian method for analyzing lateral gene transfer. Syst Biol. 2014; 63(3):409–20.
Stolzer M, Lai H, Xu M, Sathaye D, Vernot B, Durand D. Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees. Bioinformatics. 2012; 28(18):i409–15.
Zhang L. From gene trees to species trees II: species tree inference by minimizing deep coalescence events. IEEE/ACM Trans Comput Biol Bioinform. 2011; 8(6):1685–91.
Page RDM, Cotton JA. Vertebrate phylogenomics: reconciled trees and gene duplications. Pac Symp Biocomput. 2002; 536–47.
Fellows M, Hallett M, Stege U. On the Multiple Gene Duplication Problem. In: 9th International Symposium on Algorithms and Computation (ISAAC’98), Lecture Notes in Computer Science 1533. Taejon, Korea: Springer Berlin Heidelberg: 1998. p. 347–356.
Burleigh JG, Bansal MS, Wehe A, Eulenstein O. Locating Multiple Gene Duplications through Reconciled Trees. In: Vingron M, Wong L, editors. RECOMB. vol. 4955 of Lect Notes Comput Sc. Berlin, Germany: Springer: 2008. p. 273–284.
Bansal MS, Eulenstein O. The multiple gene duplication problem revisited. Bioinformatics. 2008; 24(13):i132–8.
Luo CW, Chen MC, Chen YC, Yang RWL, Liu HF, Chao KM. Linear-time algorithms for the multiple gene duplication problems. IEEE/ACM Trans Comput Biol Bioinform. 2011; 8(1):260–5.
Holland BR, Penny D, Hendy MD. Outgroup misplacement and phylogenetic inaccuracy under a molecular clock – a simulation study. Syst Biol. 2003; 52:229–38.
Huelsenbeck JP, Bollback JP, Levine AM. Inferring the Root of a Phylogenetic Tree. Syst Biol. 2002; 51(1):32–43.
Górecki P, Tiuryn J. Inferring phylogeny from whole genomes. Bioinformatics. 2007; 23(2):e116–22.
Górecki P, Eulenstein O. Algorithms: simultaneous error-correction and rooting for gene tree reconciliation and the gene duplication problem. BMC Bioinformatics. 2012; 13(Suppl 10):S14.
Górecki P, Eulenstein O, Tiuryn J. Unrooted tree reconciliation: a unified approach. IEEE/ACM Trans Comput Biol Bioinform. 2013; 10(2):522–36.
Chang W, Górecki P, Eulenstein O. Exact solutions for species tree inference from discordant gene trees. J Bioinform Comput Biol. 2013; 11(5):1342005.
Sherman DJ, Martin T, Nikolski M, Cayla C, Souciet JL, Durrens P. Génolevures: protein families and synteny among complete hemiascomycetous yeast proteomes and genomes. Nucleic Acids Res. 2009; 37(suppl 1):D550–4.
Dujon B. Yeasts illustrate the molecular mechanisms of eukaryotic genome evolution. Trends Genet. 2006; 22(7):375–87.
Górecki P, Eulenstein O. GTP supertrees from unrooted gene trees: linear time algorithms for NNI-based local searches. Lect Notes Comput Sc. 2012; 7292:83–105.
Ruan J, Li H, Chen Z, Coghlan A, Coin LJ, Guo Y, et al. TreeFam 2008 Update. Nucleic Acids Res. 2008; 36:D735–40.
Page RDM, Charleston MA. Reconciled trees and incongruent gene and species trees. In: Mirkin B, McMorris FR, Roberts FS, Rzhetsky A, editors. Mathematical Hierarchies in Biology, American Mathematical Society, Providence, Rhode Island: 1997. p. 57–70.
Acknowledgements
We would like to thank the three reviewers for their detailed comments that allowed us to improve our paper. JP and PG were supported by the grant of NCN #2011/01/B/ST6/02777. JP was supported by the DSM funding for young researchers of the Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JP and PG contributed equally to the writing of the paper. Both authors read and approved the final manuscript. JP implemented algorithms and performed all computational experiments.
Declarations
The publication costs for this article were funded by the Polish Ministry of Science and Higher Education funding for Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw.
This article has been published as part of BMC Genomics Volume 17 Supplement 1, 2016: Selected articles from the Fourteenth Asia Pacific Bioinformatics Conference (APBC 2016): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/17/S1.
From The Fourteenth Asia Pacific Bioinformatics Conference (APBC 2016), San Francisco, CA, USA. 11–13 January 2016
Rights and permissions
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Paszek, J., Górecki, P. Genomic duplication problems for unrooted gene trees. BMC Genomics 17 (Suppl 1), 15 (2016). https://doi.org/10.1186/s12864-015-2308-4