Efficient algorithms for reconciling gene trees and species networks via duplication and loss events
BMC Genomics volume 16, Article number: S6 (2015)
Abstract
Reconciliation methods explain topology differences between a species tree and a gene tree by evolutionary events other than speciations. However, not all phylogenies are trees: hybridization can occur and create new species, which results in reticulate phylogenies. Here, we consider the problem of reconciling a gene tree with a species network via duplication and loss events. Two variants are proposed and solved with efficient algorithms: the first one finds the best tree in the network with which to reconcile the gene tree, and the second one finds the best reconciliation between the gene tree and the whole network.
Background
Reconciliations explain topology incompatibilities between a species tree and a gene tree by evolutionary events, other than speciation, that affect genes [[1], for a review]. However, not all phylogenies are trees: indeed, hybridization can occur and create new species [2], which results in reticulate phylogenies, i.e. species (phylogenetic) networks [3]. In [4], the authors presented a first contribution toward solving a problem similar to the reconciliation problem, namely the cophylogeny problem [5–8], on networks. In their article, they first propose a polynomial algorithm to solve this problem on dated host trees, taking into account codivergence, duplication, host switching, and loss events. This model is similar to the DTL model in gene tree reconciliation, which takes into account speciation, duplication, transfer, and loss events [[9], among others]. However, when extending the cophylogeny problem to species networks, their model may not be the most pertinent one for the DTL problem. Indeed, in [4] the parasite tree that is "reconciled" with the host network can take any path in the latter, modeling the fact that some hybridization species can receive the parasites of both parents. In the problem of gene tree reconciliation, their model is better suited to recent hybridizations, where the genes still keep a trace of the polyploidy due to the hybridization. For ancient hybridizations, however, the polyploidy of the extant species has been reduced, and a model where each gene of a hybridization species can be inherited from at most one of its two parents is more pertinent. In other words, for solving the latter problem, we are interested in finding a tree that is "displayed" by the species network such that its reconciliation with a given gene tree is optimum.
We propose an efficient algorithm that takes into account duplication and loss events whose complexity does not depend on the number of hybridization events in the species network but only on the level of the network, where the level is a measure of how much the network is "tangled". Moreover, we propose a faster algorithm solving the problem described in [4] when restricting to duplication and loss events (that is, host switching is not taken into account).
Basic notions
We start by giving some basic definitions that will be useful in the paper.
Definition 1 (Rooted phylogenetic network) A rooted phylogenetic network N on a label set $\mathcal{X}$ is a rooted directed acyclic graph with a single root where each outdegree-0 node (a leaf) is labelled by an element of $\mathcal{X}$. The root of N, denoted by r(N), has indegree 0 and outdegree 2. All other internal nodes have either indegree 1 and outdegree 2 (speciation node), or indegree 2 and outdegree 1 (hybridization node).
Denote by V(N), I(N), E(N), L(N) and $\mathcal{L}\left(N\right)$ respectively the set of nodes, internal nodes (nodes with outdegree greater than 0), edges, leaves and leaf labels of N. The size of N, denoted by |N|, is equal to |V(N)| + |E(N)|. Given x in V(N), we denote by N_{ x } the subnetwork of N rooted at x, i.e. the subgraph of N consisting of all edges and nodes reachable from x. If x is a leaf of N, we denote by s(x) the label of x in $\mathcal{L}\left(N\right)$. If x is a speciation node, we denote by p(x) the only parent of x.
Given two nodes x and y of N, we say that x is lower or equal to y in N, denoted by x ≤_{ N } y (resp. lower, denoted by x <_{ N } y), if and only if there exists a path (possibly reduced to a single node) in N from y to x (resp. and x ≠ y). If x ≤_{ N } y, then, for every path p from y to x, denote by length(p) the number of speciation nodes z of p such that x <_{ N } z ≤_{ N } y. If N is a tree, then, for every two nodes u, v of N, LCA_{ N }(u, v) [10] is the lowest node of N that is above or equal to both u and v.
Given a speciation node x in N, two paths of N starting from x are said to be separated if each path contains a different child of x. Let x, y be two nodes of N. Denote by $\mathcal{M}_{N}(x,y)$ the set of nodes z of N such that there exist two separated paths in N from z to x and y. For example, in Figure 1(d), $\mathcal{M}_{N'}(x,y)=\{m_{1},m_{2},m_{3}\}$. Note that all nodes in $\mathcal{M}_{N}(x,y)$ are speciation nodes and, when N is a tree and x, y are not comparable, $\mathcal{M}_{N}(x,y)$ contains exactly one node, which coincides with LCA_{ N }(x, y).
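As an illustration of the set $\mathcal{M}_{N}(x,y)$, the following minimal Python sketch enumerates it by brute force. The network `NET`, its node names and the dictionary encoding `children` (node to tuple of children) are assumptions made for this example only and are not taken from Figure 1. A node is reported when it has two children, one of which reaches x and the other y; the root is treated like any other node with two children.

```python
def descendants(children, v):
    """Return the set of nodes reachable from v, v included."""
    seen, stack = {v}, [v]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen


def separated_ancestors(children, x, y):
    """Brute-force M_N(x, y): nodes z with two separated paths to x and to y."""
    result = set()
    for z, kids in children.items():
        if len(kids) != 2:
            continue  # a node with a single child cannot start separated paths
        below_a, below_b = (descendants(children, k) for k in kids)
        if (x in below_a and y in below_b) or (x in below_b and y in below_a):
            result.add(z)
    return result


# Hypothetical level-1 network: h is a hybridization node with parents a and b.
NET = {"r": ("a", "b"), "a": ("x", "h"), "b": ("h", "y"), "h": ("w",)}

m_xy = separated_ancestors(NET, "x", "y")   # only the root qualifies
m_xw = separated_ancestors(NET, "x", "w")   # both r and a qualify
```

In this toy network `m_xw` contains two nodes, which shows how, unlike in a tree, several nodes can play the role of the LCA of two nodes.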
If every biconnected component of N has at most k hybridization nodes, we say that N is of level-k [11]. A rooted phylogenetic tree is a rooted phylogenetic network with no hybridization nodes, i.e. a level-0 network. In the following, we will refer to rooted phylogenetic networks and rooted phylogenetic trees simply as networks and trees, respectively. In this paper, we allow trees to contain artificial nodes, i.e. nodes with indegree and outdegree 1.
Let B be a biconnected component of a network N. Then B contains exactly one node r(B) without ancestors in B [[12], Lemma 5.3]; we call r(B) the root of B. If B is not trivial, i.e. B consists of more than one node, we can contract it by removing all nodes of B other than r(B) and then connecting r(B) to every node with indegree 0 created by this removal. Then the following definitions are well-posed.
Definition 2 (Tree bc(N)) Given a network N, the tree bc(N) is obtained from N by contracting all its biconnected components.
For example, Figure 1(a, b) shows respectively a level-2 network N and its associated tree bc(N).
Denote by $\stackrel{\circ}{B}$ the node in bc(N) that corresponds to a biconnected component B in N. Given two biconnected components B_{ i }, B_{ j }, we say that B_{ i } ≤_{ N } B_{ j } (resp. B_{ i } <_{ N } B_{ j }) if and only if $\stackrel{\circ}{B_{i}} \le_{bc(N)} \stackrel{\circ}{B_{j}}$ (resp. $\stackrel{\circ}{B_{i}} <_{bc(N)} \stackrel{\circ}{B_{j}}$). We say that B_{ i } is the parent (resp. a child) of B_{ j } if $\stackrel{\circ}{B_{i}}$ is the parent (resp. a child) of $\stackrel{\circ}{B_{j}}$ in bc(N). We also denote by LCA_{ N }(B_{ i }, B_{ j }) the biconnected component corresponding to $LCA_{bc(N)}(\stackrel{\circ}{B_{i}},\stackrel{\circ}{B_{j}})$ in N.
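The level of a network can be checked mechanically. The sketch below is a plain recursive version of the classical depth-first-search computation of biconnected components on the underlying undirected graph, followed by a count of hybridization nodes per component. The `children` encoding and the three small networks are assumptions of this example (they are not the networks of Figure 1), parallel edges are assumed not to occur, and the recursion is only meant for small inputs.

```python
def biconnected_components(children):
    """Edge sets of the biconnected components of the underlying undirected graph."""
    adj = {}
    for p, kids in children.items():
        for c in kids:
            adj.setdefault(p, []).append(c)
            adj.setdefault(c, []).append(p)
    index, low, stack, comps = {}, {}, [], []

    def dfs(v, parent):
        index[v] = low[v] = len(index)
        for w in adj[v]:
            if w not in index:                      # tree edge
                stack.append((v, w))
                dfs(w, v)
                low[v] = min(low[v], low[w])
                if low[w] >= index[v]:              # v closes a component
                    comp = set()
                    while True:
                        edge = stack.pop()
                        comp.add(frozenset(edge))
                        if edge == (v, w):
                            break
                    comps.append(comp)
            elif w != parent and index[w] < index[v]:   # back edge
                stack.append((v, w))
                low[v] = min(low[v], index[w])

    for v in adj:
        if v not in index:
            dfs(v, None)
    return comps


def level(children):
    """Maximum number of hybridization nodes inside one biconnected component."""
    parents = {}
    for p, kids in children.items():
        for c in kids:
            parents.setdefault(c, []).append(p)
    hybrids = [v for v, ps in parents.items() if len(ps) == 2]
    best = 0
    for comp in biconnected_components(children):
        count = sum(1 for h in hybrids if frozenset((parents[h][0], h)) in comp)
        best = max(best, count)
    return best


# Hypothetical inputs: a tree, a level-1 network and a level-2 network.
TREE = {"r": ("a", "c"), "a": ("x", "y")}
NET = {"r": ("a", "b"), "a": ("x", "h"), "b": ("h", "y"), "h": ("w",)}
NET2 = {"r": ("a", "b"), "a": ("c", "t"), "c": ("x", "h1"), "t": ("h1", "h2"),
        "b": ("h2", "y"), "h1": ("w1",), "h2": ("w2",)}
levels = [level(TREE), level(NET), level(NET2)]
```

In `NET2` the two cycles share the edge between a and t, so both hybridization nodes fall in the same biconnected component.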
Definition 3 (Elementary network) Given a network N, each biconnected component B that is not a leaf of N defines an elementary network, denoted by N(B), consisting of B and all cut-edges coming out from B.
Definition 4 (Switchings of a network [13]) Given a network N, a switching S of N is obtained from N by choosing, for each hybridization node, an incoming edge to switch on and the other to switch off. Once this is done, we also switch off all switched-on edges with the target node having only switched-off outgoing edges (see Figure 1(e) for an example). For each biconnected component B of N, we also denote by S(B) the subgraph of S restricted to N(B).
Switchings will be used in the next section to model gene histories for genes evolving in a species network. For example, Figure 1 presents in (c) a switching of the level-2 network in (a). We denote by V_{ on }(S) the set of nodes of S that are not an endpoint of any switched-off edge. A path of S is a path of N that uses only switched-on edges.
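Definition 4 can be turned into a small enumeration routine. The following is a minimal sketch, assuming the network is given as a hypothetical `children` dictionary; `NET2` is an invented level-2 example, not the network of Figure 1. Each switching is returned as the set of its switched-on edges, and leaves are never treated as nodes "having only switched-off outgoing edges".

```python
from itertools import product


def switchings(children):
    """Yield each switching of the network as a frozenset of switched-on edges."""
    edges = {(p, c) for p, kids in children.items() for c in kids}
    parents = {}
    for p, c in sorted(edges):
        parents.setdefault(c, []).append(p)
    hybrids = sorted(v for v, ps in parents.items() if len(ps) == 2)
    for kept in product(*(parents[h] for h in hybrids)):
        on = set(edges)
        # Step 1: at each hybridization node keep one incoming edge, drop the other.
        for h, keep in zip(hybrids, kept):
            on -= {(p, h) for p in parents[h] if p != keep}
        # Step 2: repeatedly switch off edges whose target is an internal node
        # with no switched-on outgoing edge left.
        changed = True
        while changed:
            changed = False
            for p, c in sorted(on):
                kids = children.get(c, ())
                if kids and not any((c, k) in on for k in kids):
                    on.discard((p, c))
                    changed = True
        yield frozenset(on)


# Hypothetical level-2 network: node t has two hybridization nodes as children.
NET2 = {"r": ("a", "b"), "a": ("c", "t"), "c": ("x", "h1"), "t": ("h1", "h2"),
        "b": ("h2", "y"), "h1": ("w1",), "h2": ("w2",)}
ALL = list(switchings(NET2))
```

With two hybridization nodes there are four switchings. In the one where both edges leaving t are switched off, the edge from a to t is switched off as well, as required by the second sentence of Definition 4.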
Hereafter, G will denote a tree and N a network such that there is a bijection between L(N) and $\mathcal{L}\left(N\right)$ and $\mathcal{L}\left(G\right)\subseteq \mathcal{L}\left(N\right)$. In the gene tree reconciliation problem, G represents a gene tree such that each leaf corresponds to a contemporary gene and is labeled by the species containing this gene, while N is a species network such that each leaf represents an extant species. In the cophylogeny problem, G represents a parasite tree such that each leaf corresponds to a parasite species that is labeled by the species that hosts it, while N is a host network such that each leaf represents an extant species.
Reconciliations
We will now extend the definition of $\mathbb{D}\mathbb{L}$ reconciliation in [14] to networks. In a $\mathbb{D}\mathbb{L}$ reconciliation, each node of G is associated to a node of N and to an event (a speciation $\left(\mathbb{S}\right)$, a duplication $\left(\mathbb{D}\right)$ or a contemporary event $\left(\mathbb{C}\right)$) under some constraints. A contemporary event $\mathbb{C}$ associates a leaf u of G with a leaf x of N such that s(u) = s(x). A speciation in a node u of G is constrained to the existence of two separated paths from the mapping of u to the mappings of its two children, while the only constraint given by a duplication event is that evolution of G cannot go back in time. More formally:
Definition 5 (Reconciliation) Given a tree G and a network N such that $\mathcal{L}\left(G\right)\subseteq \mathcal{L}\left(N\right)$, a reconciliation between G and N is a function α that maps each node u of G to a pair (α_{ r }(u), α_{ e }(u)) where α_{ r }(u) is a node of V(N) and α_{ e }(u) is an event of type $\mathbb{S}$, $\mathbb{D}$ or $\mathbb{C}$, such that:

$\alpha_{e}(u)=\mathbb{C}$ if and only if u ∈ L(G), α_{ r }(u) ∈ L(N) and s(u) = s(α_{ r }(u));

for every u ∈ I(G) with child nodes {u_{1}, u_{2}}, if $\alpha_{e}(u)=\mathbb{S}$, then $\alpha_{r}(u)\in \mathcal{M}_{N}(\alpha_{r}(u_{1}),\alpha_{r}(u_{2}))$;

for any two nodes u, v of V(G) such that v <_{ G } u, if ${\alpha}_{e}\left(u\right)=\mathbb{D}$, then α_{ r }(v) ≤_{ N } α_{ r }(u). Otherwise, α_{ r }(v) <_{ N } α_{ r }(u).
Note that, if N is a tree, this definition coincides with the one given in [14].
The number of duplications of α, denoted by d(α), is the number of nodes u of G such that $\alpha_{e}(u)=\mathbb{D}$. Since in a network there can be several paths between two nodes, we count the number of losses on shortest paths, as done in [4]. In more detail, given two nodes x, y of N, the distance between x and y, denoted by dist(x, y), is defined as follows:

If y ≤_{ N } x, then dist(x, y) = min_{ p } length(p) over all possible paths p from x to y;

otherwise, dist(x, y) = +∞.
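The two cases above can be sketched as a short memoised recursion over the network. This is only an illustration: the `children` encoding and the network `NET` are assumptions of the example, and every node with two children (including the root) is counted as a speciation node, whereas hybridization nodes contribute nothing to the length of a path.

```python
import math
from functools import lru_cache


def make_dist(children):
    """Return dist(x, y): fewest speciation nodes z with y < z <= x over all paths."""

    @lru_cache(maxsize=None)
    def dist(x, y):
        if x == y:
            return 0
        kids = children.get(x, ())
        weight = 1 if len(kids) == 2 else 0      # hybridization nodes are not counted
        best = min((dist(c, y) for c in kids), default=math.inf)
        return weight + best                     # stays infinite if y is not below x

    return dist


# Hypothetical level-1 network: h is a hybridization node with parents a and b.
NET = {"r": ("a", "b"), "a": ("x", "h"), "b": ("h", "y"), "h": ("w",)}
dist = make_dist(NET)
d_rw = dist("r", "w")    # r and a (or r and b) are counted, h is not
d_xa = dist("x", "a")    # a is not below x
```

Both paths from r to w cross two speciation nodes, so `d_rw` is 2, while `d_xa` is infinite.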
Then, for every u ∈ I(G) with child nodes {u_{1}, u_{2}}, the number of losses associated with u in a reconciliation α, denoted by l_{ α }(u), is defined as follows:

if ${\alpha}_{e}\left(u\right)=\mathbb{S}$, then l_{ α }(u) = min{dist(x_{1}, α_{ r }(u_{1})) + dist(x_{2}, α_{ r }(u_{2})), dist(x_{1}, α_{ r }(u_{2}))+dist(x_{2}, α_{ r }(u_{1}))} where x_{1}, x_{2} are the two children of α_{ r }(u);

if ${\alpha}_{e}\left(u\right)=\mathbb{D}$, then l_{ α }(u) = dist(α_{ r }(u), α_{ r }(u_{1})) + dist(α_{ r }(u), α_{ r }(u_{2})).
The number of losses of a reconciliation α, denoted by l(α), is the sum of l_{ α }(·) for all internal nodes of G. Thus, the cost of α, denoted by cost(α), is d(α)·δ + l(α)·λ, where δ and λ are respectively the cost of a duplication and a loss event. We use cost(G,N) to denote the minimum cost over all reconciliations between G and N. A reconciliation having the minimum cost is called a most parsimonious reconciliation.
Let S be a switching of N; then a reconciliation between G and S is defined similarly to Definition 5 except that only switched-on edges are considered when defining paths, and only nodes in V_{ on }(S) are counted for calculating the length of the shortest path in the definition of dist. This is done to model the fact that, since each gene of a hybridization species is inherited from one of its two parents, we should not count as a loss the fact that the other parent does not contribute. Moreover, note that, for every two nodes x, y of S such that x ≤_{ S } y, there is a unique path from y to x in S.
When N is a tree, there is a unique reconciliation (the LCA reconciliation) between G and N which has minimum cost, and this reconciliation can be found in O(|G|) time [15–18] as follows:

for each node u in L(G), α_{ r }(u) is defined as the only node x of N such that s(u) = s(x), and $\alpha_{e}(u)=\mathbb{C}$;

for each node u in I(G) with child nodes {u_{1}, u_{2}}, α_{ r }(u) = LCA_{ N }(α_{ r }(u_{1}), α_{ r }(u_{2})); if α_{ r }(u_{1}) ≤_{ N }α_{ r }(u_{2}) or α_{ r }(u_{2}) ≤_{ N } α_{ r }(u_{1}) then ${\alpha}_{e}\left(u\right)=\mathbb{D}$; otherwise ${\alpha}_{e}\left(u\right)=\mathbb{S}$.
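The two rules above can be sketched in a few lines of Python. This is a minimal illustration and not the linear-time procedure of [15–18]: the LCA is found by naive climbing, the species tree is assumed to be binary and given as a hypothetical child-to-parent dictionary, and the gene tree is written as nested pairs whose leaves are species labels. Losses follow the definition of l_{ α }(u), using the fact that in a binary tree dist reduces to a difference of depths.

```python
def lca_reconciliation(parent, gene_tree, delta=1.0, lam=1.0):
    """LCA reconciliation of a binary gene tree with a binary species tree.

    parent: child -> parent map of the species tree (the root is not a key).
    gene_tree: nested pairs whose leaves are species labels (strings).
    Returns (number of duplications, number of losses, cost).
    """
    depth = {}

    def d(v):
        if v not in depth:
            depth[v] = d(parent[v]) + 1 if v in parent else 0
        return depth[v]

    def lca(a, b):
        while a != b:
            if d(a) >= d(b):
                a = parent[a]
            else:
                b = parent[b]
        return a

    dups = losses = 0

    def visit(g):
        nonlocal dups, losses
        if isinstance(g, str):
            return g                                  # contemporary event
        m1, m2 = visit(g[0]), visit(g[1])
        m = lca(m1, m2)
        if m in (m1, m2):                             # comparable mappings: duplication
            dups += 1
            losses += (d(m1) - d(m)) + (d(m2) - d(m))
        else:                                         # speciation: start below m
            losses += (d(m1) - d(m) - 1) + (d(m2) - d(m) - 1)
        return m

    visit(gene_tree)
    return dups, losses, dups * delta + losses * lam


# Hypothetical species tree ((A,B)ab,C)r and two gene trees.
SPECIES = {"A": "ab", "B": "ab", "ab": "r", "C": "r"}
same_topology = lca_reconciliation(SPECIES, (("A", "B"), "C"))
conflicting = lca_reconciliation(SPECIES, (("A", "C"), ("B", "C")), delta=2.0, lam=1.0)
```

The first gene tree agrees with the species tree and costs nothing; the second needs one duplication at the root and two losses, for a cost of 2·1 + 1·2 = 4 with the chosen δ and λ.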
In the LCA reconciliation, the mapping α_{ e } is totally defined by α_{ r }, hence it can be omitted, and we will use α to refer to α_{ r }. Note that the algorithms used on trees to find the LCA reconciliation can also be used on switchings, which, when only switched-on edges are considered, are actually trees. Hereafter, when we refer to the reconciliation between a tree and a switching, we refer to the LCA reconciliation between them. The problems we are interested in here are defined as follows:
Problem 1 BEST SWITCHING FOR THE $\mathbb{D}\mathbb{L}$ MODEL
Input A tree G, a network N such that $\mathcal{L}\left(G\right)\subseteq \mathcal{L}\left(N\right)$, and positive costs δ and λ for respectively $\mathbb{D}$ and $\mathbb{L}$ events.
Output A switching S of N such that cost(G, S) is minimum over all switchings of N.
Problem 2 MINIMUM $\mathbb{D}\mathbb{L}$ RECONCILIATION ON NETWORKS
Input A tree G, a network N such that $\mathcal{L}\left(G\right)\subseteq \mathcal{L}\left(N\right)$, and positive costs δ and λ for respectively $\mathbb{D}$ and $\mathbb{L}$ events.
Output A minimum reconciliation between G and N.
Remark 1 For the sake of simplicity, we suppose that G does not contain any internal node u such that $\left|\mathcal{L}\left(G_{u}\right)\right| = 1$ (i.e. all nodes of G_{ u } are mapped to a leaf of N). If this is not the case, we can simplify the instance by replacing in G each such subtree G_{ u } by a leaf labeled by the unique label in $\mathcal{L}\left(G_{u}\right)$ and compute a reconciliation for the simplified tree G'. A parsimonious reconciliation for G can be easily obtained from a parsimonious one for G'.
Method
Best switching
We start by proving that finding the best switching for the $\mathbb{D}\mathbb{L}$ model is NP-hard:
Theorem 1 Problem 1 is NP-hard.
Proof: To prove the theorem, we reduce the TREE CONTAINMENT problem, which is NP-hard [19], to Problem 1. The TREE CONTAINMENT problem asks the following: "Given a network N and a tree T, both with their leaf sets bijectively labeled by a label set $\mathcal{X}$, is there a switching S of N such that T can be obtained from S by deleting all switched-off edges and all nodes with indegree and outdegree 1?". Now, assume there is an algorithm $\mathcal{A}$ that solves Problem 1 in polynomial time. Then, it is easy to see that, since λ, δ > 0, there is a solution of Problem 1 with cost 0 if and only if G is contained in N. Therefore, $\mathcal{A}$ would provide a polynomial-time algorithm to solve the TREE CONTAINMENT problem, which is impossible unless P = NP.
In the following, we present a fixed-parameter tractable algorithm [20] in the level of the network to solve Problem 1. To do so, we need to introduce some notations.
Definition 6 (Mapping B ) For every u ∈ V(G), B(u) is defined as the lowest biconnected component B of N such that ℒ(N_{r(B)}) contains ℒ(G_{ u }).
Then the following remark holds:
Remark 2 If u ∈ L(G), then B(u) is the only leaf of N such that s(u) = s(B(u)). If u ∈ I(G), then B(u) = LCA_{ N }(B(u_{1}), B(u_{2})) where u_{1}, u_{2} are the child nodes of u in G.
We denote by G_{ N } the tree obtained from G by adding some artificial nodes on the edges of G, and we label each node of G_{ N } by a biconnected component of N via an extension of the labeling function B(·) as follows:
Definition 7 (Tree G_{ N }) The tree G_{ N } is obtained from G as follows: for each internal node u in G with child nodes u_{1}, u_{2} such that there exist k biconnected components $B_{i_{1}} >_{N} \cdots >_{N} B_{i_{k}}$ strictly below B(u) and strictly above B(u_{1}), we add k artificial nodes v_{1} > ... > v_{ k } on the edge (u, u_{1}), and B(v_{ j }) is fixed to $B_{i_{j}}$. We do the same for u_{2}.
See Figure 2 for an example of G_{ N } . We have the following lemma:
Lemma 1 Let u be a node of I(G_{ N } ), and u' be one of the children of u. If u is an artificial node, then B(u) is the parent of B(u') and a child of B(p(u)); otherwise, B(u') is either equal to B(u) or one of its children.
Proof: Suppose that u is an artificial node. Then, by Definition 7, B(p(u)), B(u), B(u') are three consecutive nodes of G_{ N }, thus B(u) is the parent of B(u') and a child of B(p(u)). Suppose now that u is not an artificial node, and let u" be the child of u in G such that $u'' \le_{G_{N}} u' \le_{G_{N}} u$. If B(u") = B(u'), then by definition, u" = u' because no artificial node is added between u and u", and thus the claim holds. If B(u') ≠ B(u"), then by Definition 7, B(u') is the highest biconnected component of N that is below B(u) and above B(u"), which is then a child of B(u).
Definition 8 (G_{ B }) Let B be a biconnected component of N different from a leaf; then G_{ B } is the set of all maximal connected subgraphs H of G_{ N } such that B(u) = B for every u ∈ I(H).
For example, in Figure 3, $G_{B_{1}}$ consists of one binary tree, $G_{B_{2}}$ consists of one edge, and $G_{B_{3}}$ contains 1 tree and 1 edge.
Lemma 2 Let B be a biconnected component of N different from a leaf, then we have the following:
(i) for every H ∈ G_{ B }, H is either a binary tree or an edge whose upper extremity is an artificial node. Moreover, for every leaf u of H, B(u) is a child of B.
(ii) if B = B(r(G)), then G_{ B } consists of one binary tree.
Proof: (i) First, suppose that I(H) contains an artificial node u. Then this artificial node is the only internal node of H; indeed, by Lemma 1, the value of B(·) for the parent and the child of u are both different from B. Thus, H consists of only one edge whose upper extremity is u. If I(H) does not contain any artificial node, it follows that H is a binary tree. Moreover, by Lemma 1 and Definition 8, B(u) is a child of B for every leaf u of H.
(ii) Suppose now that B = B(r(G)), and G_{ B } contains at least two components H_{1}, H_{2}, rooted at two different nodes r_{1}, r_{2} where B(r_{1}) = B(r_{2}) = B. Let $r = LCA_{G_{N}}(r_{1}, r_{2})$; then, by Definition 6, B(v) = B for every node v on the two paths from r to r_{1} and to r_{2}, because B(r(G)) = B(r_{1}) = B(r_{2}) = B and $r_{1}, r_{2} \le_{G_{N}} r \le_{G_{N}} r(G)$. But this contradicts the maximality of H_{1} and H_{2} since they can both be extended. Hence G_{ B } contains only one component. Suppose that this component is an edge; thus, its upper extremity is an artificial node that has been added on the path from a node u to a node v of G where u is strictly higher than r(G). But this is not possible, because r(G) is the highest node of G. Therefore, G_{ B } contains one binary tree.
Given a biconnected component B_{ i } different from a leaf, denote by ${G}_{{B}_{i}}^{t}$ the set of binary trees of ${G}_{{B}_{i}}$, and ${G}_{{B}_{i}}^{e}$ the set of edges of ${G}_{{B}_{i}}$. Let S_{ i } be a switching of N(B_{ i }), and let H be a tree in ${G}_{{B}_{i}}^{t}$. By Lemma 2, for every u ∈ L(H), B(u) is a child of B_{ i } and thus r(B(u)) is a leaf of N(B_{ i }), which is also a leaf of S_{ i }. Hence, we can define a reconciliation between H and S_{ i }, denoted by ${\beta}_{{S}_{i}}^{H}$, such that each leaf u of H is mapped to the leaf r(B(u)) of S_{ i }.
Lemma 3 Let S be a switching of N, and let α be the reconciliation between G and S. For every u ∈ I(G), there exists $H\in G_{B(u)}^{t}$ such that u ∈ I(H), and $\alpha(u)=\beta_{S(B(u))}^{H}(u)$.
The proof of this lemma is deferred to the appendix. The following definition will be useful later on.
Definition 9 ( cost(H,S_{ i })) Let B_{ i } be a biconnected component of N different from a leaf, and S_{ i } a switching of N(B_{ i }); then cost(H,S_{ i }) is defined as follows:

$\forall H\in G_{B_{i}}^{t}$, $cost(H,S_{i})=cost(\beta_{S_{i}}^{H})$ if B_{ i } = B(r(G)), and $cost(H,S_{i})=cost(\beta_{S_{i}}^{H})+\lambda\cdot dist(r(S_{i}),\beta_{S_{i}}^{H}(r(H)))$ otherwise;

$\forall H\in G_{B_{i}}^{e}$, $cost(H,S_{i})=\lambda\cdot dist(r(S_{i}),r(B(u)))$ where u is the only leaf of H.
As we will see later, this cost corresponds to the contribution of H to a reconciliation between G and any switching of N that contains S_{ i }. For example, in Figure 3(b), H is the edge (B_{2}, a) and, if S_{ i } is the switching on the left, then cost(H,S_{ i }) = λ.
The next lemma is central: it allows Problem 1 to be solved independently for each biconnected component of N:
Lemma 4 Let {B_{1},..., B_{ p }} be the biconnected components of N that are not leaf nodes, and let S be a switching of N where each elementary network N(B_{ i }) has S_{ i } as switching. Then $cost\left(G,S\right)=\sum _{i=1}^{p}{\sum}_{H\in {G}_{{B}_{i}}}cost\left(H,{S}_{i}\right).$
Proof: Let α be the reconciliation between G and S. Denote by d_{ α }(S_{ i }) the number of nodes in I(S_{ i }) associated to a duplication by α. By Remark 1, no duplication happens at leaves of S. Additionally, $\bigcup_{B_{i}}\bigcup_{H\in G_{B_{i}}^{t}} I(H)=I(G)$ and the sets of internal nodes of the trees in $G_{B_{i}}^{t}$ are disjoint. Hence, we have $d(\alpha)=\sum_{i=1}^{p} d_{\alpha}(S_{i})=\sum_{i=1}^{p}\sum_{H\in G_{B_{i}}^{t}} d(\beta_{S_{i}}^{H})$ because $\beta_{S_{i}}^{H}(u)=\alpha(u)$ for every internal node u of H (Lemma 3).
Let us now consider the loss count. Note that a node/edge is on (resp. off) in S_{ i } if and only if it is also on (resp. off) in S. Let x, y be two nodes of S such that y ≤_{ S } x. Then we define $dist_{S_{i}}(x,y)$ as the number of nodes z in V_{ on }(S_{ i }) \ L(S_{ i }) such that y <_{ S } z ≤_{ S } x. Then, for each internal node u of G, we define the number of losses associated with u in S_{ i }, denoted by l_{ α }(u, S_{ i }), similarly to l_{ α }(u) but using the function $dist_{S_{i}}$ instead of dist. Then, $l_{\alpha}(u)=\sum_{i=1}^{p} l_{\alpha}(u,S_{i})$. Now, let u_{1}, u_{2} be the two children of u in G; then l_{ α }(u, S_{ i }) > 0 only if the path from α(u) to either α(u_{1}) or α(u_{2}) in S contains at least one node of B_{ i }. Therefore, we have three possible cases: (1) α(u) is in B_{ i }, (2) α(u) is above r(B_{ i }) while either α(u_{1}) or α(u_{2}) is in B_{ i }, (3) α(u) is above r(B_{ i }) while either α(u_{1}) or α(u_{2}) is in a biconnected component below B_{ i }. In case (1), by Lemma 3, there exists a binary tree H of $G_{B_{i}}^{t}$ such that u ∈ I(H), and $\alpha(v)=\beta_{S_{i}}^{H}(v)$ for every v ∈ I(H), thus $l_{\alpha}(u,S_{i})=l_{\beta_{S_{i}}^{H}}(u)$. Now, suppose that case (2) holds for u_{1}; then u_{1} must be the root of a binary tree H_{1} of $G_{B_{i}}^{t}$, and the contribution of u_{1} to l_{ α }(u, S_{ i }) is $dist(r(S_{i}),\beta_{S_{i}}^{H_{1}}(r(H_{1})))$. Note that in this case, u_{1} ≠ r(G). Finally, suppose that case (3) holds for u_{1} and let u_{ a }, u_{ b } be the artificial nodes added on the edge (u, u_{1}) of G such that B(u_{ a }) = B_{ i } and B(u_{ b }) is a child of B_{ i }.
Hence, $(u_{a},u_{b})\in G_{B_{i}}^{e}$, and the contribution of u_{1} to l_{ α }(u, S_{ i }) is dist(r(S_{ i }), r(B(u_{ b }))). Call $V_{i}^{1},V_{i}^{2},V_{i}^{3}$ the sets of nodes u in the first, second, and third case, respectively. By construction, $V_{i}^{1}$ is disjoint from $V_{i}^{2}$ and $V_{i}^{3}$. Moreover, $V_{i}^{2}$ and $V_{i}^{3}$ are disjoint because if a node u has two children u_{1}, u_{2} such that u_{1} is in B_{ i }, and u_{2} is below B_{ i }, then u must be in B_{ i }, i.e. cannot be above r(B_{ i }). Thus,
$$l(\alpha)=\sum_{u\in I(G)} l_{\alpha}(u)=\sum_{i=1}^{p}\left[\sum_{u\in V_{i}^{3}} l_{\alpha}(u,S_{i})+\sum_{u\in V_{i}^{1}} l_{\alpha}(u,S_{i})+\sum_{u\in V_{i}^{2}} l_{\alpha}(u,S_{i})\right].$$
Therefore,
$$cost(G,S)=\delta\cdot d(\alpha)+\lambda\cdot l(\alpha)=\sum_{i=1}^{p}\left\{\left[\lambda\sum_{u\in V_{i}^{3}} l_{\alpha}(u,S_{i})\right]+\left[\delta\sum_{H\in G_{B_{i}}^{t}} d(\beta_{S_{i}}^{H})+\lambda\sum_{u\in V_{i}^{1}} l_{\alpha}(u,S_{i})+\lambda\sum_{u\in V_{i}^{2}} l_{\alpha}(u,S_{i})\right]\right\}.$$
As proved above (in case (3)), the first term between square brackets is equal to $\sum_{H\in G_{B_{i}}^{e}} cost(H,S_{i})$ by Definition 9. In the second term between square brackets, the sum of the first two summands is exactly $\sum_{H\in G_{B_{i}}^{t}} cost(\beta_{S_{i}}^{H})$ (as proved in case (1)), while the last summand is equal to $\lambda\cdot\sum_{H\in G_{B_{i}}^{t}} dist(r(S_{i}),\beta_{S_{i}}^{H}(r(H)))$ (as proved in case (2)). Note that, as mentioned in case (2), only nodes that are not the root of G are considered. Hence, the second term between square brackets corresponds to $\sum_{H\in G_{B_{i}}^{t}} cost(H,S_{i})$ by Definition 9.
Therefore, $cost\left(G,S\right)\phantom{\rule{0.25em}{0ex}}={\sum}_{i=1}^{p}{\sum}_{H\in {G}_{{B}_{i}}}cost(H,{S}_{i})$ and this concludes the proof.
Algorithm 1 computes the switching of N for which its reconciliation with G has the smallest cost, by analyzing each biconnected component of N independently. See Figure 3 for an example of application of this algorithm to the species network in Figure 1(a) and the gene tree in Figure 2(a).
Algorithm 1 Solving Problem 1
1: Input: A species network N and a gene tree G such that $\mathcal{L}\left(G\right)\subseteq \mathcal{L}\left(N\right)$, and positive costs δ, λ for duplication and loss events, respectively.
2: Output: A switching S of N that is optimal in the sense of Problem 1.
3: Compute the tree G_{ N } and its labeling function B(·);
4: Compute ${G}_{{B}_{i}}$ for each biconnected component B_{ i } of N that is not a leaf;
5: for each biconnected component B_{ i } of N that is not a leaf do
6:     for each switching $S_{i}^{j}$ of N(B_{ i }) do
7:         $cost_{j}=\sum_{H\in G_{B_{i}}} cost(H,S_{i}^{j})$;
8:     $S_{i}^{m}\leftarrow$ the switching of N(B_{ i }) with the lowest value of cost_{ j } over all j;
9: Return the switching S of N in which each elementary network N(B_{ i }) has ${S}_{i}^{m}$ as switching.
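The selection step of lines 5-9 can be illustrated in isolation. In the sketch below, the table `TABLES` holds invented values standing for the sums $\sum_{H\in G_{B_{i}}} cost(H,S_{i}^{j})$; computing these sums from G_{ N } is not shown. Because the total cost is a sum of per-component terms (Lemma 4), taking the minimum in each row separately gives the same total as trying every combination of rows.

```python
from itertools import product


def choose_per_component(cost_tables):
    """Pick, for each component i, the index j of its cheapest switching.

    cost_tables[i][j] stands for the sum over H in G_{B_i} of cost(H, S_i^j).
    """
    choice = [min(range(len(row)), key=row.__getitem__) for row in cost_tables]
    total = sum(row[j] for row, j in zip(cost_tables, choice))
    return choice, total


def choose_exhaustively(cost_tables):
    """Reference value: minimum over every combination of per-component switchings."""
    combos = product(*(range(len(row)) for row in cost_tables))
    return min(sum(row[j] for row, j in zip(cost_tables, combo)) for combo in combos)


# Hypothetical costs for three components with 2, 4 and 2 switchings.
TABLES = [[3.0, 1.0], [2.0, 5.0, 0.5, 4.0], [1.5, 1.5]]
choice, total = choose_per_component(TABLES)
reference = choose_exhaustively(TABLES)
```

Here the per-component choice inspects 2 + 4 + 2 = 8 table entries, whereas the exhaustive reference evaluates 2·4·2 = 16 combinations; both return the same total.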
Theorem 2 Let N be a level-k network with p biconnected components. Algorithm 1 runs in O(|N| + 2^{k}·p·|G|) time and returns a switching S of N such that cost(G,S) is minimum.
Proof: Complexity: It is well known that all biconnected components of N can be computed in linear time, i.e. O(|N|), by using depth-first search [10]. After a linear preprocessing, LCA operations on a tree can be performed in constant time [21, 22]. Thus, from Remark 2, the mapping B(·) can be computed in O(|G_{ N }|). Hence, the tree G_{ N } can be constructed in time O(|G_{ N }| + |N|).
By a simple traversal of G_{ N }, which takes time O(|G_{ N }|), we can compute $G_{B_{i}}$ for all B_{ i }. Each biconnected component B_{ i } of N has at most k hybridization nodes, so N(B_{ i }) has at most 2^{k} switchings. For each switching $S_{i}^{j}$ of N(B_{ i }), it takes $O(|G_{B_{i}}|)$ time to compute cost_{ j } [23, 24]. The overall size of all $G_{B_{i}}$ is the size of G_{ N }. Therefore, the total complexity of the loop at lines 5-8 is O(2^{k}·|G_{ N }|). Each edge of G can have at most p artificial nodes added to it. Hence in the worst case O(|G_{ N }|) = O(|G|·p), i.e. the total complexity of Algorithm 1 is O(|N| + 2^{k}·p·|G|).
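To make the gain concrete, consider a hypothetical network (the numbers are chosen only for illustration) with p = 10 non-trivial biconnected components, each containing exactly k = 2 hybridization nodes, hence h = 20 hybridization nodes overall. Counting one candidate per choice of incoming edges, enumerating the switchings of N as a whole and enumerating them per elementary network compare as follows:

```latex
\begin{align*}
\text{switchings of } N &: \quad 2^{h} = 2^{20} = 1\,048\,576,\\
\text{switchings of all } N(B_i) &: \quad p \cdot 2^{k} = 10 \cdot 2^{2} = 40.
\end{align*}
```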
Correctness: Let S be a switching of N where each elementary network N(B_{ i }) has S_{ i } as the switching. By Lemma 4, cost(G,S) = $\sum_{i=1}^{p}\sum_{H\in G_{B_{i}}} cost(H,S_{i})$. Hence cost(G,S) is minimum if and only if $\sum_{H\in G_{B_{i}}} cost(H,S_{i})$ is minimum for every i. Lines 5-8 in Algorithm 1 search, for each network N(B_{ i }), the switching $S_{i}^{m}$ such that $\sum_{H\in G_{B_{i}}} cost(H,S_{i}^{m})$ is minimum, which implies the correctness of the algorithm.
Minimum Reconciliation on Networks
Given a tree G and a network N, in [4] the authors prove that reconciling G on N can be solved in polynomial time, when host switchings (or transfer events, in the DTL reconciliation terminology) are also accounted for. In their model, each node of N is dated by a single date while each node of G is dated by a set of dates, and they search for a parsimonious reconciliation between G and N, i.e. one that has minimum cost, under the constraint that an event associated to a node u of G can only happen at a node/edge of N whose date is contained in the set of possible dates of u. Although the algorithm complexity stays polynomial, it is very high: O(τ^{3}·|G|·|N|^{5}) for a binary tree and a binary network, where τ is the number of possible dates of the nodes of G and N, which is at most O(|G| + |N|). A drawback of this model is that it requires information on the dates that is often not available. Moreover, transfers are not always possible in all parts of the Tree of Life. Here, we take into account only speciation, duplication and loss events, and consider G and N as undated (see Problem 2 and Definition 5 for more details). Using a similar dynamic programming algorithm on this simpler model, and by some further analyses, we provide an algorithm that generalizes the LCA algorithm to networks and has a much smaller complexity than the algorithm in [4], namely O(h^{2}·|G|·|N|), where h is the number of hybridization nodes of N.
Let x, y be two nodes of N. Denote by $\mathcal{M}in_{N}(x,y)$ the subset of $\mathcal{M}_{N}(x,y)$ such that, for every $z\in \mathcal{M}in_{N}(x,y)$, there does not exist any $z'\in \mathcal{M}_{N}(x,y)$, z' ≠ z, such that every path from z to x and to y passes through z'. For example, in Figure 1(d), m_{1}, m_{2}, m_{3} are in $\mathcal{M}_{N}(x,y)$ but only m_{1} and m_{2} are in $\mathcal{M}in_{N}(x,y)$ because every path from m_{3} to x, y must pass through m_{2}.
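The filtering that turns $\mathcal{M}_{N}(x,y)$ into $\mathcal{M}in_{N}(x,y)$ can be sketched by brute force. The code below assumes that the excluded node z' ranges over $\mathcal{M}_{N}(x,y)$, and it tests "every path from z to x passes through z'" by checking that x becomes unreachable from z once z' is removed. The `children` encoding and both networks are hypothetical examples, not the network of Figure 1(d).

```python
def reach(children, src, avoid=None):
    """Nodes reachable from src without entering the node `avoid`."""
    if src == avoid:
        return set()
    seen, stack = {src}, [src]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c != avoid and c not in seen:
                seen.add(c)
                stack.append(c)
    return seen


def m_set(children, x, y):
    """Brute-force M_N(x, y): nodes with two separated paths to x and to y."""
    out = set()
    for z, kids in children.items():
        if len(kids) == 2:
            a, b = (reach(children, k) for k in kids)
            if (x in a and y in b) or (x in b and y in a):
                out.add(z)
    return out


def min_set(children, x, y):
    """Keep the nodes of M_N(x, y) not dominated by another node of M_N(x, y)."""
    m = m_set(children, x, y)

    def dominated(z):
        # z2 dominates z if removing z2 cuts z off from both x and y
        return any(z2 != z
                   and x not in reach(children, z, z2)
                   and y not in reach(children, z, z2)
                   for z2 in m)

    return {z for z in m if not dominated(z)}


# Hypothetical network in which every path from m3 to x or y goes through m2.
NET3 = {"m3": ("a", "b"), "a": ("l1", "h"), "b": ("h", "l2"),
        "h": ("m2",), "m2": ("x", "y")}
all_candidates = m_set(NET3, "x", "y")
minimal = min_set(NET3, "x", "y")
```

In `NET3` both m3 and m2 have separated paths to x and y, but only m2 survives the filtering.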
For the sake of simplicity, we consider only reconciliations without duplications on hybridization nodes: indeed, since losses are not counted at hybridization nodes, a duplication on such nodes can be moved to its only child without changing the total cost of the reconciliation.
Lemma 5 Let α be a reconciliation of minimum cost between G and N, then for every node u of G that has two children u_{1}, u_{2}, we have:
(i) if $\alpha_{e}(u)=\mathbb{S}$, then $\alpha_{r}(u)\in \mathcal{M}in_{N}(\alpha_{r}(u_{1}),\alpha_{r}(u_{2}))$;
(ii) if ${\alpha}_{e}\left(u\right)=\mathbb{D}$, then either α_{ r } (u_{1}) ≤_{ N }α_{ r }(u_{2}) and α_{ r }(u) = α_{ r }(u_{2}), or α_{ r }(u_{2}) ≤_{ N }α_{ r }(u_{1}) and α_{ r }(u) = α_{ r }(u_{1}).
Proof: (i) Suppose that ${\alpha}_{e}(u)=\mathbb{S}$; then by definition α_{ r }(u) must be a node of $\mathcal{M}_{N}({\alpha}_{r}(u_{1}),{\alpha}_{r}(u_{2}))$. Let x_{1}, x_{2} be the two children of α_{ r }(u), named so that l_{ α }(u) = dist(x_{1}, α_{ r }(u_{1})) + dist(x_{2}, α_{ r }(u_{2})). Suppose that ${\alpha}_{r}(u)\notin \mathcal{M}in_{N}({\alpha}_{r}(u_{1}),{\alpha}_{r}(u_{2}))$; then there exists a node y in $\mathcal{M}in_{N}({\alpha}_{r}(u_{1}),{\alpha}_{r}(u_{2}))$ such that every path from α_{ r }(u) to α_{ r }(u_{1}) (resp. to α_{ r }(u_{2})) passes through y. Let y_{1}, y_{2} be the two children of y.
First, suppose that the shortest path from x_{1} to α_{ r }(u_{1}) passes through y and y_{1}, while the one from x_{2} to α_{ r }(u_{2}) passes through y and y_{2}. Consider the reconciliation α' such that ${\alpha}_{r}^{\prime}(v)={\alpha}_{r}(v)$ and ${\alpha}_{e}^{\prime}(v)={\alpha}_{e}(v)$ for every v ≠ u, while ${\alpha}_{r}^{\prime}(u)=y$ and ${\alpha}_{e}^{\prime}(u)=\mathbb{S}$. It is easy to see that α' respects Definition 5, and that d(α) = d(α'). Let f = dist(α_{ r }(u), y); then f > 0. The numbers of losses in α and α' differ only in those associated with u and p(u) (if u is not the root of G). We thus have l_{ α }(p(u)) ≥ l_{ α' }(p(u)) − f if u ≠ r(G). Moreover, l_{ α }(u) = dist(x_{1}, α_{ r }(u_{1})) + dist(x_{2}, α_{ r }(u_{2})) = dist(x_{1}, y) + 1 + dist(y_{1}, α_{ r }(u_{1})) + dist(x_{2}, y) + 1 + dist(y_{2}, α_{ r }(u_{2})) ≥ dist(α_{ r }(u), y) + dist(y_{1}, ${\alpha}_{r}^{\prime}$(u_{1})) + dist(α_{ r }(u), y) + dist(y_{2}, ${\alpha}_{r}^{\prime}$(u_{2})) ≥ 2·f + l_{ α' }(u). Hence, whether u coincides with r(G) or not, l(α) > l(α'), i.e. cost(α) > cost(α'), a contradiction.
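The two bounds of this case can be combined in one line. The derivation below is a sketch for u ≠ r(G), assuming that every node of G other than u and p(u) contributes the same number of losses to α and α' (for u = r(G) the term for p(u) is absent and the difference is at least 2f):

$l(\alpha)-l(\alpha^{\prime})=\big(l_{\alpha}(u)-l_{\alpha^{\prime}}(u)\big)+\big(l_{\alpha}(p(u))-l_{\alpha^{\prime}}(p(u))\big)\ge 2f-f=f>0.$

Since d(α) = d(α') and the loss cost λ is positive, the strict inequality on losses carries over to the total cost.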
Suppose now that the shortest paths from x_{1} to α_{ r }(u_{1}) and from x_{2} to α_{ r }(u_{2}) both pass through y and then through y_{1} (the case of y_{2} is symmetric). This means that y_{1} is above both α_{ r }(u_{1}) and α_{ r }(u_{2}). Let y' be one of the lowest nodes below or equal to y_{1} that is above both α_{ r }(u_{1}) and α_{ r }(u_{2}). Let ${y}_{1}^{\prime}$, ${y}_{2}^{\prime}$ be the two children of y', and suppose, without loss of generality, that ${y}_{1}^{\prime}$ is above or equal to α_{ r }(u_{1}) and ${y}_{2}^{\prime}$ is above or equal to α_{ r }(u_{2}). Hence, the shortest path from x_{1} to α_{ r }(u_{1}) must pass, in this order, through y, y_{1}, y', and ${y}_{1}^{\prime}$, while the shortest path from x_{2} to α_{ r }(u_{2}) must pass through y, y_{1}, y', and ${y}_{2}^{\prime}$. By defining the reconciliation α' as above, except that ${\alpha}_{r}^{\prime}(u)$ is set to y', we arrive at a contradiction by a similar argument.
(ii) Suppose that ${\alpha}_{e}(u)=\mathbb{D}$ and that α_{ r }(u_{1}), α_{ r }(u_{2}) are not comparable. Hence, $\mathcal{M}in_{N}({\alpha}_{r}(u_{1}),{\alpha}_{r}(u_{2}))$ is not empty. If ${\alpha}_{r}(u)\notin \mathcal{M}in_{N}({\alpha}_{r}(u_{1}),{\alpha}_{r}(u_{2}))$, then there exists a node $y\in \mathcal{M}in_{N}({\alpha}_{r}(u_{1}),{\alpha}_{r}(u_{2}))$ such that every path from α_{ r }(u) to α_{ r }(u_{1}) and to α_{ r }(u_{2}) must pass through y. Similarly to case (i), the reconciliation α' that coincides with α except that ${\alpha}_{r}^{\prime}(u)=y$ and ${\alpha}_{e}^{\prime}(u)=\mathbb{S}$ has smaller cost than α, a contradiction. Hence, α_{ r }(u) ∈ ℳin_{ N }(α_{ r }(u_{1}), α_{ r }(u_{2})). Consider now the reconciliation α' that coincides with α except for ${\alpha}_{e}^{\prime}(u)$, which is set to $\mathbb{S}$. Then d(α) = d(α') + 1, and $l_{\alpha}(p(u))=l_{\alpha^{\prime}}(p(u))$ if u ≠ r(G). Let x_{1}, x_{2} be the two children of α_{ r }(u). If the shortest path from α_{ r }(u) to α_{ r }(u_{1}) passes through x_{1} while the shortest path from α_{ r }(u) to α_{ r }(u_{2}) passes through x_{2}, then $l_{\alpha}(u)=2+l_{\alpha^{\prime}}(u)$. Therefore, cost(α) > cost(α'), a contradiction. Thus, both shortest paths from α_{ r }(u) to α_{ r }(u_{1}) and α_{ r }(u_{2}) must pass through x_{1} (or x_{2}). Let y be one of the lowest nodes located on both of these paths. Then, y ∈ ℳin_{ N }(α_{ r }(u_{1}), α_{ r }(u_{2})). By the same argument as in the previous case, the reconciliation α" coinciding with α except for ${\alpha}_{r}^{\prime\prime}(u)=y$ and ${\alpha}_{e}^{\prime\prime}(u)=\mathbb{S}$ must have smaller cost than α, a contradiction.
Hence, either α_{ r }(u_{1}) ≤_{ N } α_{ r }(u_{2}) or α_{ r }(u_{2}) ≤_{ N } α_{ r }(u_{1}). Suppose that the first case holds (the second case is similar), but α_{ r }(u) ≠ α_{ r }(u_{2}), i.e. α_{ r }(u_{2}) <_{ N } α_{ r }(u). Let x_{1}, x_{2} be the two children of α_{ r }(u). If the shortest path from α_{ r }(u) to α_{ r }(u_{1}) passes through x_{1} while the shortest path from α_{ r }(u) to α_{ r }(u_{2}) passes through x_{2}, then by replacing α_{ e }(u) by an $\mathbb{S}$ event, we obtain a reconciliation with a smaller cost. Thus, both shortest paths from α_{ r }(u) to α_{ r }(u_{1}) and α_{ r }(u_{2}) must pass through x_{1} (or x_{2}). Let y be a node located on both paths such that no other node below y lies on both paths. Then, y ∈ ℳin_{ N }(α_{ r }(u_{1}), α_{ r }(u_{2})). By the same argument as in case (i), the reconciliation α' such that ${\alpha}_{r}^{\prime}(v)={\alpha}_{r}(v)$, ${\alpha}_{e}^{\prime}(v)={\alpha}_{e}(v)$ for every v ≠ u, ${\alpha}_{r}^{\prime}(u)=y$, ${\alpha}_{e}^{\prime}(u)=\mathbb{S}$ must have smaller cost than α, a contradiction.
Now, we are ready to describe a dynamic algorithm to compute a reconciliation of minimum cost between G and N. Let α be a reconciliation between G and N. For every u ∈ V(G), denote by cost_{ α }(u) the cost of the reconciliation of α restricted to G_{ u }. Hence, if α is a most parsimonious reconciliation, then cost_{ α }(u) is the minimum cost among all reconciliations between G_{ u } and N that maps u to α_{ r }(u). Algorithm 2 aims at computing, for each u, the set $\mathcal{C}$(u) containing all pairs (x,c) such that c is the minimum cost among all reconciliations between G_{ u } and N mapping u to x. It is straightforward to see that the cost of a most parsimonious reconciliation between G and N is the minimum cost involved in a pair in $\mathcal{C}$(r(G)).
The function merge(L_{1}, L_{2}) used in Algorithm 2 takes as input two lists of pairs (x,c), where x is a node of N and c is a real number, and merges them keeping, for each x, the pair (x,c) with the smallest value of c. The method computeMin(y, z) used in Algorithm 2 is detailed in Algorithm 3. This method computes, for two nodes y, z of N, the set $\mathcal{M}in_{N}(y, z)$ by using two breadth-first searches (BFS) starting respectively from y and z and going up to the root of N (note that, to perform the breadth-first searches, the edges are considered as directed in the inverse order). For this, it labels each node v in such a way that, if v is not strictly above y and z, then label(v) = ∅. Otherwise, label(v) is the lowest node such that every path from v to y and z passes through it. This method also computes the value of the function dist between y (resp. z) and each node visited in the corresponding breadth-first search.
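The behaviour of merge can be illustrated with a minimal Python sketch. The representation of a list of pairs as Python tuples and the use of a dictionary are assumptions made for illustration; the paper does not prescribe a data structure.

```python
def merge(l1, l2):
    """Merge two lists of (node, cost) pairs, keeping the smallest cost per node.

    Hypothetical helper: nodes are any hashable identifiers of nodes of N.
    """
    best = {}
    for node, cost in list(l1) + list(l2):
        # Keep the pair with the smallest cost seen so far for this node.
        if node not in best or cost < best[node]:
            best[node] = cost
    # Dictionaries preserve insertion order, so nodes appear in first-seen order.
    return list(best.items())
```

With a hash-based dictionary, the expected running time is linear in the total length of the two input lists.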
Algorithm 2 Solving Problem 2
1: Input: A network N and a tree G such that $\mathcal{L}(G)\subseteq \mathcal{L}(N)$, and positive costs δ, λ for duplication and loss events, respectively.
2: Output: The set $\mathcal{C}$(u) of pairs (x,c) for every $u\in V(G)$.
3: for each node u of G in postorder do
4:   $\mathcal{C}$(u) ← ∅;
5:   if u is a leaf then
6:     Let x be the leaf of N such that s(x) = s(u);
7:     $\mathcal{C}$(u) ← {(x, 0)};
8:   else
9:     Let u_{1}, u_{2} be the two children of u;
10:     for each (y, c_{1}) ∈ $\mathcal{C}$(u_{1}) and each (z, c_{2}) ∈ $\mathcal{C}$(u_{2}) do
11:       computeMin(y, z);
12:       C ← ∅;
13:       for each $x\in \mathcal{M}in_{N}(y, z)$ do
14:         Let x_{1}, x_{2} be the two children of x in N;
15:         c ← c_{1} + c_{2} + λ·min{dist(x_{1}, y) + dist(x_{2}, z), dist(x_{1}, z) + dist(x_{2}, y)};
16:         C ← C ∪ {(x, c)};
17:       if y ≤_{ N } z then
18:         c ← δ + λ·dist(z, y) + c_{1} + c_{2};
19:         C ← C ∪ {(z, c)};
20:       else if z ≤_{ N } y then
21:         c ← δ + λ·dist(y, z) + c_{1} + c_{2};
22:         C ← C ∪ {(y, c)};
23:       $\mathcal{C}$(u) ← merge($\mathcal{C}$(u), C);
24: Return $\mathcal{C}$.
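The inner step of Algorithm 2 (lines 12 to 22) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name candidates, the callables dist and leq, and the dictionary children are hypothetical, the set ℳin_N(y, z) is assumed to be supplied by the caller, and the second term of the minimum is taken to be the swapped assignment dist(x_1, z) + dist(x_2, y).

```python
import math

def candidates(y, c1, z, c2, min_nodes, children, dist, leq, delta, lam):
    """Candidate (node, cost) pairs for one combination (y, c1), (z, c2).

    min_nodes : nodes assumed to form Min_N(y, z)
    children  : dict mapping a node of N to its two children (x1, x2)
    dist      : dist(a, b), with math.inf when b is not reachable from a
    leq       : leq(a, b) is True when a is below or equal to b in N
    """
    out = []
    for x in min_nodes:
        # Speciation at x: each child of x leads to one of y, z.
        x1, x2 = children[x]
        losses = min(dist(x1, y) + dist(x2, z), dist(x1, z) + dist(x2, y))
        out.append((x, c1 + c2 + lam * losses))
    if leq(y, z):
        # Duplication placed at z, the higher of the two nodes.
        out.append((z, delta + lam * dist(z, y) + c1 + c2))
    elif leq(z, y):
        # Duplication placed at y, the higher of the two nodes.
        out.append((y, delta + lam * dist(y, z) + c1 + c2))
    return out

# Toy tree r -> (a, b); keys of toy_dist are (upper node, lower node).
toy_dist = {("r", "a"): 1, ("r", "b"): 1, ("a", "a"): 0, ("b", "b"): 0, ("r", "r"): 0}
toy_d = lambda p, q: toy_dist.get((p, q), math.inf)
toy_leq = lambda p, q: (q, p) in toy_dist
print(candidates("a", 0, "b", 0, ["r"], {"r": ("a", "b")}, toy_d, toy_leq, delta=2, lam=1))
# prints [('r', 0)]
```

Passing dist and leq as callables keeps the sketch independent of how the network is stored; in Algorithm 2 these values come from computeMin.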
The following theorem proves the correctness of Algorithm 2:
Theorem 3 Algorithm 2 returns a matrix $\mathcal{C}$ such that, for every u ∈ V(G), (x,c) is contained in $\mathcal{C}$(u) if and only if there exists a most parsimonious reconciliation between G_{ u } and N mapping u to x with cost c.
Algorithm 3 computeMin(y, z)
1: Perform a BFS from y, store the ordered list of visited nodes in BFS(y) and compute, for each v in BFS(y), dist(v, y);
2: Do the same from z;
3: for each node v ∈ BFS(y) do
4:   if v = y or v = z or v is not in BFS(z) then label(v) = ∅
5:   else if v has only one child v_{1} that is in BFS(z) or BFS(y) then
6:     label(v) = label(v_{1});
7:   else
8:     Let v_{1}, v_{2} be the two children of v;
9:     if label(v_{1}) = label(v_{2}) ≠ ∅ then
10:       label(v) = label(v_{1});
11:     else
12:       label(v) = v; Add v into $\mathcal{M}in_{N}(y, z)$;
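Lines 1 and 2 of Algorithm 3 rely on a breadth-first search that follows the edges of N in reverse. A minimal Python sketch of that traversal is given below. The function name upward_bfs and the dictionary parents are hypothetical, and the edge count it returns is only a stand-in for the function dist, whose exact definition is given elsewhere in the paper.

```python
from collections import deque

def upward_bfs(parents, start):
    """Breadth-first search from `start` towards the root of the network.

    parents : dict mapping a node to the list of its parents (reversed edges)
    Returns (order, depth): the visited nodes in BFS order and, for each
    visited node, the number of edges on a shortest path down to `start`.
    """
    depth = {start: 0}
    order = [start]
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for p in parents.get(v, []):
            if p not in depth:          # first visit gives the shortest distance
                depth[p] = depth[v] + 1
                order.append(p)
                queue.append(p)
    return order, depth

# Toy network: r -> s, t; s -> y, h; t -> h, z; h -> w (h is a hybridization node).
toy_parents = {"y": ["s"], "z": ["t"], "h": ["s", "t"], "w": ["h"], "s": ["r"], "t": ["r"]}
```

Each node and edge above the start node is handled once, so one traversal is linear in the size of the network.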
Proof: For each u ∈ V(G), we need to prove that, if α' is a reconciliation of minimum cost between G_{ u } and N, then (α'(u), cost_{ α' }) is contained in $\mathcal{C}$(u). This is obviously true for every leaf of G (by lines 5–7). Let now u be an internal node having two children u_{1}, u_{2}. Then, following Lemma 5, $\mathcal{C}$(u) can be computed from $\mathcal{C}$(u_{1}) and $\mathcal{C}$(u_{2}) by using the information contained in $\mathcal{M}in_{N}$ and dist. Lines 10–23 of Algorithm 2 compute $\mathcal{C}$(u) following this lemma.
It remains now to prove that Algorithm 3 correctly computes $\mathcal{M}i{n}_{N}\left(y,\phantom{\rule{0.25em}{0ex}}z\right)$. For every node v that is above y, z, denote by low(v) the lowest node such that every path from v to y, z must pass through this node. There is only one such node. Indeed, suppose that there are two such nodes m, m', then every path from v to y must pass through both m, m', i.e. either m <_{ N } m' or m' <_{ N } m, contradicting the lowest property of m, m'. To prove the claim, we need to show that if v is a node of BFS(y), then:
(i) if v is not strictly above y and z, then label(v) = ∅;
(ii) otherwise, label(v) = low(v) and line 12 of Algorithm 3 is performed if and only if v is actually in $\mathcal{M}i{n}_{N}\left(y,\phantom{\rule{0.25em}{0ex}}z\right)$.
Let v be a node of BFS(y), then (i) holds by line 4. Now consider the case where v is strictly above y and z. We will prove (ii) by induction on v, following the order of the nodes in BFS(y). The induction starts from the set of lowest nodes that are strictly above y and z, i.e. the set of nodes v in $\mathcal{M}in_{N}(y, z)$ such that there is no node in $\mathcal{M}in_{N}(y, z)$ below v. Let v_{1}, v_{2} be the two children of v; then by hypothesis v_{1}, v_{2} are not strictly above y, z, i.e. label(v_{1}) = label(v_{2}) = ∅, and low(v) = v. Thus, due to line 12, label(v) = v = low(v). Now let v be a node strictly above y, z such that (ii) holds for each node below v that is strictly above or equal to y, z. If v is a hybridization node, then it is evident that low(v) = low(v_{1}), where v_{1} is the only child of v. Moreover, since label(v_{1}) = low(v_{1}) (by the induction hypothesis), label(v) = label(v_{1}) = low(v_{1}) = low(v). If v is a speciation node having two children v_{1}, v_{2} such that v_{2} is not in either BFS(y) or BFS(z), then we also have low(v) = low(v_{1}). Hence, due to lines 5–6, we have label(v) = label(v_{1}) = low(v_{1}) = low(v). Now consider the last case, i.e. v is a speciation node having two children v_{1}, v_{2} that are both in BFS(y) and BFS(z). If there exists a node q = low(v_{1}) = low(v_{2}), then low(v) = q because every path from v to y, z must pass either through v_{1} or v_{2}, i.e. always passes through q. Following line 10, we fix label(v) = label(v_{1}) = q = low(v). Moreover, in this case v cannot be in $\mathcal{M}in_{N}(y, z)$ because every path from v to y, z passes through a node q below v. In the remaining case, we have $label(v_{1})\ne label(v_{2})$. 
Since both v_{1}, v_{2} are above y, z, then there exists a node q_{1} = label(v_{1}), and a different node q_{2} = label(v_{2}). We will prove that $v\in \mathcal{M}i{n}_{N}\left(y,\phantom{\rule{0.25em}{0ex}}z\right)$. Indeed, suppose otherwise, then there is a node q such that every path from v to y, z must pass through q. Hence, every path from v_{1} (resp. v_{2}) to y, z must also pass through q, so q_{1} ≤_{ N } q (resp. q_{2} ≤_{ N } q). It means that there is a path from v_{1} to q to q_{2} and then to y, z that does not pass through q_{1}, a contradiction. Hence v is in $\mathcal{M}i{n}_{N}\left(y,\phantom{\rule{0.25em}{0ex}}z\right)$, and thus by definition the only node that every path from v to y, z must pass through is v itself (line 12).
We now present some intermediate results that will be useful to prove the complexity of Algorithm 2.
We extend the definition of ${\mathcal{M}}_{N}$ to a subset of leaves. Let L be a subset of L(N). If |L| = 1, then ${\mathcal{M}}_{N}(L)=L$. Otherwise, ${\mathcal{M}}_{N}(L)$ is the set of nodes m of N such that m is above all leaves in L and there exist at least two separated paths from m to two distinct leaves of L.
Given a node u of G, L_{ N }(u) is defined as the set of leaves of N to which α maps the leaf set of G_{ u }, i.e. $L_{N}(u)=\{x\in L(N)\mid \exists v\in L(G_{u}) \text{ such that } s(v)=s(x)\}$.
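The sets L_N(u) can be collected for all nodes of G in one postorder pass. The Python sketch below is illustrative only: the function name network_leaf_sets and the dictionaries children and species are hypothetical, and each leaf of N is identified with its species label.

```python
def network_leaf_sets(children, species, root):
    """Return a dict mapping every node u of the gene tree to L_N(u).

    children : dict mapping an internal node of G to its two children
    species  : dict mapping a leaf of G to its label s(leaf); the leaf of N
               carrying that label is represented by the label itself
    """
    sets = {}

    def visit(u):
        if u not in children:
            # A leaf of G is mapped to the single leaf of N with the same label.
            sets[u] = {species[u]}
        else:
            a, b = children[u]
            visit(a)
            visit(b)
            # L_N(u) is the union of the sets of the two children.
            sets[u] = sets[a] | sets[b]

    visit(root)
    return sets

# Toy gene tree: u -> (v, g3), v -> (g1, g2).
toy_sets = network_leaf_sets(
    {"u": ("v", "g3"), "v": ("g1", "g2")},
    {"g1": "A", "g2": "B", "g3": "C"},
    "u",
)
```

The recursion is kept for brevity; an explicit stack would be preferable for very deep gene trees.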
Lemma 6 Let α be a most parsimonious reconciliation between G and N, then, for every node u of $G,{\alpha}_{r}\left(u\right)\in {\mathcal{M}}_{N}\left({L}_{N}\left(u\right)\right)$.
Proof: It is true for every leaf u of G. Let u now be an internal node having two children u_{1}, u_{2}.
If u_{1}, u_{2} ∈ L(G), then following Remark 1, L_{ N } (u) consists of two distinct nodes α_{ r }(u_{1}), α_{ r }(u_{2}). By Lemma 5, ${\alpha}_{e}\left(u\right)=\mathbb{S}$ and ${\alpha}_{r}\left(u\right)\in \mathcal{M}i{n}_{N}\left({\alpha}_{r}\left({u}_{\mathsf{\text{1}}}\right),{\alpha}_{r}\left({u}_{\mathsf{\text{2}}}\right)\right)$, i.e. there exist two separated paths from α_{ r }(u) to α_{ r }(u_{1}) and α_{ r }(u_{2}). It means that ${\alpha}_{r}\left(u\right)\in {\mathcal{M}}_{N}\left({L}_{N}\left(u\right)\right)$.
Suppose now that u_{1} ∉ L(G). Following Remark 1, |L_{ N }(u_{1})| ≥ 2, so there always exist two distinct leaves x, y of L_{ N }(u) such that x ∈ L_{ N }(u_{1}), y ∈ L_{ N }(u_{2}), i.e. x is below α_{ r }(u_{1}) and y is below α_{ r }(u_{2}). If ${\alpha}_{e}(u)=\mathbb{S}$, then following Lemma 5, ${\alpha}_{r}(u)\in \mathcal{M}in_{N}({\alpha}_{r}(u_{1}),{\alpha}_{r}(u_{2}))$, i.e. there exist two separated paths from α_{ r }(u) to α_{ r }(u_{1}), α_{ r }(u_{2}). By extending these two paths from α_{ r }(u_{1}) to x, and from α_{ r }(u_{2}) to y, we obtain two separated paths from α_{ r }(u) to x, y. In other words, α_{ r }(u) ∈ ℳ_{ N }(L_{ N }(u)). If α_{ e }(u) = $\mathbb{D}$, suppose that α_{ r }(u_{2}) ≤_{ N } α_{ r }(u_{1}); then α_{ r }(u) = α_{ r }(u_{1}) following Lemma 5, and α_{ r }(u) is not a leaf of N following Remark 1. Let u' be the highest node such that u' ≤_{ G } u_{1} and ${\alpha}_{e}(u^{\prime})=\mathbb{S}$. Then following Lemma 5, ${\alpha}_{r}(u^{\prime})={\alpha}_{r}(u)$, and there exist two separated paths from α_{ r }(u'), i.e. from α_{ r }(u), to two distinct leaves of L_{ N }(u'), i.e. two distinct leaves of L_{ N }(u). The claim is proved similarly for the case where α_{ r }(u_{1}) ≤_{ N } α_{ r }(u_{2}).
Lemma 7 If N is a network that contains h hybridization nodes, then for every subset L of L(N), $|{\mathcal{M}}_{N}(L)|\le h+1$ holds.
The proof of this lemma is deferred to the appendix. We are now ready to state the complexity of Algorithm 2.
Theorem 4 The time complexity of Algorithm 2 is O(h^{2}·|G|·|N|) where h is the number of hybridization nodes of N.
Proof: For every u ∈ V(G), $|\mathcal{C}(u)|$ is equal to the number of possible nodes of N that u can be mapped to, which is bounded by $|{\mathcal{M}}_{N}(L_{N}(u))|$ by Lemma 6, and so by O(h) following Lemma 7.
The for loop at lines 3–23 is performed |G| times, and, at each iteration, the for loop at lines 10–23 is performed O(h^{2}) times. In each iteration of the second loop, the operation computeMin, as detailed in Algorithm 3, requires two breadth-first search traversals, which can be performed in time O(|N|). Moreover, for every node x of $\mathcal{M}in_{N}(y, z)$, by definition there exist two separated paths from x to y, z, which can be extended to two separated paths from x to two distinct leaves l_{1}, l_{2} of L_{ N }(u), where l_{1} is a leaf below y and l_{2} is a leaf below z. This is always possible because, by Remark 1, |L_{ N }(u)| > 1. Hence, x must be in ${\mathcal{M}}_{N}(L_{N}(u))$ by definition, i.e. $\mathcal{M}in_{N}(y, z)\subseteq {\mathcal{M}}_{N}(L_{N}(u))$, and thus $|\mathcal{M}in_{N}(y, z)|\le |{\mathcal{M}}_{N}(L_{N}(u))|\le h+1$. Therefore, the loop at lines 13–16 can be performed in time O(h). The operation merge(L_{1}, L_{2}) at line 23 for two lists of size O(h) can be implemented in time O(h), if we know that the resulting list is also of size O(h). Hence, each iteration of the loop at lines 10–23 takes time O(|N| + h) = O(|N|). Therefore, the total complexity is O(h^{2}·|G|·|N|).
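The count in this proof can be summarized as a single product. The grouping below is a sketch that only restates the bounds given above, using h ≤ |N|:

$\underbrace{|G|}_{\text{nodes } u}\cdot \underbrace{O(h^{2})}_{\text{pairs } (y,z)}\cdot \Big(\underbrace{O(|N|)}_{\text{computeMin}}+\underbrace{O(h)}_{\text{lines 13–16}}+\underbrace{O(h)}_{\text{merge}}\Big)=O(h^{2}\cdot |G|\cdot |N|).$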
Finally, a reconciliation of minimum cost between G and N can be then obtained by a standard backtracking of the matrix $\mathcal{C}$, starting from any pair (x,c) of $\mathcal{C}$(r(G)) such that c is the minimum value over all pairs in $\mathcal{C}$(r(G)).
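The backtracking step can be sketched as follows. Algorithm 2 as stated stores only costs, so the back-pointer table back used here is an assumption introduced for illustration: back[(u, x)] is taken to hold the event chosen for u at x together with the nodes of N assigned to the two children of u. All names are hypothetical.

```python
def backtrack(table, back, children, root):
    """Recover one minimum-cost reconciliation from the computed tables.

    table    : dict mapping a node u of G to its list of (x, c) pairs
    back     : dict mapping (u, x) to (event, y, z), a hypothetical record of
               the event at u and the nodes chosen for its two children
    children : dict mapping an internal node of G to its two children
    Returns (cost, mapping) where mapping[u] = (node of N, event).
    """
    # Start from a pair of minimum cost at the root of G.
    x, cost = min(table[root], key=lambda pair: pair[1])
    mapping = {}
    stack = [(root, x)]
    while stack:
        u, xu = stack.pop()
        if u in children:
            event, y, z = back[(u, xu)]
            mapping[u] = (xu, event)
            u1, u2 = children[u]
            stack.append((u1, y))
            stack.append((u2, z))
        else:
            mapping[u] = (xu, "leaf")
    return cost, mapping

# Toy instance: gene tree u -> (g1, g2), mapped by a speciation at r.
toy_cost, toy_map = backtrack(
    {"u": [("r", 0.0), ("a", 3.0)]},
    {("u", "r"): ("S", "a", "b")},
    {"u": ("g1", "g2")},
    "u",
)
```

Recording the back-pointers when a pair is inserted or improved during the merge step would not change the asymptotic cost of the forward pass.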
Conclusions
In this paper, we have studied two variants of the reconciliation problem between a gene tree and a species network. In particular, for the problem of finding the "most parsimonious" switching of the network, even though the number of switchings can be exponential with respect to the number of hybridization nodes, we proposed an algorithm that is exponential only with respect to the level of the network, which is often low for biological data. Moreover, the problem of finding a reconciliation between a gene tree and a network, which was solved in [4] for a more general model but with a very high complexity, was restudied here for a simpler model, which is more pertinent for some parts of the Tree of Life, and an algorithm with a much smaller complexity was provided. In future work, we intend to implement the algorithms presented in this paper and apply them to biological data.
Appendix
Proof of lemma 3
By Definition 8, for every u ∈ I(G), there must exist H ∈ G_{B(u)} such that u ∈ I(H). If H is an edge, then u is the only internal node of H, which must be an artificial node by Lemma 2. But this is not possible because nodes of G cannot be artificial. Hence, H must be a binary tree. Denote B_{ i } = B(u) and S_{ i } = S(B(u)). We will now prove that $\alpha(u)={\beta}_{{S}_{i}}^{H}(u)$ by induction on the height of u.
Let u be an internal node of G that has two children u_{1}, u_{2} in G, and let H be the binary tree of ${G}_{{B}_{i}}$ such that u ∈ I(H). Denote ${{B}_{i}}_{{}_{\mathsf{\text{1}}}}=B\left({u}_{\mathsf{\text{1}}}\right)$, ${{B}_{i}}_{{}_{\mathsf{\text{2}}}}=B\left({u}_{\mathsf{\text{2}}}\right)$, and ${{S}_{i}}_{{}_{\mathsf{\text{1}}}}=S\left(B\left({u}_{\mathsf{\text{1}}}\right)\right)$, ${{S}_{i}}_{{}_{\mathsf{\text{2}}}}=S\left(B\left({u}_{\mathsf{\text{2}}}\right)\right)$. For j = 1, 2, if u_{ j } is a leaf, let H_{ j } be equal to u_{ j } , otherwise H_{ j } is the binary tree of ${G}_{{B}_{{i}_{j}}}$ such that u_{ j } ∈ I(H_{ j }). For the sake of convenience, if u_{ j } is a leaf, we also denote by ${\beta}_{{S}_{{i}_{j}}}^{{H}_{j}}$ the reconciliation that maps the only leaf u_{ j } to the only leaf x of N such that s(x) = s(u_{ j } ). Note that s(x) = α(u_{ j } ).
We now suppose that $\alpha \left({u}_{j}\right)\phantom{\rule{0.25em}{0ex}}={\beta}_{{S}_{{i}_{j}}}^{{H}_{j}}\left({u}_{j}\right)$ for j = 1, 2 (which is evidently true if u_{ j } is a leaf), and we will show that this implies that the claim is true for u.
Let ${u}_{1}^{\prime}$ (resp. ${u}_{2}^{\prime}$) be the child of u in G_{ N } such that $u_{1}\le_{G_{N}}u_{1}^{\prime}<_{G_{N}}u$ (resp. $u_{2}\le_{G_{N}}u_{2}^{\prime}<_{G_{N}}u$). We respectively denote $B_{i^{\prime}_{1}}=B(u_{1}^{\prime})$ and $B_{i^{\prime}_{2}}=B(u_{2}^{\prime})$. By definition of the LCA reconciliation, we have $\beta_{S_{i}}^{H}(u)=LCA_{S_{i}}(\beta_{S_{i}}^{H}(u_{1}^{\prime}),\beta_{S_{i}}^{H}(u_{2}^{\prime}))$.

(i) If $B_{i_{1}}=B_{i_{2}}$, then $B_{i}=B_{i_{1}}=B_{i^{\prime}_{1}}=B_{i_{2}}=B_{i^{\prime}_{2}}$, and H = H_{1} = H_{2}. This implies that $u_{1}^{\prime}=u_{1}$, because otherwise H would contain an artificial node. The same holds for $u_{2}^{\prime}$ and u_{2}. Thus, $\alpha(u)=LCA_{S}(\alpha(u_{1}),\alpha(u_{2}))=LCA_{S}(\beta_{S_{i_{1}}}^{H_{1}}(u_{1}),\beta_{S_{i_{2}}}^{H_{2}}(u_{2}))=LCA_{S}(\beta_{S_{i}}^{H}(u_{1}),\beta_{S_{i}}^{H}(u_{2}))=LCA_{S_{i}}(\beta_{S_{i}}^{H}(u_{1}),\beta_{S_{i}}^{H}(u_{2}))=LCA_{S_{i}}(\beta_{S_{i}}^{H}(u_{1}^{\prime}),\beta_{S_{i}}^{H}(u_{2}^{\prime}))=\beta_{S_{i}}^{H}(u).$

(ii) If $B_{i_{1}}<_{N}B_{i_{2}}$, then $B_{i}=B_{i_{2}}$, and H = H_{2}. As in point (i), this implies $u_{2}^{\prime}=u_{2}$ and thus $\alpha(u)=LCA_{S}(\alpha(u_{1}),\alpha(u_{2}))=LCA_{S}(\beta_{S_{i_{1}}}^{H_{1}}(u_{1}),\beta_{S_{i}}^{H}(u_{2}^{\prime}))$. If $u_{1}^{\prime}=u_{1}$, then we have $LCA_{S}(\beta_{S_{i_{1}}}^{H_{1}}(u_{1}),\beta_{S_{i}}^{H}(u_{2}^{\prime}))=LCA_{S}(\beta_{S_{i}}^{H}(u_{1}^{\prime}),\beta_{S_{i}}^{H}(u_{2}^{\prime}))=\beta_{S_{i}}^{H}(u)$. Otherwise, $u_{1}^{\prime}$ is an artificial node which is a leaf of H that is mapped to $r(B_{i^{\prime}_{1}})$ by $\beta_{S_{i}}^{H}$. By Lemma 1, $B_{i_{1}}\le_{N}B_{i^{\prime}_{1}}<_{N}B_{i}$, so all nodes of B_{ i } are above all nodes of $B_{i_{1}}$ and above $r(B_{i^{\prime}_{1}})$. This implies that $LCA_{S}(\beta_{S_{i_{1}}}^{H_{1}}(u_{1}),\beta_{S_{i}}^{H}(u_{2}^{\prime}))=LCA_{S}(r(B_{i^{\prime}_{1}}),\beta_{S_{i}}^{H}(u_{2}^{\prime}))=LCA_{S_{i}}(\beta_{S_{i}}^{H}(u_{1}^{\prime}),\beta_{S_{i}}^{H}(u_{2}^{\prime}))=\beta_{S_{i}}^{H}(u)$. The case where $B_{i_{2}}<_{N}B_{i_{1}}$ is similar.

(iii) Suppose now that $B_{i_{1}}$, $B_{i_{2}}$ are not comparable and that $u_{1}^{\prime}\ne u_{1}$ and $u_{2}^{\prime}\ne u_{2}$ (the other cases can be shown by reusing the arguments of point (ii)). Then, as in point (ii), $B_{i_{1}}\le_{N}B_{i^{\prime}_{1}}<_{N}B_{i}$, $B_{i_{2}}\le_{N}B_{i^{\prime}_{2}}<_{N}B_{i}$, and $u_{1}^{\prime}$, $u_{2}^{\prime}$ are leaves of H mapped respectively to $r(B_{i^{\prime}_{1}})$ and $r(B_{i^{\prime}_{2}})$ by $\beta_{S_{i}}^{H}$. Since $\beta_{S_{i_{1}}}^{H_{1}}(u_{1})$ is a node of $B_{i_{1}}$ and $\beta_{S_{i_{2}}}^{H_{2}}(u_{2})$ is a node of $B_{i_{2}}$, we have $LCA_{S}(\beta_{S_{i_{1}}}^{H_{1}}(u_{1}),\beta_{S_{i_{2}}}^{H_{2}}(u_{2}))=LCA_{S}(r(B_{i^{\prime}_{1}}),r(B_{i^{\prime}_{2}}))=LCA_{S_{i}}(\beta_{S_{i}}^{H}(u_{1}^{\prime}),\beta_{S_{i}}^{H}(u_{2}^{\prime}))=\beta_{S_{i}}^{H}(u).$
Therefore, in all cases we have $\alpha(u)={\beta}_{{S}_{i}}^{H}(u)$, and thus the same is true for every u ∈ I(G) by induction.
Proof of lemma 7
We will prove this lemma by induction on the number of hybridization nodes. See Figure 4 for an example.
If there is no hybridization node, then N is a tree, and it is evident that ${\mathcal{M}}_{N}\left(L\right)$ contains exactly one node.
Suppose that the claim is correct for every network having h hybridization nodes. Let N now be a network that has h + 1 hybridization nodes. Let (a, b) be an edge of N whose target b is a hybridization node and such that there is no hybridization node above a (such an edge always exists because N is a directed acyclic graph). Let N' be the network obtained by removing (a, b) from N (and also removing all nodes of indegree 1 and outdegree 1 created by this removal); then N' has exactly h hybridization nodes. Let $Q={\mathcal{M}}_{N}(L)\backslash {\mathcal{M}}_{{N}^{\prime}}(L)$; then every node q in Q must be above a. Indeed, if q is not above a, then no path from q to a node of L contains (a, b), thus q is in ${\mathcal{M}}_{{N}^{\prime}}(L)$, a contradiction. Moreover, by hypothesis, there is no hybridization node above a, hence all nodes of Q must be contained in a path leading to a, and this path does not contain any hybridization node. Enumerate the nodes in Q as q_{1},..., q_{ m } from the lowest to the highest one.
If $|Q|\le 1$, then $|{\mathcal{M}}_{N}(L)|=|{\mathcal{M}}_{{N}^{\prime}}(L)|+|Q|\le h+1+1=h+2$, and we are done.
Suppose now that |Q| = m > 1. In the following, we will define m − 1 edges of N having as target a hybridization node such that, if N" is the network obtained from N' by removing these edges, then ${\mathcal{M}}_{{N}^{\prime}}(L)={\mathcal{M}}_{{N}^{\prime\prime}}(L)$.
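The role of N" in the counting can be made explicit. The derivation below is a sketch of how the bound closes once this equality is established, assuming (as for the edge (a, b)) that removing each of the m − 1 edges eliminates exactly one hybridization node, so that N" has h − (m − 1) hybridization nodes and the induction hypothesis applies to it:

$|{\mathcal{M}}_{N}(L)|\le |{\mathcal{M}}_{{N}^{\prime}}(L)|+|Q|=|{\mathcal{M}}_{{N}^{\prime\prime}}(L)|+m\le \big(h-(m-1)+1\big)+m=h+2.$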
Denote by L^{*} the set of nodes in L that are below q_{ m } in N'. Hence L \ L^{*} is not empty since otherwise q_{ m } would be in ${\mathcal{M}}_{{N}^{\prime}}\left(L\right)$, a contradiction. For every q_{ i }, i = 1,..., m, let ${q}_{i}^{\prime}$, ${q}_{i}^{\u2033}$ be the two children of q_{ i } such that ${q}_{i}^{\prime}$ is above or equal to a. Hence, ${q}_{i}^{\u2033}$, is not above or equal to a since there is no hybridization node above a, thus every path from q_{ i } to L \ L^{*} must pass through ${q}_{i}^{\prime}$ and a, b. By definition, there must exist at least two separated paths from q_{ i } to two leaves of L. Hence, for every i, there exists always a path from q_{ i } to a node of L^{*} that passes through ${q}_{i}^{\u2033}$. Denoted this node by l(q_{ i }).
Denote by L_{ m } the set of nodes in L^{*} such that, for every l ∈ L_{ m }, there is a path from q to l that passes through ${q}_{m}^{\u2033}$. As explained above, L_{ m } is not empty. Recursively, for every i< m, let L_{ i } be the set of nodes in L^{*} \ L_{ m }\... L_{i+1 }such that, for every l ∈ L_{ i }, there is a path from q_{ i }to l that passes ${q}_{i}^{\u2033}$. Note that this set may be empty if i< m, and for every i ≠ j, L_{ i } ∩ L_{ j } = ∅.
We will define, for each i < m, an index c(i) that is strictly greater than i such that L_{c(i) }≠ ∅, together with two paths p_{ i } (resp. ${p}_{i}^{\prime}$) from q_{ i } (resp. q_{c(i)}) to a node of L_{c(i) }as follows: if l(q_{ i }) is not in L_{ i }, then by the definition, there exists a unique j such that L_{ j } contains l(q_{ i }), and j > i. We fix c(i) = j. Next, we define p_{ i } (resp. ${p}_{i}^{\prime}$) as a path from q_{ i } (resp. q_{c(i)}) to l(q_{ i }) that passes ${q}_{i}^{\u2033}$ (resp. ${{q}_{c}^{\u2033}}_{\left(i\right)}$). If l(q_{ i }) is in L_{ i }, then let c(i) be the smallest number that is greater than i and L_{c(i) }≠ ∅ (such an index always exists because L_{ m } ≠ ∅). Let l' be a node of L_{c(i)}. Since q_{ i } is in M_{ N } (L), then there must exist a path from q_{ i } to l', and we define p_{ i } as this path. The path ${p}_{i}^{\prime}$ is defined as the one from q_{c(i) }to l' that passes through ${q}_{c\left(i\right)}^{\u2033}$. Note that p_{ i }, ${p}_{i}^{\prime}$ must contain at least one common hybridization node since they start from different nodes and end at a same leaf of L. Denote by h_{ i } the highest common hybridization node of p_{ i } and ${p}_{i}^{\prime}$. Hence, all h_{ i }s are distinct since each p_{ i } starts at a different node q_{ i }, and each ${p}_{i}^{\prime}$ starts at a node q_{c(i) }that is strictly greater than q_{ i }. We define (a_{ i }, b_{ i }) recursively in increasing order of i from 1 to m  1 as follows. If i = 1, then b_{ i } is the highest hybridizatition node on p_{ i }. If i >1, then b_{ i } is the highest hybridization node on p_{ i } and different from all b_{ k } for every k < i. There exists always such a node b_{ i }, for example h_{ i }. Therefore, all b_{ i }s are distinct. Denote by a_{ i } the parent of b_{ i } on the path p_{ i }.
Denote by N" the network obtained from N' by removing all edges (a_{ i }, b_{ i }). For every node x in ${\mathcal{M}}_{{N}^{\prime}}\left(L\right)$, we will prove that x is also in ${\mathcal{M}}_{{N}^{\u2033}}\left(L\right)$. Denote by x', x" the two children of x. By definition, for every l ∈ L, there exists a path, denoted by f'(l), in N' from x to l such that at least one path among them passes through x' and one other passes through x". To prove that x is in ${\mathcal{M}}_{{N}^{\u2033}}\left(L\right)$, we will now construct another set of paths in N" (i.e. in N' and does not contain any (a_{ i }, b_{ i })) from x to each leaf l of L, denoted by f"(l), such that at least one path among them passes through x'and one other passes through x".
Consider first the case where x is above q_{ m } (as shown in Figure 4). Without loss of generality, suppose that x' is above or equal to q_{ m } while x" is not. Suppose first that f'(l) contains q_{ m }. Then l ∈ L^{*} and f'(l) must pass through x' because there is no hybridization node above q_{ m }. Suppose that $l\in {L}_{1}\cup \dots \cup {L}_{m}$, and let k be the index such that l ∈ L_{ k }. Then we can choose a path f"(l) in N' from x to l that does not contain any (a_{ i }, b_{ i }) as follows. This path starts from x, passes through x', goes down to q_{ k }, then takes the path from q_{ k } to l that contains ${q}_{k}^{\prime\prime}$. Note that this path does not include any p_{ i } since, by construction, every path p_{ i } starts from q_{ i } and goes to a node in L_{c(i)}, which is different from L_{ i }, while this path passes through q_{ k } and goes to a node in L_{ k }. Moreover, this path and p_{ i } cannot have any common hybridization node above a_{ i } because b_{ i } is the highest hybridization node on p_{ i }. Hence, it cannot pass through (a_{ i }, b_{ i }) for any i. If l is not in ${L}_{1}\cup \dots \cup {L}_{m}$, then every path from q_{ m } to l must pass through ${q}_{1}^{\prime}$; we then take the path starting from x, going down to ${q}_{1}^{\prime}$, then continuing to l. This path clearly does not include any p_{ i }, and it cannot share with p_{ i } any hybridization node above a_{ i } because b_{ i } is the highest hybridization node on p_{ i }. Hence it does not pass through any (a_{ i }, b_{ i }). If f'(l) does not contain q_{ m }, then we fix f"(l) = f'(l). In this case f'(l) does not contain any edge (a_{ i }, b_{ i }), because otherwise f'(l) and p_{ i } would have at least one common hybridization node above a_{ i } (since there is no hybridization node above q_{ i }). But this is not possible because b_{ i } is the highest hybridization node on p_{ i }. Note that at least one of the paths f'(l) in this second case must pass through x", since all paths in the first case pass through x'. Hence at least two of the paths f"(l) are separated, and thus x is in ${\mathcal{M}}_{{N}^{\prime\prime}}\left(L\right)$ by definition.
Consider now the case where x is not above q_{ m }. As in the previous case where f'(l) does not contain q_{ m }, we deduce that f'(l) does not contain any edge (a_{ i }, b_{ i }), for every l. Hence, choosing f"(l) = f'(l) for every l completes the argument.
Therefore, we have ${\mathcal{M}}_{{N}^{\prime}}\left(L\right)={\mathcal{M}}_{{N}^{\prime\prime}}\left(L\right)$.
The network N" contains h − |Q| + 1 hybridization nodes, so, by the induction hypothesis, $\left|{\mathcal{M}}_{{N}^{\prime\prime}}\left(L\right)\right|\le h-\left|Q\right|+2$. This implies that $\left|{\mathcal{M}}_{{N}^{\prime}}\left(L\right)\right|=\left|{\mathcal{M}}_{{N}^{\prime\prime}}\left(L\right)\right|\le h-\left|Q\right|+2$, and thus $\left|{\mathcal{M}}_{N}\left(L\right)\right|=\left|{\mathcal{M}}_{{N}^{\prime}}\left(L\right)\right|+\left|Q\right|\le h-\left|Q\right|+2+\left|Q\right|=h+2$.
References
 1.
Doyon JP, Ranwez V, Daubin V, Berry V: Models, algorithms and programs for phylogeny reconciliation. Briefings in Bioinformatics. 2011, 12 (5): 392-400. doi:10.1093/bib/bbr045
 2.
Maddison WP: Gene trees in species trees. Systematic Biology. 1997, 46 (3): 523-536. doi:10.1093/sysbio/46.3.523, [http://sysbio.oxfordjournals.org/content/46/3/523.full.pdf+html]
 3.
Huson D, Rupp R, Scornavacca C: Phylogenetic Networks. 2010, Cambridge University Press
 4.
Libeskind-Hadas R, Charleston MA: On the computational complexity of the reticulate cophylogeny reconstruction problem. JCB. 2009, 16 (1): 105-117.
 5.
Page RDM: Parallel phylogenies: reconstructing the history of host-parasite assemblages. Cladistics. 1994, 10 (2): 155-173.
 6.
Ronquist F: Reconstructing the history of host-parasite associations using generalized parsimony. Cladistics. 1995, 11 (1): 73-89.
 7.
Charleston M: Jungles: a new solution to the host-parasite phylogeny reconciliation problem. Math Biosci. 1998, 149 (2): 191-223.
 8.
Merkle D, Middendorf M: Reconstruction of the cophylogenetic history of related phylogenetic trees with divergence timing information. Theory Biosci. 2005, 123 (4): 277-299.
 9.
Doyon JP, Scornavacca C, Gorbunov KY, Szöllősi GJ, Ranwez V, Berry V: An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. Research in Computational Molecular Biology: Proceedings of the 14th International Conference on Research in Computational Molecular Biology (RECOMB). 2010, LNCS, Springer, Berlin/Heidelberg, Germany, 6398: 93-108. Software downloadable at http://www.atgc-montpellier.fr/Mowgli/
 10.
Cormen TH, Stein C, Rivest RL, Leiserson CE: Introduction to Algorithms. 2001, McGraw-Hill Higher Education, 2
 11.
Choy C, Jansson J, Sadakane K, Sung WK: Computing the maximum agreement of phylogenetic networks. Theoretical Computer Science. 2005, 335 (1): 93-107.
 12.
Fischer M, van Iersel L, Kelk S, Scornavacca C: On Computing the Maximum Parsimony Score of a Phylogenetic Network. SIAM Journal on Discrete Mathematics - SIDMA. 2015.
 13.
Kelk S, Scornavacca C, Van Iersel L: On the elusiveness of clusters. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2012, 9 (2): 517-534.
 14.
Doyon J, Chauve C, Hamel S: Space of gene/species trees reconciliations and parsimonious models. Journal of Computational Biology. 2009, 16 (10): 1399-1418.
 15.
Zmasek CM, Eddy SR: A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics. 2001, 17 (9): 821-828.
 16.
Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G: Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Systematic Biology. 1979, 28 (2): 132-163.
 17.
Page RDM: Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Systematic Biology. 1994, 43 (1): 58-77.
 18.
Chauve C, El-Mabrouk N: New perspectives on gene family evolution: Losses in reconciliation and a link with supertrees. Research in Computational Molecular Biology, 13th Annual International Conference, RECOMB 2009, Tucson, AZ, USA, May 18-21, 2009. Proceedings. 2009, 46-58.
 19.
Kanj IA, Nakhleh L, Than C, Xia G: Seeing the trees and their branches in the network is hard. Theoretical Computer Science. 2008, 401: 153-164. doi:10.1016/j.tcs.2008.04.019
 20.
Downey RG, Fellows MR: Fundamentals of Parameterized Complexity. Springer. 2013, 4:
 21.
Gabow HN, Tarjan RE: A linear-time algorithm for a special case of disjoint set union. Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing. 1983, 246-251.
 22.
Harel D, Tarjan RE: Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing. 1984, 13 (2): 338-355.
 23.
Zhang L: On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. Journal of Computational Biology. 1997, 4: 177-187.
 24.
Bender M, Farach-Colton M: The LCA problem revisited. LATIN 2000: Theoretical Informatics. Lecture Notes in Computer Science, Springer. 2000, 1776: 88-94.
Acknowledgements
This work was partially funded by the French Agence Nationale de la Recherche: Investissements d'Avenir/Bioinformatique (ANR-10-BINF-01-02, Ancestrome). The publication charges of this article were funded by the French Grant Agence Nationale de la Recherche: Investissements d'Avenir/Bioinformatique (ANR-10-BINF-01-02, Ancestrome).
This article has been published as part of BMC Genomics Volume 16 Supplement 10, 2015: Proceedings of the 13th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S10.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Both authors contributed to designing the models and algorithms and to writing the paper.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
To, TH., Scornavacca, C. Efficient algorithms for reconciling gene trees and species networks via duplication and loss events. BMC Genomics 16, S6 (2015). https://doi.org/10.1186/1471-2164-16-S10-S6
Keywords
 tree reconciliation
 gene evolution
 phylogenetic
 parsimony
 phylogenetic networks