Proof of Lemma ?? (Link between rooted and unrooted trees):
Let T1 and T2 be two rooted trees and \(T^{\prime }_{1}\) and \(T^{\prime }_{2}\) be the corresponding unrooted trees, i.e. V(T1′)=V(T1)∪{R}, V(T2′)=V(T2)∪{R}, E(T1′)=E(T1)∪{(r(T1),R)} and E(T2′)=E(T2)∪{(r(T2),R)}.
We first show that any bad bipartition of \(T^{\prime }_{1}\), i.e. any bad edge of \(T^{\prime }_{1}\), corresponds to a bad clade of T1 (a clade which is not present in T2). Let \(e^{\prime }_{1}\) be a bad edge of \(T^{\prime }_{1}\). Then \(e^{\prime }_{1}\) must be a non-terminal edge of \(T^{\prime }_{1}\), thus different from (r(T1),R), and therefore it has a corresponding edge e1=(x1,y1) in T1. Then, for one of the two nodes adjacent to \(e^{\prime }_{1}\) that we denote y1′, we have \(L(T'_{1y'_{1}}) = L(T_{1y_{1}})=C\). Since e1′ is a bad edge of \(T^{\prime }_{1}\), C must be a bad clade of T1, i.e. not present in T2. Otherwise, C would be a non-trivial clade of T2 rooted at an internal node y2 adjacent to an edge e2=(x2,y2), and thus also equal to \(L(T'_{2y'_{2}})\) for the corresponding edge e2′=(x2′,y2′) of \(T^{\prime }_{2}\). This contradicts the fact that \(e^{\prime }_{1}\) is a bad edge. Therefore, each bad bipartition of \(T^{\prime }_{1}\) corresponds to a bad clade of T1. Moreover, two distinct bad bipartitions of \(T^{\prime }_{1}\) correspond to two different bad edges of \(T^{\prime }_{1}\), and the corresponding edges of T1 are associated with two distinct clades. Thus we have \(|\mathcal {B}(T'_{1})| \leq |\mathcal {C}(T_{1})|\).
Conversely, a bad clade C of T1 corresponds to an internal node y1 of T1. Let e1=(x1,y1) be the edge of T1 where x1 is the parent of y1. Then the corresponding edge \(e^{\prime }_{1}\) in \(T^{\prime }_{1}\) is a bad edge. Moreover, two distinct clades of T1 correspond to two distinct edges of T1′. It follows that \(|\mathcal {C}(T_{1})| \leq |\mathcal {B}(T'_{1})|\). Combining this result with the result above, we deduce that \(|\mathcal {C}(T_{1})| = |\mathcal {B}(T'_{1})|\). As T2 and \(T^{\prime }_{2}\) can be considered similarly, the result follows.
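As a concrete check of the correspondence established above, the following sketch counts bad clades and bad bipartitions on a small example. The nested-tuple tree encoding, the function names and the two example trees are assumptions made for illustration and do not come from the original text. The bipartitions are computed directly on the unrooted tree (by cutting each internal edge) rather than derived from the clades.

```python
from itertools import count

def leaves(t):
    """Leaf set of a rooted tree given as a leaf name or a tuple of subtrees."""
    if isinstance(t, str):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))

def clades(t, is_root=True):
    """Non-trivial clades: leaf sets below internal, non-root nodes."""
    if isinstance(t, str):
        return set()
    out = set() if is_root else {leaves(t)}
    for c in t:
        out |= clades(c, is_root=False)
    return out

def build_unrooted(t, extra_leaf="R"):
    """Adjacency sets of T': the tree T plus a new leaf R attached to r(T)."""
    adj, ids = {}, count()
    def add(node):
        if isinstance(node, str):
            adj.setdefault(node, set())
            return node
        name = ("internal", next(ids))
        adj[name] = set()
        for child in node:
            c = add(child)
            adj[name].add(c)
            adj[c].add(name)
        return name
    root = add(t)
    adj[extra_leaf] = {root}
    adj[root].add(extra_leaf)
    return adj

def leaves_on_side(adj, start, blocked):
    """Leaves reachable from `start` without crossing the edge to `blocked`."""
    seen, stack, found = {start, blocked}, [start], set()
    while stack:
        x = stack.pop()
        if len(adj[x]) == 1:
            found.add(x)
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return frozenset(found)

def bipartitions(adj):
    """Non-trivial bipartitions, one per internal (non-terminal) edge."""
    out = set()
    for x in adj:
        for y in adj[x]:
            if len(adj[x]) > 1 and len(adj[y]) > 1:
                out.add(frozenset([leaves_on_side(adj, x, y),
                                   leaves_on_side(adj, y, x)]))
    return out

# Hypothetical example trees on the leaf set {a, b, c, d, e}.
T1 = ((("a", "b"), "c"), ("d", "e"))
T2 = (("a", ("b", "c")), ("d", "e"))

bad_clades = clades(T1) - clades(T2)
bad_bips = bipartitions(build_unrooted(T1)) - bipartitions(build_unrooted(T2))
print(len(bad_clades), len(bad_bips))
```

On this example both counts coincide, as the proof predicts: the only bad clade of T1 is {a, b}, and the only bad bipartition of T1′ is the one separating {a, b} from the remaining leaves (including R).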
Proof of Lemma ?? (Edit distance):
The non-negativity and identity conditions are obvious. For symmetry, notice that we can reverse every edit operation in an optimal sequence from T1 to T2 to obtain a sequence from T2 to T1 with the same number of events, and vice-versa (extensions and contractions are inverses of each other, and any flip can be reversed by a flip). We thus have δ(T2,T1)≤δ(T1,T2) and δ(T1,T2)≤δ(T2,T1), and equality follows.
Finally, we prove the triangle inequality: for three trees T1, T2 and T3, to transform T1 into T2, we may take an optimal edit sequence from T1 to T3, followed by an optimal edit sequence from T3 to T2. It follows that δ(T1,T2)≤δ(T1,T3)+δ(T3,T2).
Proof of Lemma ?? (Pairs of maximal bad subtrees):
As \(\bigcup_{i} Y_{i} = \mathcal{L}\), \(\{e'_{i}\}_{1 \leq i \leq k}\) are the only terminal edges of any subtree S′ of T′ containing the set \(\{e'_{i}\}_{1 \leq i \leq k}\) as terminal edges. As T′ is a tree, for any 1≤i≠j≤k, there is only one possible path from \(x^{\prime }_{i}\) to \(x^{\prime }_{j}\). Uniqueness follows.
Suppose that such a subtree S′ is not a bad subtree. Then it contains an internal good edge e′=(x′,y′). In other words, there is a non-trivial bipartition of \(\{Y_{i}\}_{1 \leq i \leq k}\) which is also a bipartition in S. This contradicts the fact that S is a bad subtree of T. Finally, as all terminal edges of S′ are good edges of T′, it follows that S′ is a maximal bad subtree of T′.
Proof of Lemma ?? (Contract non-mixed bad edges):
We first introduce a definition that will be of use later in the proof. For two rooted trees S1 and S2, define the union of S1 and S2 as the tree obtained by identifying their roots, i.e. by removing the root of S2 and making all its children now children of the root of S1.
Let e={u,v} be a non-mixed bad edge and assume, without loss of generality, that both u and v have the label Spe (recall that Λ={Spe,Dup}). Notice that any sequence of operations turning T into T′, at some point, must contract the {u,v} edge, as otherwise, the (bad) bipartition corresponding to {u,v} would remain in the transformed tree and we would not obtain T′ (noting that extensions cannot remove bipartitions). We now prove the Lemma by induction over δ(T,T′). As a base case, suppose that δ(T,T′)=1. Then {u,v} must be the only bad edge of T and the single operation is to contract it, proving the base case.
Now assume that for any tree \(\tilde {T}\) satisfying \(\delta (\tilde {T}, T') < \delta (T, T')\), contracting any non-mixed bad edge of \(\tilde {T}\) reduces its distance to T′ by 1. Let Q=(q1,…,ql) be an optimal sequence of operations transforming T into T′ (here each qi denotes either a contraction, extension or flip). Let qj be the event that contracts {u,v}. If q1=qj, then we are done, so assume otherwise. We assume that whenever there is a contraction involving u prior to qj, the contracted node is still called u. Furthermore, we assume that if an extension prior to qj splits the neighbors of u, the node v is still a neighbor of u after the operation. The same assumptions hold for v. This only changes the names we give to nodes and does not alter the scenario, but observe that it means that {u,v} is present in every tree obtained before the j-th operation.
For each i∈{1,…,l}, let Ti be the tree obtained after applying q1,…,qi on T, and define T0=T. Furthermore, for i∈{0,1,…,j−1}, denote by \(T^{u}_{i}\) and \(T^{v}_{i}\) the two trees obtained from Ti by removing the edge {u,v}, where u is in \(T^{u}_{i}\) and v is in \(T^{v}_{i}\). Define \(T^u = T^{u}_{0}\) and \(T^v = T^{v}_{0}\). We will assign u and v as the respective roots of each \(T^{u}_{i}\) and \(T^{v}_{i}\). Notice that for each i∈{1,…,j−1}, qi only modifies either the subtree \(T^{u}_{i-1}\) or \(T^{v}_{i-1}\). Therefore, if events qi and qi+1 modify \(T^{u}_{i-1}\) and \(T^{v}_{i}\), respectively, we could apply qi+1 before qi and Ti+1 would still be the same tree. This lets us assume that we may reorder events such that all events affecting Tu (prior to qj) occur before those affecting Tv. That is, there is some h such that q1,…,qh only affects the Tu subtree, qh+1,…,qj−1 only affects the Tv subtree, so that \(T^{u}_h = T^{u}_{h+1} = \ldots = T^{u}_{j-1}\) and \(T^v = T^{v}_1 = \ldots = T^{v}_{h}\).
Suppose first that u is labeled Spe in Th, and thus also in Tj−1. Then v is also labeled Spe in Tj−1 (and also in Th since v was untouched until qh+1). Let \(\hat {T}\) be the tree obtained after contracting {u,v} in T, and let z be the resulting node. Observe that if we interpret z as u, then we may apply the events q1,…,qh on \(\hat {T}\), since these events only affected the Tu subtrees. To be formal, we “reproduce” q1 through qh on \(\hat {T}\) by applying the events Q′=(q1′,…,qh′) on \(\hat {T}\), defining \(\hat {T}_{i}\) as the tree obtained after the i-th event of Q′, where each \(q^{\prime }_{i}\) in Q′ is defined as follows:
- if qi contracts {x,y} in Ti−1, then \(q^{\prime }_{i}\) contracts {x,y} in \(\hat {T}_{i-1}\) if x,y≠u, otherwise if, say, x=u, then \(q^{\prime }_{i}\) contracts {z,y} (and calls the resulting node z);
- if qi flips x in Ti−1, then \(q^{\prime }_{i}\) flips x in \(\hat {T}_{i-1}\) if x≠u, or flips z otherwise;
- if qi is an extension and splits the neighborhood of x, then \(q^{\prime }_{i}\) does the same if x≠u (replacing u by z if needed). If x=u, then let X be the set of neighbors of v in Ti−1, excluding u. If Ch(u) is split into A and B by qi, where v∈B, then \(q^{\prime }_{i}\) splits the neighbors A∪B∪X of z into A and B∪X (and z is the neighbor of B∪X and the newly created node).
One can verify that the following invariant holds on each \(\hat {T}_{i}, i \in \{1, \ldots, h\}\): if we take Ti and contract the edge {u,v}, ignoring the labels and keeping the label of u, then we obtain \(\hat {T}_{i}\) (the invariant is also true for T and \(\hat {T}\)).
The resulting tree \(\hat {T}_{h}\) obtained from applying q1′,…,qh′ on \(\hat {T}\) will therefore contain z as a Spe node, and will be the union of \(T^{u}_{h}\) and \(T^{v}_{0}\). From this point, in a similar fashion, we may interpret z as v and apply qh+1,…,qj−1 on \(\hat {T}_{h}\), resulting in a tree that is the union of \(T^{u}_h = T^{u}_{j-1}\) and \(T^{v}_{j-1}\). The corresponding events are defined as above, so we omit the formal details. Since Tj is obtained from Tj−1 by contracting {u,v}, this means that \(\hat {T}_{j-1} = T_{j}\), which we have attained with j events but contracting {u,v} first, which proves this case.
Suppose instead that u is labeled Dup in Th. Then v is a Dup node in Tj−1. We may further assume that v is a Spe node in Th+1,…,Tj−2, since whenever we flip v into a Dup, we may assume by induction that {u,v} gets contracted. Therefore, qj−1 flips v from Spe to Dup, and for the first time. We may then do the following: first apply the events qh+1,…,qj−2 on \(\hat {T}\), interpreting z as v. The resulting tree \(\hat {T}'\) contains z as a Spe node, and is the union of \(T^{v}_{j-2}\) and \(T^{u}_{0}\). We may now apply q1,…,qh on \(\hat {T}'\) by interpreting u as z, resulting in a tree \(\hat {T}^{\prime \prime }\) that contains z as a Dup node and is the union of \(T^{u}_{h} = T^{u}_{j-1}\) and \(T^{v}_{j - 1}\). We have thus attained Tj, but this time without the qj−1 flip on v, contradicting the optimality of Q. This concludes the proof.
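The "union" of two rooted trees used in this proof can be illustrated on plain adjacency sets. The sketch below is an illustration only: the helper names (contract, split, union) and the example tree are hypothetical, and the Spe/Dup labels are deliberately ignored. It checks that contracting {u,v} gives the same tree as taking the union of \(T^{u}\) and \(T^{v}\), rooted at u and v respectively, with the merged node keeping the name u.

```python
def contract(adj, u, v):
    """Contract the edge {u, v}; the merged node keeps the name u."""
    assert v in adj[u]
    new = {x: set(ns) for x, ns in adj.items() if x != v}
    new[u].discard(v)
    for w in adj[v]:
        if w != u:
            new[w].discard(v)
            new[w].add(u)
            new[u].add(w)
    return new

def split(adj, u, v):
    """Remove the edge {u, v} and return the two components (T^u, T^v)."""
    cut = {x: set(ns) for x, ns in adj.items()}
    cut[u].discard(v)
    cut[v].discard(u)
    def component(start):
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in cut[x] - seen:
                seen.add(y)
                stack.append(y)
        return {x: set(cut[x]) for x in seen}
    return component(u), component(v)

def union(tu, u, tv, v):
    """Identify the root v of tv with the root u of tu."""
    out = {x: set(ns) for x, ns in tu.items()}
    for x, ns in tv.items():
        if x != v:
            out[x] = {u if y == v else y for y in ns}
    out[u] |= tv[v]
    return out

# Hypothetical example: u has leaves a, b; v has leaves c, d.
adj = {
    "u": {"a", "b", "v"}, "v": {"u", "c", "d"},
    "a": {"u"}, "b": {"u"}, "c": {"v"}, "d": {"v"},
}
tu, tv = split(adj, "u", "v")
print(contract(adj, "u", "v") == union(tu, "u", tv, "v"))
```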
Proof of Lemma ?? (Upper bound δ):
Methodology 1 performs e contractions and e′ extensions. As for the number of flips, we have to flip at most all the nodes belonging to the smallest label group, which means at most half the nodes in each tree, and thus at most n flips in total.
Proof of Lemma ?? (Compare Meth.1 and Meth.2):
We denote by Cont(T) the minimum length of a sequence of operations contracting T, and by \(l(\mathcal{P})\) the length of a sequence \(\mathcal{P}\) of edit operations (Fig. 7).
Let \(\mathcal{P}_{2}\) be an optimal sequence contracting S to S∗ and \(\mathcal{P}'_{2}\) be an optimal sequence contracting S′ to S∗′. As each operation is reversible, \(\mathcal{P}'_{2}\) leads to a corresponding sequence \(\mathcal{P}''_{2}\) of the same length between S∗′ and S′. Thus, \(\mathcal{P}_{2}\), concatenated with a possible flip operation transforming S∗ to \(S^{\prime }_{*}\), concatenated with \(\mathcal{P}''_{2}\), is a sequence from S to S′ following Methodology 1, and thus M1(S,S′)≤M2(S,S′) (R1).
Conversely, let \(\mathcal{P}\) be an optimal sequence following Methodology 1. Then this sequence can be subdivided into a sequence \(\mathcal{P}_{1}\) from S to a star tree S1, and \(\mathcal{P}'_{1}\) from S1 to S′. As each operation is reversible, \(\mathcal{P}'_{1}\) leads to a corresponding sequence \(\mathcal{P}''_{1}\) of the same length between S′ and S1. In other words, \(M1(S,S') = l(\mathcal{P}_{1}) + l(\mathcal{P}'_{1}) = l(\mathcal{P}_{1}) + l(\mathcal{P}''_{1}) \geq Cont(S) + Cont(S')\).
1. If S∗=S∗′, then M2(S,S′)=Cont(S)+Cont(S′) and thus M1(S,S′)≥M2(S,S′), and the result follows from (R1).
2. Otherwise, S∗ and S∗′ are different and M2(S,S′)=Cont(S)+Cont(S′)+1. Thus M1(S,S′)≥Cont(S)+Cont(S′)=M2(S,S′)−1, and thus M2(S,S′)≤M1(S,S′)+1.
Proof of Lemma ?? (Optimal path contracting a mixed tree):
We first show that at least ⌈diam(T)/2⌉−1 flips are needed, by induction over the diameter of T. When diam(T)=2, T is a star tree and 0=diam(T)/2−1 flips are needed. For the induction step, we assume that any tree T′ with diam(T′)<diam(T) requires at least ⌈diam(T′)/2⌉−1 flips. Take any optimal sequence of events S, and observe that in S, when we flip a node v of T, by Lemma ?? we may assume that S contracts all the incident edges to v until we obtain another mixed tree. Let T1,T2,…,Tk be the sequence of mixed trees encountered when applying S, i.e. each Ti is obtained after flipping a node and contracting its incident edges. Define T0=T. Let i be the smallest index such that diam(Ti)<diam(T). Then in Ti−1, there was a longest chain P=(u1,…,ul) of length diam(T). The flip-and-contract operations from Ti−1 to Ti can reduce the length of P by at most 2 since we flip one node and only its incident edges, of which there are at most two on P. Hence diam(Ti)≥diam(T)−2. We deduce by induction that the number of required flips is at least 1+⌈(diam(T)−2)/2⌉−1=⌈diam(T)/2⌉−1.
We now turn to the converse bound ϕ(T)≤⌈diam(T)/2⌉−1. Fix any node v of T, and suppose that we run the following procedure: as long as T is not a star tree, flip v and contract its incident internal edges. Since each flip-and-contraction iteration reduces the distance from v to any leaf by 1 (except for leaves that are already neighbors of v), eccT(v) is reduced by 1 each round. We stop when eccT(v)=1, in which case only terminal edges remain. Hence eccT(v)−1 flips suffice, where eccT(v) is measured in the original tree.
To see why this proves our bound, we show that there always exists a node with eccentricity ⌈diam(T)/2⌉. Consider a longest chain P of T with nodes w1,…,wk. Observe that diam(T)=k−1 (recall that distances are counted in terms of edges). Consider a midpoint node \(w := w_{\lceil k/2 \rceil}\) on P. We claim that eccT(w)=⌈diam(T)/2⌉. It is easy to check that w has distance at most ⌈diam(T)/2⌉ and at least ⌊diam(T)/2⌋ to the leaves w1 and wk on P. Assume for contradiction that w is at distance at least ⌈diam(T)/2⌉+1 from some leaf l of T not in P. Then either we can form a chain from w1 to w and then to l, or a chain from wk to w and then to l. This chain has length at least ⌊diam(T)/2⌋+⌈diam(T)/2⌉+1>diam(T), a contradiction. This shows that eccT(w)=⌈diam(T)/2⌉ and concludes the proof.
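The quantities used in this proof can be computed with breadth-first search. The following sketch is an illustration under stated assumptions: the adjacency-set encoding, the function names and the small caterpillar-shaped example tree are hypothetical and not taken from the original. It computes the diameter, finds a node of minimum eccentricity, and evaluates ⌈diam(T)/2⌉−1.

```python
from collections import deque
from math import ceil

def make_tree(edges):
    """Build adjacency sets from a list of undirected edges."""
    adj = {}
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    return adj

def distances(adj, src):
    """Edge-count distances from src to every node (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def eccentricity(adj, v):
    """Largest distance from v to any node (always attained at a leaf)."""
    return max(distances(adj, v).values())

def diameter(adj):
    return max(eccentricity(adj, v) for v in adj)

def flip_bound(adj):
    """The value ceil(diam(T)/2) - 1 from the lemma."""
    return ceil(diameter(adj) / 2) - 1

# Hypothetical example: internal path p1 - p2 - p3 with leaves attached.
tree = make_tree([
    ("a", "p1"), ("b", "p1"), ("p1", "p2"), ("c", "p2"),
    ("p2", "p3"), ("d", "p3"), ("e", "p3"),
])
center = min(tree, key=lambda v: eccentricity(tree, v))
print(diameter(tree), center, eccentricity(tree, center), flip_bound(tree))
```

In this example the diameter is 4, the midpoint p2 has eccentricity 2 = ⌈4/2⌉, and a single flip of p2 followed by contracting its two incident internal edges produces a star tree, matching the bound ⌈diam(T)/2⌉−1 = 1.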
Proof of Theorem ?? (Upper bound Meth.2):
Consider a given instance (T,T′). Take any leaf of T and assign it as the root, and do the same for T′. Although we have assumed roots of degree at least two so far, we use this rooting only for our analysis in order to fix a parent-child relationship between nodes. Let Q be an optimal sequence of operations turning T into T′. We may assume that Q first contracts every non-mixed edge, and our algorithm does the same. Therefore, we suppose that T and T′ contain no non-mixed edges. Assume for our purposes that whenever a contraction takes place in Q between a node u and a child v, the u node stays in the tree and v gets removed (here the notion of a child is in the rooted sense with respect to our rooting above). Also assume that when there is an extension splitting a node u, then the newly created node becomes a child of u and u retains the same parent. It is easily checked that this only alters the names of nodes and not the sequence itself.
Call an internal node v of T a good child if the edge between v and its parent is good. Note that v has a unique corresponding node in T′ which we denote v′ (i.e. v′ is the root of the same clade as the subtree rooted at v). Further, call v a bad-good child if v is a good child, but either the label of v differs from that of v′, or v is incident to at least one bad edge (yes, children are capable of being both bad and good). Note that every bad subtree of T is rooted at a bad-good child, and observe that here we say that a bad-good child v that is incident to only good edges is a particular case of a bad subtree (i.e. v just has the wrong label).
We already know that δ(T,T′) is at least the number of bad edges in T and T′. Let Q′ be the set of operations of Q that are either flips, or contractions of good edges. We argue that |Q′| is at least the number of bad-good children in T. To see this, let v be a bad-good child. Assume first that v is not incident to any bad edge. If we never flip v nor remove it by contracting its parent edge, then Q cannot transform T into T′, as v and its underlying clade remain present in every tree from T to T′, but with the wrong label (because a contraction not removing v cannot remove the v clade, and extensions can create clades but not remove them). So we may assume that v gets flipped or that its parent edge gets contracted. A flip must be in Q′ and, observing that at any point the parent edge of v must be good, a contraction removing v must also be in Q′. Assume instead that v is incident to at least one bad edge {v,w}, with w a child of v. If v is never flipped nor removed owing to a contraction of its parent edge, then at some point w must be flipped so that the {v,w} edge gets contracted. Otherwise, if v gets removed, then its parent edge was contracted, again implying the contraction of a good edge. Either case implies an operation in Q′. Importantly, observe that the operations in Q′ identified above are all distinct, since each one implies a flip or a node removal of a node in a different bad subtree of T.
Now, let T1,…,Tk be the bad subtrees of T and T′, and for each i∈{1,…,k}, let ti be the number of bad edges in Ti. Further denote \(b = \sum _{i=1}^k t_{i}\). Since bad subtrees form pairs, our arguments above imply that Q′ has at least k/2 operations (because |Q′| is at least the number of bad subtrees in T, which is half the total number of bad subtrees). The contraction of bad edges plus the operations of Q′ show that Q has at least \(\sum _{i = 1}^k t_i + k/2 = b + k/2\) operations. Our algorithm contracts b edges in total. To count the number of flips, take any bad subtree Ti. Then ti≥diam(Ti)−2 and the number of flips we perform is at most ⌈diam(Ti)/2⌉−1=⌈(diam(Ti)−2)/2⌉≤ti/2+1. Note that this also holds when Ti contains no bad edge. Therefore, the number of operations that we perform is at most \(b + \sum _{i=1}^k (t_i/2 + 1) = 3b/2 + k\). Our approximation ratio is therefore \(\frac {3b/2 + k}{b + k/2} \leq \frac {2b + k}{b + k/2} = 2\).
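The arithmetic of this final bound can be checked on a concrete instance. The sketch below is illustrative only: the function name and the list of bad-edge counts are hypothetical, and it assumes at least one bad subtree so that the lower bound is non-zero. It evaluates the lower bound b + k/2, the upper bound 3b/2 + k, and their ratio using exact fractions.

```python
from fractions import Fraction

def approximation_bounds(t):
    """Given the bad-edge counts t_i of the k bad subtrees, return
    (lower bound on the optimum, upper bound on the algorithm, ratio)."""
    k, b = len(t), sum(t)
    lower = Fraction(b) + Fraction(k, 2)   # b + k/2
    upper = Fraction(3 * b, 2) + k         # 3b/2 + k
    return lower, upper, upper / lower

# Hypothetical instance: k = 4 bad subtrees with 3, 1, 2 and 2 bad edges.
lower, upper, ratio = approximation_bounds([3, 1, 2, 2])
print(lower, upper, ratio)
```

Here b = 8 and k = 4, so the lower bound is 10, the upper bound is 16, and the ratio 8/5 stays below 2, as guaranteed by the inequality 3b/2 + k ≤ 2b + k = 2(b + k/2).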