Anchoring quartet-based phylogenetic distances and applications to species tree reconstruction

Background Inferring species trees from gene trees using the coalescent-based summary methods has been the subject of much attention, yet new scalable and accurate methods are needed. Results We introduce DISTIQUE, a new statistically consistent summary method for inferring species trees from gene trees under the coalescent model. We generalize our results to arbitrary phylogenetic inference problems; we show that two arbitrarily chosen leaves, called anchors, can be used to estimate relative distances between all other pairs of leaves by inferring relevant quartet trees. This results in a family of distance-based tree inference methods, with running times ranging between quadratic to quartic in the number of leaves. Conclusions We show in simulated studies that DISTIQUE has comparable accuracy to leading coalescent-based summary methods and reduced running times. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-3098-z) contains supplementary material, which is available to authorized users.


List of Figures
S8 Species trees generated using DISTIQUE on the Avian biological dataset [1]. 12 S9 Species trees generated using ASTRID (NJst) and ASTRAL on Avian biological dataset [1] Tables   Table S1: Empirical statistics of simulated Avian and Mammalian datasets. Model condition 2X corresponds to the case where ILS is reduced by increasing the branch lengths (2 times longer), and 0.5X represents the case where ILS is increased by reducing the branch lengths (2 times shorter). In the same way, the model condition with 0.2X corresponds to the case where ILS is reduced by dividing the branch lengths by five. Average Robinso-Foulds (RF) distances between true gene trees and the model species tree are provided in AD to species tree. # gene trees shows the number of gene trees that are available for the corresponding dataset and ILS. #base pairs represents the number of base pairs, and # replicates shows the number of replicates for the corresponding dataset and ILS. In column Ref., the reference paper for each dataset is provided. For the Mammalian with ILS level 0.2X, # replicates 5 and 10 are for the model conditions where # gene trees is 3200, and 1600 respectively. Also for the Avian dataset with ILS level 1X, # replicates 10 is only for the model condition with # gene trees 2000.    Mean and standard error of species tree error is shown for true and estimated gene trees (500bp alignments). With very high ILS (0.2X), the accuracy for all of the implementations of DISTIQUE are close. As ILS decreases, when DISTIQUE is applied to the entire dataset, the error goes up, which is more pronounced for all-pairs-max. We attribute this pattern to difficulties of estimating long quartet lengths. When DISTIQUE is used to resolve polytomies in the consensus tree, the accuracy improves with decreased ILS, as expected. Note that even with reduced ILS, the consensus tree on estimated gene trees misses more than 25% of branches, and leaves some polytomies for DISTIQUE to resolve. Note that even with reduced ILS, the consensus tree misses more than 30% of branches, and leaves some polytomies for DISTIQUE to resolve. For tree-sum 2 rounds of sampling is somewhat better than one round, but beyond 2 rounds, little improvements are observed. For distance-sum, further improvements are observed as the number of rounds increases, but four rounds seems to give a reasonably good estimate without increasing the number of sampling rounds too much. For tree-sum 2 rounds of sampling is somewhat better than one round, but beyond 2 rounds, little improvements are observed. For distance-sum, further improvements are observed as the number of rounds increases, but four rounds seems to give a reasonably good estimate without increasing the number of sampling rounds too much. The accuracy of DISTIQUE with various distance methods is used. Methods used from the FastME suit: BioNJ, Neighbor-Joining, BalME (-s), BalME (-nni), OLSME (-s), OLSME (-nni). Methods used from the PhyD* suit: BinoNJ and MVR.    Figure S8: Species trees generated using DISTIQUE on Avian biological dataset [1]. Branches that are different between the DISTIQUE tree and both ASTRAL and ASTRID trees are marked in red, while branches that are only different between ASTRAL and DISTIQUE are marked with blue.  Figure S9: Species trees generated using ASTRID (NJst) and ASTRAL on Avian biological dataset [1]. ASTRAL (left) and ASTRID (right) species trees. The branches that are different between DISTIQUE species tree and both ASTRAL and ASTRID species trees are marked with red, while branches that are only different between ASTRAL and DISTIQUE are marked with blue.

Supplementary theoretical results 2.1 Supplementary lemmas
Lemma S1 Given m species trees, each on n−2 leaves and assuming the set of m trees is complete, and each tree is binary and correct, a compatibility supertree generates the correct tree.
Proof. Imagine we didn't and the supertree had a wrong internal branch. For every quartet defined around that branch (selecting a leaf from each of the four sides around the branch), by completeness, there is a tree T i that has that quartet, and because T i is correct and binary, that quartet should be resolved correctly. Since we can argue this for all quartets around the wrong branch, we get a contradiction where the branch is wrong but all quartets around it are correct. Moreover, since our m trees are binary and complete, the supertree is binary. Imagine it wasn't and take an unresolved quartet around its polytomy; by completeness, that quartet is at least in one T i . Thus, the supertree does not refine input tree T i , contradicting the compatibility supertree requirement. Since the supertree has no wrong edges and is binary, it is the correct tree.
Lemma S2 Given a species tree not necessarily fully resolved with n leaves, anchor-sampling strategy in Algorithm S4 yeilds at most O(n) anchors per each round of anchor sampling.
Proof. We have two types of polytomies; a polytomy is short if its degree is smaller than 5, and otherwise it is called a large polytomy. Our anchoring strategy is that per each round of sampling we have d i 2 number of samples for a large polytomy, 6 samples for a polytomy of degree d i = 4, and 10 anchors for a polytomy of degree d i = 5. Note that a species has at most n − 2 internal nodes, n − 3 internal branches and n terminal (trivial) branches. Also, each internal branch might be adjacent to at most two polytomies.
where m is the number of internal nodes. In one extreme, let's consider the case where all the polytomies are large; then, the number of anchors required per each round of anchor sampling is which is clearly O(n). Now let's assume that all the polytomies are small. The species tree has at most n − 3 internal branches which might be in at most two small polytomies. So we might have at most 2(n − 3) small polytomies, then where 10 in above equation is the maximum number of required samples for a short polytomy. Finally, we might have a combination of small polytomies, and large polytomies and the number of required anchors is at most sum of required anchors from Equations 2, and 3, which is clearly O(n).

Difficulties with long branches and need for majority consensus
In this section, we will explain a problematic case that leads us to use a majority consensus. Figure S7 shows an unrooted species tree, with many long branches, with length L. We assume L is long enough that for our given number of gene trees, with high probability, all gene trees will be topologically identical to the species tree. Note that for any number of genes, there are branches long enough where discordance is highly unlikely (we give one example below). Thus, we assume that for branches of length L, we have zero quartet trees that conflict with them.
In our example, the distance of a to c is L and a to b is 3L (note that in our analyses, we only calculate branch length for internal branches). We will show that all-pairs can be misled to give a smaller distance for a to b than a to c.
Recall The definition of distance matrices: Also recall that for coalescent-based analyses, β = ln 3, The first term comes from choosing one anchor from {u 1 , u 2 , u 3 }, and choosing the other anchors from {c} ∪ X. In these cases, a and b are further from each other than the anchors, and because of our assumption about L, the frequency of ab.uv is expected to be zero. The second term comes from choosing both anchors from {c} ∪ X; for these, the frequency of ab.uv is expected to be n.
The first term comes from choosing an anchor from of the {u 1 , u 2 , u 3 , b}, and the other anchor from X; here, for long enough L, frequency of ab.uv would be expected to be zero. The second term comes from choosing both anchors from X, and the last term comes from choosing both anchors from {u 1 , u 2 , u 3 , b}; in these cases we expect the frequency of ab.uv to be n. Comparing the Equations 9 and 10, for |X| > 2, the distance between a and b would be smaller than the distance between a and c. This clearly is in contradiction to our tree, so DISTIQUE without the use of majority consensus, in this case, becomes misleading. However, note that all branches with length L are assumed to generate no discordance, and thus will be in the final tree.
Large L: Assume that L = 16 (in coalescent units) and we have 1000 gene trees. The probability of having only topologies that agree with the species tree in all 1000 gene trees is (1−2/3e −16 ) 1000 = 0.99993. Thus, we expect all n gene trees to have that topology with very high probability. The probability of having only the species tree topology for branches of length 2L and 3L is even larger.

Computing pseudo-counts for allpairs-max
In allpairs-max methods, in case of zero frequencies for some quartet topologies, without changing the definition of the pseudo-count, the relative information about quartets would be lost. For example, assume we have topology ab.cz with long internal branch length, like 16 as mentioned, and ab.dz with internal branch length of 20 (longer than the previous length), and no sample for none of the topologies that contradict the species tree. In this case, the distance between a, and c is equal to the distance of a, and d which is ln 0.5 n+1.5 , where n is the number of samples. So the relative information is lost. In order to avoid this problem, the definition of pseudo count in allpairs-max is slightly changed. In order to have a pseudo count that could capture the relative distances, first, the number of zero quartet topologies are counted. This is called n 0 ab . The pseudo count, in this case, is defined as: Where 0.5 comes from our add half estimator, is probability of zero frequencies, and k i is the number of samples for quartet topology of i.

Tree inference algorithms
Let D denote the input data (e.g., for a summary method, the set of input gene trees, G, was the input data). Regardless of the nature of D, we require having a quartet estimator: Quartet estimator A quartet estimator θ D (Q) is a function that given a quartet of leaves Q = {a, b, c, d}, uses D to estimate the quartet tree topology and the quartet length τ (Q). A quartet estimator is statistically consistent if, as the size of D increases, the estimated quartet topology and length both converge in probability to correct values.
Statistically consistent quartet estimators can be designed for various types of data. For example, the four point condition can estimate quartet trees from sequence data [6], and the log-det method gives a model-based way of estimating corrected branch lengths [7]. As we will show, DISTIQUE uses a consistent coalescent-based quartet estimator.
Given a consistent quartet estimator θ D (.), Theorems 1 and 2 define a family of statistically consistent phylogenetic reconstruction methods that range in running time between Θ(n 2 ) and Θ(n 4 ). Algorithm S1 shows general forms of these algorithms. All-pairs and all-pairs-max, which are the simplest methods in this family, use equations (6) and (7) to compute the distance matrix. They then compute the tree using neighbor joining [8][9][10] (but any consistent distance method with a safety radius [10] could be used).
Algorithm S1 Anchored quartet-based algorithms. θ D (Q) is a quartet estimator and returns the quartet topology and length τ (Q). α and β are constants, and f (x) is a monotonically increasing function bounded above by β for positive x (i.e., 0 < f (x) < β for x > 0). Anchors(.) uses some strategy to select a subset of all possible anchor pairs.
function Anchors(L) return a set of anchor pairs Theorem S1 All-pairs and all-pairs-max phylogenetic reconstruction methods shown in Algorithm S1 are statistically consistent.
Proof (sketch). Since θ D (.) is assumed statically consistent, in limit, it will return the correct quartet topology with arbitrarily high probability, and its estimates of quartet lengths can be made arbitrarily close to true values with any desired probability. From this and Theorems 1 and 2, it follows that distance matrices used in AllPairs and AllPairsMax are arbitrarily close to additive, with high probability. The proof follows from the statistical consistency of neighbor joining for distance matrices that are within a safety radius of additivity [10].
Algorithms Distance-sum and Tree-sum use a subset of all n 2 anchors, combining the results from multiple anchors using methods we described in Section 3.2.3 of the main paper. Both distance-sum and tree-sum methods first select m pairs of anchors with a criterion of choice (e.g., those mentioned for DISTIQUE in Section 3.2.2 of the main paper and detailed in Algorithm S4). For each anchor pair {u, v}, a double anchored distance matrix on L − {u, v} is computed. Treesum computes m trees, each on n − 2 leaves using neighbor joining [8], and then combines the m trees using a supertree method (e.g., SuperFine [11] or MRL [12]). Distance-sum combines distance matrices by averaging them, ignoring missing values. We have not been able to prove consistency for the distance-sum strategy, but tree-sum can be proved consistent with an appropriate choice of the supertree method. Recall a compatibility supertree refines all its inputs in the output when the input trees are compatible. We define a set of trees complete if every quartet of leaves appears in at least one tree.
Theorem S2 The tree-sum phylogenetic reconstruction method shown in Algorithm S1 is statistically consistent if used with a compatibility supertree and when the anchor selection generates a complete set of m trees.
Proof. By Theorem 1 and arguments similar to those used for Theorem S1, each tree T 1 . . . T m on n − 2 leaves is a statistically consistent estimate; thus, for any < 1, there is a dataset size such that each T i is correct with probability 1 − . Taking the largest of these dataset sizes, with probability at least (1 − ) m , every T i is correct. By setting < 1 − (1 − ) 1 m we argue that for any , there is a dataset size where all T i s are correct with probability at least 1 − . By Lemma S1 the use of a compatibility supertree method and the completeness of m binary and correct trees guarantee that the supertree is the correct tree.
Combining these algorithms with our approaches for dealing with long branches (i.e., the use of a consensus tree) results in a somewhat more complicated algorithm. Algorithm S2 shows the detailed steps for our default distance-sum approach.
A note on branch lengths: Anchored distances (when computed exactly) set the length of terminal branches to zero. A constant can be added to all distances for inference methods that expect non-zero terminal lengths. AllPairsMax returns statistically consistent estimates of internal branch lengths, but branch lengths from our other methods should be ignored.

Commands and version numbers
We used ASTRAL version 4.7.8 to find the species trees from gene trees: Here the flag a specifies which averaging method to use the partial quartet tables from complete quartet tables around each polytomy.
For tree-sum we used the following command: