Selection of distance-based tree estimation method for Step 2
First, we evaluated several distance-based tree estimation methods to determine which would be most accurate for the tree computation phase of ASTRID. Results on datasets in which all gene trees are complete (no species missing from any gene) are shown in Figure 1, and results on datasets with incomplete gene trees are shown in Figure 2. On datasets with entirely complete gene trees, FastME performed as well as or better than the other distance-based methods, but on some datasets with incomplete distance matrices FastME had very poor accuracy. Therefore, we selected FastME to analyze datasets where the distance matrix has no missing entries, since it had the best accuracy there. For datasets with incomplete distance matrices (that is, where no distance is defined for some pair of species p, q), we selected BioNJ*, since it was generally among the most accurate of the PhyD* methods.
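The selection rule described above can be sketched in a few lines. This is only an illustration, not ASTRID's actual code: the function names are hypothetical, and the encoding of a missing entry as `None` or NaN is an assumption made for the example.

```python
import math


def is_missing(entry):
    # Assumption for this sketch: a missing distance is stored as None or NaN.
    return entry is None or (isinstance(entry, float) and math.isnan(entry))


def choose_distance_method(matrix):
    """Return "FastME" for a complete distance matrix, "BioNJ*" otherwise."""
    n = len(matrix)
    for p in range(n):
        for q in range(p + 1, n):
            # Any undefined off-diagonal entry makes the matrix incomplete.
            if is_missing(matrix[p][q]) or is_missing(matrix[q][p]):
                return "BioNJ*"
    return "FastME"


complete = [[0, 2, 3], [2, 0, 3], [3, 3, 0]]
incomplete = [[0, 2, None], [2, 0, 3], [None, 3, 0]]
print(choose_distance_method(complete))
print(choose_distance_method(incomplete))
```

Only the off-diagonal entries are checked, since the distance from a species to itself is never missing.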
Comparison of ASTRID, ASTRAL, and MP-EST
We begin with a comparison between ASTRID, ASTRAL-2, and MP-EST on the avian simulated datasets with high (1X) ILS, varying number of genes and sequence alignment lengths, but where all genes are complete; see Figure 3.
All methods improved with increasing numbers of genes or increasing sequence length; however, the methods differed substantially in terms of their accuracy. Across all conditions we explored, MP-EST had the highest error and ASTRID had the lowest error. ASTRAL-2 was in between, but was closer to ASTRID than to MP-EST. The gap between MP-EST and ASTRID was very large, and increased with the number of genes. For example, at 1000 genes and gene sequence alignments of length 500, MP-EST had 19% RF error while ASTRID had about 7% RF error. The gap between ASTRID and ASTRAL-2 was substantial on the 200- and 500-gene cases, but very small on the 1000-gene case.
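This section reports RF error without restating its definition. The sketch below assumes it is the normalized Robinson-Foulds distance: the number of non-trivial bipartitions present in exactly one of the two trees, divided by 2(n-3) for n taxa. The nested-tuple tree encoding and all function names are hypothetical choices for the illustration, and both trees are assumed to share the same taxon set with n > 3.

```python
def clades(tree):
    """Return (leaf set, clades under each internal node) for a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found += child_found
    found.append(leaves)
    return leaves, found


def bipartitions(tree):
    """Non-trivial bipartitions, each stored as the side without the anchor taxon."""
    taxa, found = clades(tree)
    anchor = min(taxa)
    splits = set()
    for clade in found:
        side = clade if anchor not in clade else taxa - clade
        if 1 < len(side) < len(taxa) - 1:
            splits.add(side)
    return splits


def rf_error(true_tree, estimated_tree):
    """Normalized RF distance; assumes identical taxon sets and more than 3 taxa."""
    n = len(clades(true_tree)[0])
    differing = bipartitions(true_tree) ^ bipartitions(estimated_tree)
    return len(differing) / (2 * (n - 3))


true_tree = ((("A", "B"), "C"), ("D", "E"))
estimated = ((("A", "C"), "B"), ("D", "E"))
print(rf_error(true_tree, estimated))  # 0.5: one of the two internal branches differs
```

Storing each bipartition as the side that excludes a fixed anchor taxon makes the comparison independent of where the tuple representation happens to be rooted.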
Thus, although MP-EST is statistically consistent under the MSC model and hence theoretically robust to ILS, it did not have particularly good accuracy on these data. Among all coalescent-based methods, MP-EST is probably the one that has been used the most in biological data analyses, but its performance here and in [8, 30] demonstrates that it is not competitive with the best methods on datasets with even moderate numbers of species. Therefore, we omit MP-EST from the rest of this study.
Comparison of ASTRID and ASTRAL-2 on complete gene trees
Comparison on avian datasets. Figure 4 shows the performance of ASTRAL-2 and ASTRID on avian simulated datasets under three ILS conditions (moderate, high, and very high). Both methods performed better when provided with more genes, and both performed worse on higher levels of ILS. Overall, ASTRID tended to outperform ASTRAL-2, with the largest effect seen when many genes were available. With 800 genes available, the ASTRID species tree had an RF error rate that was 2.4 percentage points better than ASTRAL-2's under the very high and high ILS model conditions, and 1.2 percentage points better for the moderate ILS model condition. On the moderate ILS model condition, ASTRID had the greatest advantage over ASTRAL-2 for moderate numbers of genes. Above 200 genes, the error rate dropped below ten percent for both ASTRAL-2 and ASTRID, and ASTRID had an average advantage of only about one percentage point.
It is well known that summary methods improve in accuracy as the number of sites per gene or the number of genes increases [31–34]. We explored the impact of varying the sequence length and number of genes on the avian datasets with high (1X) ILS, as well as on true gene trees. Figure 5 shows results on 10, 100, and 1000 genes; results on other numbers of genes have the same trends (data provided in supplementary materials). As expected, both methods improved with increased sequence length, and had their best accuracy on true gene trees. Both methods also improved as the number of genes increased. ASTRID was always at least as accurate as ASTRAL-2, with the biggest improvement on the shortest sequences (250 bp).
Comparison on mammalian datasets. A comparison of ASTRAL-2 and ASTRID on the mammalian datasets with different levels of ILS (high, moderate, and low) is given in Figure 6. ASTRAL-2 and ASTRID performed fairly similarly on the low (2X branch lengths) and moderate (1X branch lengths) ILS conditions. Under the high ILS level (0.5X branch lengths), ASTRAL-2 was fairly consistently more accurate than ASTRID, with the largest improvement on the 10-gene case.
Comparison on the ASTRAL-2 datasets. We explored performance on the ASTRAL-2 datasets with 200 taxa (model conditions MC1 to MC6, see Figure 7). These model trees varied in ILS level, with MC1 and MC4 having very high ILS, MC2 and MC5 having moderate ILS, and MC3 and MC6 having low ILS. Under MC2, MC3, and MC5, the two methods had essentially identical accuracy. However, under MC1, MC4, and MC6, ASTRAL-2 had an advantage over ASTRID. In MC1 and MC4, the improvement disappeared at 100 genes, but in MC6 ASTRAL-2 was still more accurate than ASTRID on 100 genes.
Comparison on the 15-taxon datasets. The 15-taxon datasets evolved on a caterpillar species tree under very high ILS (AD = 82%), the highest ILS considered in this study. We explored performance under two sequence lengths (100 bp and 1000 bp) and varied the number of genes from 10 to 1000. Results on the 15-taxon datasets (Figure 8) showed very close performance between ASTRID and ASTRAL-2. On the 100 bp alignments and on 1000 bp alignments with at least 100 genes, the two methods could not be distinguished. However, on 1000 bp alignments with at most 50 genes, ASTRAL-2 had an advantage over ASTRID.
Comparison on the 10-taxon datasets. The 10-taxon datasets evolved under two different ILS levels (high and very high), and we explored performance on both true and estimated gene trees; see Figure 9. In general, ASTRID and ASTRAL-2 had very close accuracy on these data, but there were some cases where they had different accuracy levels. For example, on the high ILS condition with estimated gene trees, ASTRAL-2 was more accurate than ASTRID for 200 genes, and ASTRID was more accurate than ASTRAL-2 on 25 genes.
Performance on incomplete gene trees
We explored the impact of missing data on ASTRAL-2 and ASTRID by deleting taxa from gene trees in the 50-taxon datasets (MC8) from the ASTRAL-2 collection, using 150 bp per gene, and varying the number of genes and the amount of missing taxa; see Figure 10. ASTRAL-2 and ASTRID had very similar topological accuracy throughout these experiments. With low amounts of missing data (20% to 40% missing taxa from each gene tree), both methods had very good accuracy (below 5% tree error) by 500 genes. With 60% of the taxa missing from each gene tree, the error rates increased for low numbers of genes (above 20% RF error for up to 100 genes), but then declined to about 10% by 1000 genes. With 80% of the taxa missing from each gene (so that all gene trees have only 10 taxa out of 50), error rates were very high with 25 genes (at least 85% RF), but decreased quickly with increases in the number of genes, so that at 500 genes the error rate was 24%, and then at most 18% at 1000 genes. The trends suggest that the error rates had not plateaued, and that adding additional incomplete gene trees should result in continued improvement.
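The taxon-deletion step can be illustrated with a short sketch. The text above does not say how the deleted taxa were chosen, so the uniform random choice here is an assumption, as are the nested-tuple tree encoding and the function names.

```python
import random


def leaves(tree):
    """List the leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def restrict(tree, keep):
    """Restrict a tree to the taxa in `keep`, suppressing nodes left with one child."""
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kids = [k for k in (restrict(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(kids)


def delete_taxa(tree, missing_fraction, rng):
    """Drop a fraction of the taxa (chosen uniformly at random; an assumption)."""
    taxa = sorted(leaves(tree))
    n_keep = len(taxa) - round(missing_fraction * len(taxa))
    keep = set(rng.sample(taxa, n_keep))
    return restrict(tree, keep)


# Build a 50-taxon caterpillar tree and delete 80% of its taxa.
tree = "t0"
for i in range(1, 50):
    tree = (tree, f"t{i}")
pruned = delete_taxa(tree, 0.8, random.Random(0))
print(len(leaves(pruned)))  # 10 taxa remain out of 50
```

Suppressing single-child nodes keeps the pruned tree a proper tree on the retained taxa, which is what a summary method expects as input.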
Analysis of the mammalian biological dataset
We analyzed the mammalian biological dataset originally studied in [35]. The original dataset had 37 species and 447 genes, but there were 23 erroneous genes (as noted by [20]) which we removed before doing the analysis.
We obtained maximum likelihood gene trees and bootstrap replicates of these gene trees from [22]. We then analyzed these data using ASTRAL-2 and ASTRID+FastME and compared these analyses to previously published trees obtained using ASTRAL and MP-EST [7]. We annotated the branches of the ASTRID+FastME and ASTRAL-2 trees with bootstrap support from 100 multi-locus bootstrapping (MLBS) replicates. The ASTRID+FastME and ASTRAL-2 trees were topologically identical to the ASTRAL tree and differed only in bootstrap support; see Figure 11 for the ASTRID+FastME tree. However, the ASTRID+FastME support for the placement of Scandentia, one of the major open questions about mammalian evolution, was very low at only 47% (ASTRAL-2 gave it 82%). Hence, neither the ASTRID tree nor the ASTRAL-2 tree resolved the placement of Scandentia with high support.
Running time results
Asymptotic running time
ASTRID has two steps: the first computes the distance matrix, and the second uses a selected distance-based method to construct a tree from that matrix. When the input has n species and k genes, the distance matrix can be calculated in O(kn^2) time. Distance-based tree estimation methods typically run in O(n^2) to O(n^3) time, and this step does not depend on k. Hence, the overall running time depends on the selected distance-based method, but is generally dominated by the first step, especially for typical inputs, for which k >> n. Thus, under the assumption that k > n and that ASTRID uses a distance-based method that runs in O(n^3) time, ASTRID's running time is O(kn^2).
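To make the cost of the first step concrete, the sketch below averages leaf-to-leaf distances over k gene trees, handling each pair of leaves once per tree at its lowest common ancestor. It is an illustration under stated assumptions, not ASTRID's implementation: this section does not define the per-tree distance, so the sketch counts the edges on the path between two leaves (counting internal nodes instead would shift every entry by one), and it assumes gene trees are given as nested tuples whose top-level node has at least three children, as unrooted trees are commonly written.

```python
from itertools import combinations


def leaf_depths(tree, pair_dist):
    """Return {leaf: edges up to this node}; record pairs whose LCA is this node."""
    if not isinstance(tree, tuple):
        return {tree: 0}
    child_maps = []
    for child in tree:
        depths = leaf_depths(child, pair_dist)
        child_maps.append({leaf: d + 1 for leaf, d in depths.items()})
    # Leaves in different children meet at this node, so their path runs through it.
    for left, right in combinations(child_maps, 2):
        for a, da in left.items():
            for b, db in right.items():
                pair_dist[frozenset((a, b))] = da + db
    merged = {}
    for child_map in child_maps:
        merged.update(child_map)
    return merged


def average_internode_distances(gene_trees):
    """Average each pair's distance over the gene trees that contain both taxa."""
    totals, counts = {}, {}
    for tree in gene_trees:
        pair_dist = {}
        leaf_depths(tree, pair_dist)
        for pair, d in pair_dist.items():
            totals[pair] = totals.get(pair, 0) + d
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


genes = [(("A", "B"), "C", "D"), (("A", "C"), "B", "D")]
matrix = average_internode_distances(genes)
print(matrix[frozenset(("A", "B"))])  # (2 + 3) / 2 = 2.5
```

Each tree contributes O(n^2) pair updates, which gives the O(kn^2) total for k trees. A pair of taxa that never appears together in any gene tree is simply absent from the result, which is the incomplete-matrix case.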
ASTRAL-2's scaling is more complicated to discuss. Asymptotically, ASTRAL-2 runs in O(nk|X|^2) time, where n is the number of species, k is the number of genes, and X is a set of bipartitions it computes to constrain the search space. The size of X is not bounded by a polynomial in the input size, and the technique that ASTRAL-2 uses means that X can be large under conditions with high ILS. Thus the asymptotic running times of ASTRAL-2 and ASTRID (used with various distance methods) are quite different.
Running times on simulated data
In practice, creating the distance matrix took the majority of the running time. On 1000 taxa, creating the distance matrix took several minutes to several hours, depending on the number of genes, but running FastME took less than one second regardless of the number of genes. However, the PhyD* methods were much slower than FastME; on 1000 taxa, running any of the PhyD* methods took approximately 40 minutes (data not shown). ASTRID depends on FastME, PhyD*, and DendroPy [36].
We recorded running times for ASTRAL-2, ASTRID-FastME, and NJst on avian simulated datasets with high ILS (1X), as we varied the number of genes (see Figure 12). ASTRID-FastME was by far the fastest of the three methods, and NJst was the slowest. However, the trends suggest that NJst will be faster than ASTRAL-2 for larger numbers of genes. Note also that ASTRID-FastME and NJst both scaled linearly with the number of genes, but that ASTRAL-2's running time scaled superlinearly.
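One simple way to distinguish linear from superlinear scaling is to fit the slope of log(time) against log(number of genes): a slope near 1 indicates linear growth, and a slope clearly above 1 indicates superlinear growth. The sketch below is a generic least-squares fit, not the analysis behind Figure 12, and the timings in the example are synthetic values generated from known power laws, not measurements from this study.

```python
import math


def loglog_slope(points):
    """Least-squares slope of log(time) versus log(genes) for (genes, time) pairs."""
    xs = [math.log(genes) for genes, _ in points]
    ys = [math.log(time) for _, time in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


gene_counts = [50, 100, 200, 500, 1000]
# Synthetic timings only: one exactly linear series and one growing as k^1.5.
linear = [(k, 0.02 * k) for k in gene_counts]
superlinear = [(k, 0.02 * k ** 1.5) for k in gene_counts]
print(round(loglog_slope(linear), 3), round(loglog_slope(superlinear), 3))
```

With real timings, constant overheads at small gene counts can flatten the fitted slope, so a fit of this kind is best restricted to the larger inputs.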
We recorded running times for two variants of ASTRID (one using FastME and the other using BioNJ*), and compared them to ASTRAL-2 on ASTRAL-2 simulated datasets with 1000 taxa (MC11) as we varied the number of genes (Figure 13) and on 500-gene datasets in which we varied the number of taxa (MC2 and MC7 to MC10, see Figure 14). The relative running times show that all methods were very fast on smaller datasets, but were clearly distinguished on the larger datasets, where ASTRID-FastME was much faster than ASTRID-BioNJ* and both variants of ASTRID were much faster than ASTRAL-2. For example, on the dataset with 1000 genes and 1000 taxa, ASTRID-FastME finished in 33 minutes, ASTRID-BioNJ* finished in 1 hour and 10 minutes, and ASTRAL-2 finished in 12 hours and 30 minutes.
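The running times quoted for the dataset with 1000 genes and 1000 taxa can be turned into speedup ratios with a few lines of arithmetic. The helper names are hypothetical; the three timings are the ones reported above.

```python
def minutes(hours=0, mins=0):
    """Convert hours and minutes to total minutes."""
    return 60 * hours + mins


# Timings reported for the dataset with 1000 genes and 1000 taxa.
times = {
    "ASTRID-FastME": minutes(mins=33),
    "ASTRID-BioNJ*": minutes(hours=1, mins=10),
    "ASTRAL-2": minutes(hours=12, mins=30),
}


def speedup(slow, fast):
    """How many times faster `fast` finished than `slow`."""
    return times[slow] / times[fast]


print(round(speedup("ASTRAL-2", "ASTRID-FastME"), 1))       # 22.7
print(round(speedup("ASTRAL-2", "ASTRID-BioNJ*"), 1))       # 10.7
print(round(speedup("ASTRID-BioNJ*", "ASTRID-FastME"), 1))  # 2.1
```

On this one dataset, ASTRID-FastME was therefore roughly 23 times faster than ASTRAL-2 and about twice as fast as ASTRID-BioNJ*.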
Running times on biological data
We recorded running times for ASTRID-FastME and ASTRAL-2 on the mammalian biological dataset. Both methods took 6 seconds for a single bootstrap replicate on one core of a 2.7 GHz Intel Xeon processor with 424 genes and 37 taxa.