Volume 16 Supplement 10
Proceedings of the 13th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics: Genomics
ASTRID: Accurate Species TRees from Internode Distances
 Pranjal Vachaspati^{1} and
 Tandy Warnow^{1}
DOI: 10.1186/1471-2164-16-S10-S3
© Vachaspati and Warnow 2015
Published: 2 October 2015
Abstract
Background
Incomplete lineage sorting (ILS), modelled by the multispecies coalescent (MSC), is known to create discordance between gene trees and species trees, and lead to inaccurate species tree estimations unless appropriate methods are used to estimate the species tree. While many statistically consistent methods have been developed to estimate the species tree in the presence of ILS, only ASTRAL2 and NJst have been shown to have good accuracy on large datasets. Yet, NJst is generally slower and less accurate than ASTRAL2, and cannot run on some datasets.
Results
We have redesigned NJst to enable it to run on all datasets, and we have expanded its design space so that it can be used with different distance-based tree estimation methods. The resultant method, ASTRID, is statistically consistent under the MSC model, and has accuracy that is competitive with ASTRAL2. Furthermore, ASTRID is much faster than ASTRAL2, completing in minutes on some datasets for which ASTRAL2 used hours.
Conclusions
ASTRID is a new coalescent-based method for species tree estimation that is competitive with the best current method in terms of accuracy, while being much faster. ASTRID is available in open source form on GitHub.
Keywords
incomplete lineage sorting, phylogenomics, species trees, ASTRAL, NJst, MPEST, FastME, PhyD*, neighbor joining
Background
Species tree estimation in the presence of gene tree incongruence is a major challenge for many biological analyses. Gene tree incongruence can result from a variety of processes, notably incomplete lineage sorting (ILS) [1], which is modelled by the multispecies coalescent (MSC) [2]. Concatenated maximum likelihood analysis is the most common method for species tree estimation from multiple loci, but it can be statistically inconsistent, and even positively misleading, in some cases [3], converging to an incorrect tree as the amount of sequence data increases.
In recent years, a number of species tree estimation methods have been developed that are statistically consistent under the MSC, and so will converge in probability to the true species trees as the amount of data increases; see [4–6]. Methods that are statistically consistent under the MSC include ASTRAL [7], ASTRAL2 [8], *BEAST [9], BEST [10], the population tree from BUCKy [11], METAL [12], MPEST [13], NJst [14], SNAPP [15], STEAC [16], STEM [17], and SVDquartets [18]. While little is yet known about some of these methods (either because they have not yet been adequately studied or because they are not yet implemented), only a few of them (MPEST, NJst, and ASTRAL2) have been shown to be able to analyze very large datasets (especially those with large numbers of taxa) with high accuracy. MPEST has been used more than either NJst or ASTRAL2, but NJst is more accurate than MPEST, and ASTRAL2 is more accurate than both [8]. Furthermore, the currently available implementation of NJst is slower than ASTRAL2, and cannot run on some datasets [8, 14].
In this paper, we present ASTRID, a new ILS-aware distance-based method for species tree estimation. Our approach is based on NJst, but is substantially faster, and, unlike NJst, functions even when each gene tree contains only a small portion of the data. The input to NJst is a set of unrooted gene trees. In the first step, an n × n matrix D is computed, where D[x, y] is the average distance (in terms of number of edges) between x and y among all the gene trees. In the second step, neighbor joining [19], a very popular distance-based method of phylogeny estimation, is used to produce the species tree.
ASTRID improves on NJst by enabling other distance-based methods to be used in the second step. In particular, although NJ cannot be run on distance matrices with missing entries, other distance-based methods can, and ASTRID enables the use of these other methods. We also explore the use of more accurate distance-based methods. Thus, ASTRID is a very simple modification to NJst. As we will show, ASTRID is much faster than NJst.
The comparison of ASTRID with ASTRAL2 and MPEST, two established coalescent-based summary methods, is also interesting. ASTRID completed in minutes on some datasets where the other methods took hours, and was fast enough to analyze datasets with 1000 species and 1000 genes on a single processor within an hour (ASTRAL2 and MPEST take much more time on datasets of this size). Furthermore, ASTRID clearly dominates MPEST in terms of accuracy, and is competitive with ASTRAL2 (more accurate in some cases, and less accurate in others). Finally, ASTRID has desirable theoretical properties: it runs in polynomial time, and it remains statistically consistent under the MSC model without assuming a molecular clock or requiring rooted gene trees as input.
Methods
ASTRID
The input to ASTRID is a set of unrooted gene trees T_{1}, ..., T_{k}. We let $S_{i}=L\left({T}_{i}\right)$ denote the leaf set of T_{i}, and $S={\cup}_{i}L\left({T}_{i}\right)$. Let $|S| = n$.
Step 1: Construct n × n matrix $\overline{M}$:
1 For all i = 1, 2, ..., k, compute n × n matrix M_{ i }, as follows. For pairs p, q of species where both are in S_{ i }, set M_{ i }(p, q) to be the number of edges in the path between p and q in T_{ i }. For all other pairs p, q (i.e., where one or both are not in S_{ i }), set M_{ i }(p, q) = 0. Thus, the only nonzero entries in M_{ i } are for pairs of species in T_{ i }.
2 For all {p, q} ⊂ S, let n(p, q) be the number of trees T_{ i } that contain both p and q.
3 Define n × n matrix $\overline{M}$ by setting $\overline{M}\left(p,q\right)=\frac{{\sum}_{i}{M}_{i}\left(p,q\right)}{n\left(p,q\right)}$ if n(p, q) > 0, and $\overline{M}\left(p,q\right)=-1$ (to denote a missing value) otherwise.
Step 2: Compute a tree on $\overline{M}$ using a selected distance-based method.
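Step 1 can be sketched in a few lines of Python. This is an illustration only, not the ASTRID implementation: it assumes each gene tree is supplied as a list of undirected edges together with its leaf set (a hypothetical input format), and the function names `internode_distances` and `average_internode_matrix` are ours. Pairs of species that never appear together in a gene tree are marked with −1, the missing-value marker used for incomplete distance matrices in this paper.

```python
from collections import deque


def internode_distances(edges, leaves):
    """Return {(p, q): number of edges on the path between leaves p and q}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist = {}
    for src in leaves:
        # Breadth-first search from each leaf gives edge counts to all nodes.
        depth = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in depth:
                    depth[nxt] = depth[node] + 1
                    queue.append(nxt)
        for dst in leaves:
            if dst != src:
                dist[(src, dst)] = depth[dst]
    return dist


def average_internode_matrix(gene_trees):
    """gene_trees: list of (edges, leaves) pairs. Returns (taxa, m_bar)."""
    taxa = sorted(set().union(*(leaves for _, leaves in gene_trees)))
    index = {t: i for i, t in enumerate(taxa)}
    n = len(taxa)
    total = [[0.0] * n for _ in range(n)]   # sum over i of M_i(p, q)
    count = [[0] * n for _ in range(n)]     # n(p, q)
    for edges, leaves in gene_trees:
        for (p, q), d in internode_distances(edges, leaves).items():
            total[index[p]][index[q]] += d
            count[index[p]][index[q]] += 1
    m_bar = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # -1 marks a pair that co-occurs in no gene tree.
                m_bar[i][j] = total[i][j] / count[i][j] if count[i][j] else -1.0
    return taxa, m_bar
```

Each breadth-first search visits every node of one gene tree once, so a tree with n leaves costs O(n^{2}) and k trees cost O(kn^{2}) in total.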
Datasets
Table 1. Empirical statistics of simulated datasets used in this study

| Dataset | # genes | # taxa | ILS level (AD%) | # sites | ref. |
|---|---|---|---|---|---|
| Avian very high ILS (0.5X) | 1000 | 48 | 60 (VH) | 500 | [22] |
| Avian high ILS (1X) | 1000 | 48 | 47 (H) | 250–1500 | [22] |
| Avian moderate (2X) | 1000 | 48 | 29 (M) | 500 | [22] |
| Mammalian high ILS (0.5X) | 200 | 37 | 50 (H) | 250–1000 | [22] |
| Mammalian moderate ILS (1X) | 200 | 37 | 29 (M) | 250–1000 | [22] |
| Mammalian low ILS (2X) | 200 | 37 | 21 (L) | 250–1000 | [22] |
| 10-taxon very high ILS | 200 | 10 | 89 (VH) | 100 | [23] |
| 10-taxon high ILS | 200 | 10 | 48 (H) | 100 | [23] |
| 15-taxon clocklike | 1000 | 15 | 82 (VH) | 100–1000 | [23] |
| ASTRAL2 500K-1e-6 (MC1) | 1000 | 200 | 69 (VH) | 300–1500 | [8] |
| ASTRAL2 2M-1e-6 (MC2) | 1000 | 200 | 33 (M) | 300–1500 | [8] |
| ASTRAL2 10M-1e-6 (MC3) | 1000 | 200 | 21 (L) | 300–1500 | [8] |
| ASTRAL2 500K-1e-7 (MC4) | 1000 | 200 | 68 (VH) | 300–1500 | [8] |
| ASTRAL2 2M-1e-7 (MC5) | 1000 | 200 | 34 (M) | 300–1500 | [8] |
| ASTRAL2 10M-1e-7 (MC6) | 1000 | 200 | 9 (L) | 300–1500 | [8] |
| ASTRAL2 2M-1e-6 (MC7) | 1000 | 10 | 17 (L) | 300–1500 | [8] |
| ASTRAL2 2M-1e-6 (MC8) | 1000 | 50 | 30 (M) | 300–1500 | [8] |
| ASTRAL2 2M-1e-6 (MC9) | 1000 | 100 | 34 (M) | 300–1500 | [8] |
| ASTRAL2 2M-1e-6 (MC10) | 1000 | 500 | 34 (M) | 300–1500 | [8] |
| ASTRAL2 2M-1e-6 (MC11) | 1000 | 1000 | 35 (M) | 300–1500 | [8] |
All datasets included both true gene trees and estimated gene trees (the latter obtained by using maximum likelihood methods on the true sequence alignments), as well as species trees estimated on these gene trees in the prior publications. Each gene tree had at most one copy of each species. We computed ASTRID species trees for these datasets, using various techniques for Step 2 (how to compute the species tree given the distance matrix).
We estimated the amount of ILS in the data by quantifying the average gene tree discordance, using the average Robinson-Foulds (RF) [21] distance between true gene trees and the model species tree, expressed as a percentage (written AD for "average distance"). We also explored some simulated datasets where the DNA sequence evolution was under the strict molecular clock. Model conditions with AD at most 25% can be considered low ILS, conditions with AD between 26% and 39% can be considered moderate ILS, conditions with AD between 40% and 59% can be considered high ILS, and conditions with AD of at least 60% can be considered very high ILS. In Table 1, we indicate the ILS level of each model condition with both the AD value and the general level (L for low, M for moderate, H for high, and VH for very high).
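The AD thresholds above translate directly into a small lookup. The following Python sketch is only an illustration: the function name `ils_level` is ours, and how values falling between the stated integer bands (for example 25.5% or 39.5%) are assigned is an assumption, since the text defines the bands on whole percentages.

```python
def ils_level(ad_percent):
    """Map an AD value (in percent) to the ILS label used in Table 1."""
    if ad_percent <= 25:
        return "L"    # low ILS: AD at most 25%
    if ad_percent < 40:
        return "M"    # moderate ILS: AD between 26% and 39%
    if ad_percent < 60:
        return "H"    # high ILS: AD between 40% and 59%
    return "VH"       # very high ILS: AD of at least 60%
```

For instance, the mammalian 2X condition (AD = 21%) maps to "L" and the avian 0.5X condition (AD = 60%) maps to "VH".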
Mammalian and avian simulated datasets
These datasets were created in [22] to evaluate method performance under model conditions similar to real data. Species trees were generated with MPEST for the avian phylogenomics dataset with 48 species and 14,446 loci [24], and for a mammalian dataset with 37 species and 447 loci [25]. These species trees were used as basic model trees, with branch lengths in coalescent units. In addition, two other model species trees were created for each dataset by scaling the species tree branch lengths up (to reduce ILS) or down (to increase ILS). The ILS levels of the resultant model species trees were very heterogeneous, ranging from AD = 21% (low) to 50% (high) for the mammalian simulation, and from AD = 29% (moderate) to 60% (very high) for the avian simulation.
Both datasets had sequences of length 500 for all three model conditions. For the default ("1X") branch length condition, the avian dataset also had sequences of length 250, 500, 1000, and 1500, and the mammalian dataset had sequences of length 250, 500 and 1000. Sequence evolution on these datasets deviated from the strict molecular clock.
10-taxon simulated datasets
These data were presented in [23], and explored two ILS levels (AD = 48% (high) and AD = 89% (very high)). Sequence evolution deviated from the strict molecular clock.
15-taxon clocklike simulated datasets
These datasets evolved under a strict molecular clock, and were presented in [23]. The species tree was a caterpillar model tree (i.e., a path with leaves hanging off the path) with very short internal branches, and a long branch to the outgroup species. The ILS level in these data was very high (AD = 82%).
ASTRAL2 simulated datasets
These data were presented in [8], and provided a variety of model conditions with varying ILS levels, tree shapes, numbers of taxa, and sequence lengths per locus. SimPhy [26] was used to generate the species and gene trees, based on two parameters: the number of generations (given as the first number in the model) and the speciation rate (given as the second number). The number of generations simulated was 500K, 2M, or 10M, and the speciation rate was either 1e-6 or 1e-7. Model conditions with fewer generations had more ILS. Model conditions with the 1e-6 speciation rate had speciation events nearer the tips (leaves) of the trees, while model conditions with the 1e-7 speciation rate had speciation events nearer the root. The ILS levels varied from very low (AD = 9%) to very high (AD = 69%). Sequences evolved down the gene trees under multiple GTRGAMMA models that deviated from the strict molecular clock. Maximum likelihood gene trees were computed using FastTree2.
Incomplete gene tree datasets
To explore performance on incomplete gene trees, we modified the ASTRAL2 datasets by randomly removing taxa from trees in the 10-taxon and 50-taxon datasets. Up to 40 taxa were removed from the 50-taxon dataset, and up to 5 taxa were removed from the 10-taxon dataset. In each of these cases, maximum likelihood gene trees were estimated using FastTree2 version 2.1.7 SSE3 [27], using the following command:
fasttree -nt -gtr -quiet -nopr -gamma -n 1000 <fastafile> > <genetreefile>
where <fastafile> was the input file of aligned sequences and <genetreefile> was the output file.
Distance-based tree estimation methods
In order to explore the design space for ASTRID, we ran various distance-based methods for Step 2 (computing the tree from the distance matrix). For incomplete distance matrices (where some entries are −1, indicating that the pair of taxa do not appear together in any gene tree), we explored the methods in PhyD* [28]: NJ*, BIONJ*, MVR*, and UNJ*. These algorithms are all variants on neighbor joining that work on incomplete distance matrices. We also explored FastME [29], which is a heuristic for the minimum evolution problem.
ASTRAL2
To compute ASTRAL2 species trees on the incomplete gene trees generated for the ASTRAL2 datasets, we ran ASTRAL2 version 4.7.8, using the command:
java -Xmx4000M -jar astral.4.7.8.jar -i <genetrees> -o <outputtree>
Computing tree error
All trees computed in this study were fully resolved. We report the RF tree error (the proportion of the branches in the model tree missing from the estimated tree), using scripts that are available in the supplementary online materials.
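The RF tree error defined above can be illustrated with a short Python sketch. This is not the script from the supplementary materials: it assumes trees are given as lists of undirected edges over a shared leaf set (a hypothetical input format), and the function names `bipartitions` and `rf_error` are ours. Each internal branch is identified with the bipartition of the leaves it induces, and the error is the fraction of model-tree bipartitions absent from the estimated tree.

```python
def bipartitions(edges, leaves):
    """Return the non-trivial leaf bipartitions induced by a tree's edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    leaves = frozenset(leaves)
    anchor = min(leaves)  # fixed leaf used to pick a canonical side
    splits = set()
    for u, v in edges:
        # Collect everything reachable from u without crossing edge (u, v).
        seen = {u}
        stack = [u]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if node == u and nxt == v:
                    continue
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        side = leaves & seen
        # Keep only internal branches (at least two leaves on each side).
        if 2 <= len(side) <= len(leaves) - 2:
            if anchor not in side:
                side = leaves - side
            splits.add(side)
    return splits


def rf_error(model_edges, estimated_edges, leaves):
    """Proportion of model-tree internal branches missing from the estimate."""
    model = bipartitions(model_edges, leaves)
    estimated = bipartitions(estimated_edges, leaves)
    if not model:
        return 0.0  # no internal branches to miss (three or fewer leaves)
    return len(model - estimated) / len(model)
```

Because both trees are fully resolved and share the same leaf set, they have the same number of internal branches, so this missing-branch rate equals the false-positive rate as well.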
Results
Selection of distance-based tree estimation method for Step 2
Comparison of ASTRID, ASTRAL, and MPEST
All methods improved with increasing numbers of genes or increasing sequence length; however, the methods differed substantially in terms of their accuracy. Across all conditions we explored, MPEST had the highest error and ASTRID had the lowest error. ASTRAL2 was in between, but was closer to ASTRID than to MPEST. The gap between MPEST and ASTRID was very large, and increased with the number of genes. For example, at 1000 genes and gene sequence alignments of length 500, MPEST had 19% RF error while ASTRID had about 7% RF error. The gap between ASTRID and ASTRAL2 was substantial on the 200- and 500-gene cases, but very small on the 1000-gene case.
Thus, although MPEST is statistically consistent under the MSC model and hence theoretically robust to ILS, it did not have particularly good accuracy on these data. Among all coalescentbased methods, MPEST is probably the one that has been used the most in biological data analyses, but its performance here and in [8, 30] demonstrates that it is not competitive with the best methods on datasets with even moderate numbers of species. Therefore, we omit MPEST from the rest of this study.
Comparison of ASTRID and ASTRAL2 on complete gene trees
Performance on incomplete gene trees
Analysis of the mammalian biological dataset
We analyzed the mammalian biological dataset originally studied in [35]. The original dataset had 37 species and 447 genes, but there were 23 erroneous genes (as noted by [20]) which we removed before doing the analysis.
Running time results
Asymptotic running time
ASTRID has two steps: the first step computes the distance matrix, and the second step uses a selected distance-based method to construct a tree from the distance matrix. When the input has n species and k genes, then calculating the distance matrix can be performed in O(kn^{2}) time. Distance-based tree estimation methods typically run in O(n^{2}) to O(n^{3}) time, but this step no longer depends on k. Hence, the overall running time depends on the selected distance-based method, but is generally dominated by the first phase, especially for typical inputs, for which k >> n. Thus, under the assumption that k > n and that ASTRID uses a distance-based method that runs in O(n^{3}) time, ASTRID's running time is O(kn^{2}).
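The bound follows from adding the costs of the two steps. As a worked sketch, assuming (as stated above) that Step 2 uses a method with O(n^{3}) running time and that k > n:

$$T(n,k)=\underbrace{O\left(kn^{2}\right)}_{\text{Step 1}}+\underbrace{O\left(n^{3}\right)}_{\text{Step 2}},\qquad k>n\;\Rightarrow\;n^{3}=n\cdot n^{2}<kn^{2}\;\Rightarrow\;T(n,k)=O\left(kn^{2}\right).$$

For example, with n = 200 species and k = 1000 genes, kn^{2} = 4 × 10^{7} while n^{3} = 8 × 10^{6}, so the distance matrix computation accounts for the larger term.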
ASTRAL2's scaling is more complicated to discuss. Asymptotically, ASTRAL2 runs in O(nk|X|^{2}) time, where n is the number of species, k is the number of genes, and X is a set of bipartitions it computes to constrain the search space. The size of X is not bounded by a polynomial in the input size, and the technique that ASTRAL2 uses means that X can be large under conditions with high ILS. Thus the asymptotic running times of ASTRAL2 and ASTRID (used with various distance methods) are quite different.
Running times on simulated data
In practice, creating the distance matrix took the majority of the running time. On 1000 taxa, creating the distance matrix took several minutes to several hours, depending on the number of genes, but running FastME took less than one second regardless of the number of genes. However, PhyD* methods were much slower than FastME; on 1000 taxa, running any of the PhyD* methods took approximately 40 minutes (data not shown). ASTRID depends on FastME, PhyD*, and DendroPy [36].
Running times on biological data
We recorded running times for ASTRID-FastME and ASTRAL2 on the mammalian biological dataset. Both methods took 6 seconds for a single bootstrap replicate on one core of a 2.7 GHz Intel Xeon processor with 424 genes and 37 taxa.
Discussion
A few trends are apparent upon examining the data as a whole. ASTRAL2 and ASTRID had, for the most part, very similar levels of accuracy, while MPEST was consistently less accurate. However, there were cases where ASTRID and ASTRAL2 had small but detectably different levels of accuracy. One intriguing trend in the data is the improvement of ASTRAL2 over ASTRID on high ILS datasets; see Figures 6, 7, 8, and 9. In particular, Figures 6 and 7 suggest that increases in ILS should favor ASTRAL2 over ASTRID. Yet, ASTRID is consistently at least as accurate as ASTRAL2 on the avian datasets, which have moderate to very high levels of ILS (Fig. 4). Thus, ILS level might have an impact on the relative accuracy of the two methods, but it is not a determining factor. Similarly, neither method dominates the other based on the number of taxa, number of genes, or amount of gene tree estimation error. Thus, it is very difficult to characterize the conditions under which each method is likely to have an advantage over the other. However, even for the cases where there are differences in accuracy, in general the differences are fairly small. Thus, the main difference between the two methods is computational efficiency, where ASTRID is clearly faster. ASTRID has the biggest running time advantage over ASTRAL2 for large numbers of gene trees, since ASTRID scales linearly in the number of genes while ASTRAL2 scales superlinearly. This makes ASTRID an especially good method for genome-scale datasets that have a large number of genes.
Conclusion
ASTRID is a fast and highly accurate method for species tree estimation that is robust to high levels of ILS, and provably statistically consistent under the multispecies coalescent model. Like ASTRAL2, ASTRID can analyze datasets with unrooted gene trees, an advantage that the two methods have over many other methods (e.g., MPEST) that can only be run on rooted gene trees. ASTRID (like NJst) runs in time that is polynomial in the number of gene trees and species, but ASTRAL2 and other leading coalescent-based methods do not have this guarantee. Thus, ASTRID has many desirable theoretical properties compared to existing methods.
From an empirical viewpoint, ASTRID is also extremely fast and can analyze very large datasets in minutes, where other methods either cannot run or take hours. In particular, ASTRID is much faster than ASTRAL2, especially on datasets with many genes and large numbers of species. ASTRID also produces more accurate trees than MPEST and NJst, and is competitive with ASTRAL2 in terms of accuracy.
However, even better (more accurate) results might be obtained through more extensive modifications to the ASTRID algorithm design. In particular, the accuracy of the tree depends on the particular distance-based method that is used. New distance-based phylogeny estimation methods, such as the absolute fast converging methods [37–40], might provide improved accuracy for very large datasets. Another important direction is developing additional methods for estimating species trees from distance matrices that have good accuracy when the distance matrix has missing data. As we saw here, FastME produced more accurate trees than the PhyD* methods, but it could only be applied to distance matrices without any missing data. An extension of FastME to enable it to handle incomplete distance matrices would also be of great interest.
This study can be expanded in several directions. Future work should more carefully investigate the conditions under which ASTRID is more reliable than ASTRAL2, and explore performance on more biological datasets. This study also only investigated relatively long sequences; a subsequent study should investigate the relative and absolute accuracy of ASTRID and other methods on very short sequences, since recombination-free loci can be very short [32]. In addition, this study only examined datasets with a single individual per species, yet ASTRID (like NJst) can be run on datasets with multiple individuals; future work should evaluate the absolute and relative accuracy of ASTRID and other methods on such data. This study showed that ASTRID performed well in terms of species tree topology estimation, but we did not explore its accuracy with respect to the estimation of coalescent branch lengths; future work will need to explore how well ASTRID estimates these numeric parameters. Finally, it may well be that ASTRID will be most useful as a starting tree for use within more computationally intensive analyses, including Bayesian MCMC analyses (e.g., *BEAST) or maximum likelihood analyses.
Availability of supporting data
All datasets used in this study are available from prior publications. ASTRID is available in open source form on GitHub at http://pranjalv123.github.io/ASTRID. Supporting materials are available online at http://pranj.al/ASTRID.
Abbreviations
AD: Average distance
ILS: Incomplete lineage sorting
MSC: Multispecies coalescent
RF: Robinson-Foulds
Declarations
Acknowledgements
PV was supported by the Roy J. Carver graduate fellowship from the University of Illinois at Urbana-Champaign, College of Engineering. TW was supported by National Science Foundation grant DBI-1461364. PV and TW were both supported by the College of Engineering at the University of Illinois at Urbana-Champaign through a gift from the Grainger Foundation.
Publication costs for this article were funded by the corresponding author's institution.
This article has been published as part of BMC Genomics Volume 16 Supplement 10, 2015: Proceedings of the 13th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S10.
Authors’ Affiliations
References
1. Maddison W: Gene trees in species trees. Syst Biol. 1997, 46 (3): 523-536.
2. Kingman JFC: On the genealogy of large populations. J Appl Prob. 1982, 19: 27. doi:10.2307/3213548
3. Roch S, Steel M: Likelihood-based tree reconstruction on a concatenation of alignments can be statistically inconsistent. Theoretical Population Biology. 2015, 100: 56-62.
4. Degnan JH, Rosenberg NA: Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol Evol. 2009, 24 (6): 332-340. doi:10.1016/j.tree.2009.01.009
5. Liu L, Yu L, Kubatko L, Pearl DK, Edwards SV: Coalescent methods for estimating phylogenetic trees. Mol Phylogenet Evol. 2009, 53 (1): 320-328.
6. Knowles LL, Kubatko L: Estimating Species Trees: Practical and Theoretical Aspects. 2011, Wiley-Blackwell, Hoboken, NJ
7. Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T: ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics. 2014, 30: 541-548. doi:10.1093/bioinformatics/btu462
8. Mirarab S, Warnow T: ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics. 2015, 31 (12): 44-52. doi:10.1093/bioinformatics/btv234, [http://bioinformatics.oxfordjournals.org/content/31/12/i44.full.pdf+html]
9. Heled J, Drummond AJ: Bayesian inference of species trees from multilocus data. Mol Biol Evol. 2010, 27: 570-580. doi:10.1093/molbev/msp274
10. Liu L, Pearl DK: Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol. 2007, 56: 504-514. doi:10.1080/10635150701429982
11. Larget BR, Kotha SK, Dewey CN, Ané C: BUCKy: Gene tree/species tree reconciliation with Bayesian concordance analysis. Bioinformatics. 2010, 26: 2910-2911. doi:10.1093/bioinformatics/btq539
12. Dasarathy G, Nowak R, Roch S: Data requirement for phylogenetic inference from multiple loci: A new distance method. IEEE/ACM Trans Comp Biol Bioinformatics. 2015, 12: 422-432. doi:10.1109/TCBB.2014.2361685
13. Liu L, Yu L, Edwards SV: A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol Biol. 2010, 10: 302. doi:10.1186/1471-2148-10-302
14. Liu L, Yu L: Estimating species trees from unrooted gene trees. Syst Biol. 2011, 60 (5): 661-667. doi:10.1093/sysbio/syr027
15. Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, RoyChoudhury A: Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Mol Biol Evol. 2012, 29 (8): 1917-1932.
16. Liu L, Yu L, Pearl DK, Edwards SV: Estimating species phylogenies using coalescence times among sequences. Syst Biol. 2009, 58 (5): 468-477. doi:10.1093/sysbio/syp031
17. Kubatko L, Carstens BC, Knowles LL: STEM: Species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics. 2009, 25: 971-973.
18. Chifman J, Kubatko L: Quartet inference from SNP data under the coalescent model. Bioinformatics. 2014, doi:10.1093/bioinformatics/btu530, [http://bioinformatics.oxfordjournals.org/content/early/2014/08/27/bioinformatics.btu530.full.pdf+html]
19. Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4: 406-425.
20. Mirarab S, Bayzid MS, Boussau B, Warnow T: Statistical binning enables an accurate coalescent-based estimation of the avian tree. Science. 2014, 346 (6215): 1250463
21. Robinson DF, Foulds LR: Comparison of phylogenetic trees. Math Biosci. 1981, 53: 131-147.
22. Mirarab S, Bayzid MS, Boussau B, Warnow T: Statistical binning improves species tree estimation in the presence of gene tree heterogeneity. Science. 2014, 346 (6215): 1250463
23. Bayzid MS, Mirarab S, Warnow T: Weighted statistical binning: enabling statistically consistent genome-scale phylogenetic analyses. PLOS One. 2014
24. Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, Ho SY, Faircloth BC, Nabholz B, Howard JT: Whole-genome analyses resolve early branches in the tree of life of modern birds. Science. 2014, 346 (6215): 1320-1331.
25. Song S, Liu L, Edwards SV, Wu S: Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc Natl Acad Sci USA. 2012, 109 (37): 14942-14947.
26. Mallo D, Oliveira Martins L, Posada D: SimPhy: Comprehensive Simulation of Gene, Locus and Species Trees at the Genome-wide Level. [https://github.com/adamallo/SimPhy]
27. Price MN, Dehal PS, Arkin AP: FastTree 2 – Approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010, 5 (3). doi:10.1371/journal.pone.0009490
28. Criscuolo A, Gascuel O: Fast NJ-like algorithms to deal with incomplete distance matrices. BMC Bioinformatics. 2008, 9: 166. doi:10.1186/1471-2105-9-166
29. Desper R, Gascuel O: Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J Comput Biol. 2002, 9: 687-705. doi:10.1089/106652702761034136
30. Bayzid MS, Hunt T, Warnow T: Disk covering methods improve phylogenomic analyses. BMC Genomics. 2014, 15 (Suppl 6): 7
31. Mirarab S, Bayzid MS, Warnow T: Evaluating summary methods for multi-locus species tree estimation in the presence of incomplete lineage sorting. Syst Biol. 2014, 63
32. Gatesy JP, Springer MS: Phylogenetic analysis at deep timescales: Unreliable gene trees, bypassed hidden support, and the coalescence/concatalescence conundrum. Mol Phylog Evol. 2014, 80: 231-266.
33. Bayzid MS, Warnow T: Naive binning improves phylogenomic analyses. Bioinformatics. 2013, 29 (18): 2277-84. doi:10.1093/bioinformatics/btt394
34. Roch S, Warnow T: On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods. Syst Biol. 2015, 64 (4): 663-676.
35. Song S, Liu L, Edwards SV, Wu S: Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc Natl Acad Sci USA. 2012, 109 (37): 14942-14947.
36. Sukumaran J, Holder MT: DendroPy: A Python library for phylogenetic computing. Bioinformatics. 2010, 26 (12): 1569-1571.
37. Roch S: Towards extracting all phylogenetic information from matrices of evolutionary distances. Science. 2010, 327 (5971): 1376-1379.
38. Warnow T, Moret BME, St John K: Absolute phylogeny: true trees from short sequences. Proc 12th Ann ACM/SIAM Symp Discrete Algs (SODA01). 2001, SIAM Press, Philadelphia, PA, 186-195.
39. Gronau I, Moran S, Snir S: Fast and reliable reconstruction of phylogenetic trees with indistinguishable edges. J Random Struct Algs. 2012, 40 (3): 350-384. doi:10.1002/rsa.20372
40. Nakhleh L, Roshan U, St John K, Sun J, Warnow T: Designing fast converging phylogenetic methods. Bioinformatics. 2001, 17 (suppl 1): 190-198. doi:10.1093/bioinformatics/17.suppl_1.S190
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.