Disk covering methods improve phylogenomic analyses
 Md Shamsuzzoha Bayzid^{1}Email author,
 Tyler Hunt^{1} and
 Tandy Warnow^{1, 2}Email author
https://doi.org/10.1186/1471216415S6S7
© Bayzid et al.; licensee BioMed Central Ltd. 2014
Published: 17 October 2014
Abstract
Motivation
With the rapid growth rate of newly sequenced genomes, species tree inference from multiple genes has become a basic bioinformatics task in comparative and evolutionary biology. However, accurate species tree estimation is difficult in the presence of gene tree discordance, which is often due to incomplete lineage sorting (ILS), modelled by the multispecies coalescent. Several highly accurate coalescentbased species tree estimation methods have been developed over the last decade, including MPEST. However, the running time for MPEST increases rapidly as the number of species grows.
Results
We present divideandconquer techniques that improve the scalability of MPEST so that it can run efficiently on large datasets. Surprisingly, this technique also improves the accuracy of species trees estimated by MPEST, as our study shows on a collection of simulated and biological datasets.
Keywords
Background
A standard approach to species tree estimation uses multiple loci and then concatenates alignments for each locus into a supermatrix, which is then used to estimate the species tree. When genes all evolve down the same tree topology under the same wellbehaved process, then statistical methods of phylogeny estimation (such as maximum likelihood) applied to the concatenated alignment are statistically consistent, and so will return the true tree with high probability given a large enough number of sites or genes. However, when the genes evolve down different tree topologies, which can happen in the presence of gene duplication and loss, horizontal gene transfer, or incomplete lineage sorting, then there are no statistical guarantees for concatenated analyses. Furthermore, simulations have shown that concatenation can return incorrect trees with high confidence in the presence of incomplete lineage sorting [1], a populationlevel process modelled by the multispecies coalescent [2]. Because incomplete lineage sorting is expected to occur under many biologically realistic conditions (and especially in the presence of rapid radiations), coalescentbased species tree methods with statistical guarantees of returning the true tree with high probability (as the number of genes increases) have been developed, and are increasingly popular [3–9].
Only some of these coalescentbased methods are fast enough to be used with phylogenomic datasets that contain hundreds or thousands of genes and more than 30 or so species. For example, the fullyparametric coalescentbased methods, such as BEST [6] and *BEAST [3] that coestimate gene trees and species trees, are limited to approximately 20 species and 100 genes (and even datasets of this size can be extremely difficult) [10, 11]. The other type of coalescentbased method are called "summary methods" because they estimate species trees by combining estimated gene trees. These methods tend to be much faster than the fullyparametric methods, and some of these methods (e.g., MPEST [5]) are able to be used with hundreds to thousands of genes.
However, even the fast summary methods can be computationally intensive on large datasets. For example, MPEST, which has been used in many biological dataset analyses [12–15], uses a heuristic search to solve an NPhard pseudomaximum likelihood optimization problem (based on the triplet gene tree distribution). Our evaluation of MPEST (reported in this paper) shows that the number of species greatly impacts the running time; thus, improving MPEST's scalability (in terms of the number of species) is an important objective.
This paper introduces two general techniques for improving the scalability of coalescentbased species tree estimation methods so that they can analyze datasets with large numbers of species. Each technique uses an initial tree estimated on the set of species to divide the species dataset into small overlapping subsets, applies the species tree estimation method to each subset of species to produce an estimated species tree for that subset, and then combines the estimated species trees (each on a subsets of the species) into a tree on the full set of species. Furthermore, each technique can iterate, and thus return a set of candidate species trees from which the final tree is selected. The only difference between the two techniques is how the dataset is divided into subsets, with one technique using the dataset decomposition technique from DACTAL [16] and the other using a modification of the dataset decomposition technique from RecIDCM3 [17].
We evaluate these two techniques on a collection of simulated and biological datasets, and show that both reduce the running time of MPEST, one of the most popular coalescentbased summary methods. Surprisingly, these two techniques also improve the accuracy of MPEST. Thus, the two techniques improve the scalability of MPEST, a popular coalescentbased species tree estimation, so that it can be run on datasets with large numbers of species and provide improved topological accuracy.
Methods
DiskCovering Methods (DCMs) are metamethods (employing divideandconquer and in some cases also iteration) designed to "boost" the performance of the existing phylogenetic reconstruction methods [17–20]. The major steps of DCMs are: (i) decomposing the dataset into overlapping subsets of taxa, (ii) estimating trees on these subsets using a preferred phylogenetic method, and finally (iii) merging the subtrees to get a tree on the full set of taxa. However, DCMs have not yet been used in the context of species tree estimation from multiple gene trees, which is the focus of this study. Although the approach we present can be used with any coalescentbased method (including ones that coestimate gene trees and species trees, such as BEST and *BEAST), we study the technique specifically for use with MPEST.

Step 1: Compute a starting tree from the input set of gene trees; this is the initial guide tree (we show results using MPEST and Matrix Representation with Parsimony (MRP) [21]).

Step 2: Repeat for a userspecified number of iterations (we show 2 and 5).
 Step 2a: Decompose the set of species into small overlapping subsets of taxa, with target subset size specified by the user (we show 15), using the current guide tree.
Step 2b: For each subset, create a set of gene trees by restricting the input gene trees to the species present in the subset (each such gene tree is called a subset gene tree), and then apply MPEST to the subset gene trees to produce a newly estimated subset species tree.

Step 2c: Combine the subset species trees estimated in Step 2b using a supertree method (we use SuperFine+MRL [22]), thus producing a tree on the full set of species. This is the new guide tree, and is used in the next iteration. We also add this tree to the set of guide trees produced during the algorithm.


Step 3: Score each of the different guide trees produced during the algorithm with respect to the selected optimization criterion and return the tree with the best score.
We provide details for Step 2a and Step 3.
Step 2a: dataset decomposition techniques
We explored three different techniques for decomposing the set of species into subsets: DCM1 [23], DACTAL [16], and a decomposition we call the short subtree graph (SSG) [17]. The DCM1 decomposition improved MPEST but was less computationally efficient than the SSGdecomposition or the DACTALdecomposition. Therefore, we focus the remainder of our discussion on the other two techniques.
Definitions Let T be an edgeweighted guide tree on the set S of taxa. Let e be an internal edge in T, and t_{1}, t_{2}, t_{3}, t_{4} be the four subtrees around the edge e (i.e., removing e and its two endpoints from T breaks T into four subtrees: t_{1}, t_{2}, t_{3}, t_{4}). A short quartet around e contains four leaves, one from each of these four subtrees, where each leaf is selected to be the closest (according to the edge weights) in its subtree to e. Hence, the set of short quartets of a tree are obtained by taking all short quartets around all edges in the tree. We used a "padding" technique where we find a collection of closest leaves (e.g., 2 or 3, rather than just 1) from each of the four subtrees around e, and we call this a padded short quartet.
DACTALbased decomposition
DACTAL uses a paddedRecursiveDCM3 decomposition (PRD), as follows. The input is a guide tree T (without edge weights) and target subset size ms. The PRD decomposition finds a "centroid" edge (i.e., an edge that splits the guide tree into two subtrees containing roughly equal numbers of leaves). The removal of this edge and its endpoints divides the tree into four subtrees, A, B, C and D. For each of these four trees, the set of at most p/4 (where p is the padding size, and p < ms) closest leaves to the edge e are selected, and put into a set X; four leaves selected from different subtrees around the centroid edge, using this technique, are called "padded short quartets", generalizing the concept of short quartets where only the nearest leaf in each subtree is selected [18, 20]. However, if there are ties (i.e., leaves that are equally close to the branch e), then all leaves at the same (very close) distance are included in the set; thus, X > p is possible. Then, the set of leaves present in A ∪ X, B ∪ X, C ∪ X and D ∪ X define four overlapping subsets. If any of these sets is larger than ms, then the decomposition is repeated recursively on that set until all subsets have size at most ms and the padding size requirement is satisfied. However, if the application of the decomposition cannot reduce the subset size, then the subset is returned. Thus, both p and ms are treated as targets rather than hard constraints. For the simulated datasets studied in this paper, we set p = 4, which means that we only used short quartets (one leaf in each subtree around a centroid edge). However, for larger datasets, increasing p might lead to improved analyses.
SSGbased decomposition
The SSGbased decomposition technique we present in this paper is similar to the DCM3 decomposition presented in [17], but modified through the use of the "padding" (described above) so that there is more overlap between subsets.
Given input guide tree T and target maximum subset size (ms), the SSGbased decomposition creates a "padded" short subtree graph G = (V, E) as follows. First, we compute $p=\u230a\frac{ms}{4}\u230b$. We then compute the set of at most p closest leaves in each subtree around a given edge in the graph, and make a clique out of this set of (at most) 4p species. The graph containing all these cliques is the padded short subtree graph. Equivalently, the vertex set V contains the leaves in T (i.e., the species) and (s_{ i }, s_{ j }) ∈ E if and only if there is some edge e in the guide tree T such that s_{ i } and s_{ j } are each among the p nearest leaves to e in their respective subtrees. Because a padded short subtree graph is chordal, it contains at most n = V maximal cliques, and these can be found in polynomial time [24]. Note that typically the number of vertices in the maximal cliques will be at most ms, but some of them can be slightly bigger than ms. Thus, as with DACTAL, ms is a target maximal value, and not a strict upper bound on the size of any subset we analyze.
Step 3: Selecting the best tree across different iterations
We explored two different optimality criteria  the maximum pseudolikelihood score computed by MPEST, which is based on the rooted triplet tree distribution, and a "quartet support score" [25]. The quartet support measures the similarity between a candidate tree T and the input gene trees, and is computed as follows. We decompose each input gene tree into its induced set of quartet trees (i.e., unrooted trees formed by picking four leaves). The quartet support score of a given candidate species tree T is the total, over all the input gene trees, of the number of induced quartet trees that T agrees with. As shown in [5], the tree that optimizes the maximum pseudolikelihood score is a statistically consistent estimator of the true species tree under the multispecies coalescent model. Interestingly, the same is true of the quartet support score, as shown in [25].
Experiments
We explore the performance of MPEST [5] and these boosted versions of MPEST on a collection of simulated and biological datasets. We compare the estimated species trees to the model species tree (for the simulated datasets) or to the scientific literature (for the biological datasets), to evaluate accuracy. The tree error is measured using the missing branch rate (also called the false negative rate), which is the percentage of the internal edges in the model tree that are missing in the estimated tree. We measure the statistical significance of the results by Wilcoxon signedrank test with α = 0.05.
Mammalian simulated datasets
We used datasets generated in another study [25] to explore performance of coalescentbased methods for estimating species trees. These datasets have gene sequence alignments generated under a multistage simulation process, which begins with a species tree estimated on a mammalian dataset (studied in [12]) using MPEST, simulates gene trees down the species tree under the multispecies coalescent model (so that the gene trees can differ topologically from the species tree), and then simulates gene sequence alignments down the gene trees under the GTRGAMMA model. We direct the reader to [25] for full details.
The basic model species tree has branch lengths in coalescent units, and we produced other model species trees by rescaling the branch lengths. This rescaling varies the amount of ILS (shorter branches have more ILS), and also impacts the amount of gene tree estimation error and the average bootstrap support (BS) in the estimated gene trees. The model condition with reduced ILS was created by uniformly doubling (2X) the branch lengths, and two model conditions with higher ILS were generated by uniformly dividing the branch lengths by two (0.5X) and five (0.2X). The amount of ILS obtained without adjusting the branch lengths is referred to as "moderate ILS", and was estimated by MPEST on the biological data. Each model species tree was then used to generate gene trees under the multispecies coalescent model. The branch lengths in the gene trees were then modified to deviate from the strict molecular clock, and sequences were simulated down each gene tree under the GTRGAMMA model.
Maximum likelihood (ML) gene trees were estimated on each sequence alignment using RAxML [26] under the GTRGAMMA model, with 200 bootstrap replicates to produce bootstrap support on the branches. The average bootstrap support (BS) in the biological data was 71%, and the sequence lengths were set to produce estimated gene trees with average BS bracketing that value  500 bp alignments produced estimated gene trees with 63% average BS and 1000 bp alignments produced estimated gene trees with 79% average BS.
The number of genes ranged from 50 to 800 to explore both smaller and larger numbers of genes than the full biological dataset (which had roughly 400 genes). For each model condition (specified by the ILS level, the number of genes, and the sequence length), we created 20 replicate datasets.
Biological datasets
We analyzed two biological datasets  the mammalian dataset from [12] containing 37 species and 424 genes, and the amniota dataset from [13] containing 16 species and 248 genes  using MPEST and both versions of boosted MPEST. We set ms = 15 for the mammalian dataset, and ms = 10 for the amniota dataset.
Results
Running time on simulated datasets
In contrast, each iteration of boosted MPEST requires much less time: 12 minutes per iteration for SSGboosting and 7 minutes per iteration for DACTALboosting, each run sequentially.
Impact of boosting on topological accuracy for simulated datasets
We compared the accuracy and running time for various boosting techniques. We used MPEST to produce the starting tree, and then ran five different iterations of DACTALboosting and SSGboosting, using different subset sizes (from 15 to 22), and using different criteria (maximum pseudolikelihood as computed by MPEST or quartet support) to select the final tree.
Concatenation is expected to be less accurate than coalescentbased methods when there is substantial ILS, and this is what we observed in these experiments. Thus, with the exception of the 2X model condition (which had the least ILS), concatenation was less accurate than both MPEST and boosted MPEST. Interestingly, the improvement of concatenation over boosted MPEST on the 2X model condition was not statistically significant (p = 0.33 and p = 0.4 for SSG and DACTALbased boosting, respectively). Also, on the moderate level of ILS (1X), concatenation and MPEST had very close performance, but boosted MPEST was more accurate than concatenation. However, the differences between boosted MPEST and concatenation were not statistically significant (p = 0.08 and p = 0.11 for DACTAL and SSGbased boosting respectively).
The comparison between concatenation and (boosted) MPEST is also interesting. For the 50gene case, concatenation was more accurate than unboosted MPEST, but DACTALboosted MPEST matched the accuracy of concatenation, and SSGboosted MPEST was slightly more accurate than concatenation. For other cases (100800 genes), the differences between concatenation and MPEST were not statistically significant (p >0.05), but both SSGboosted and DACTALboosted versions of MPEST were more accurate than concatenation. Furthermore, the improvement of boosted MPEST over concatenation was statistically significant for 400 and 800gene cases (p = 0.02 and 0.008 for the 400 and 800gene cases, respectively, for both SSG and DACTALbased boosting).
Results on biological datasets
Amniota dataset. We analyzed data for 248 genes on 16 amniota species from Chiari et al. [13]. Previous studies had placed turtles as the sister to birds and crocodiles (Archosaurs) [27–29]. Chiari et al. [13] used concatenation and MPEST with multilocus bootstrapping on two sets of gene trees  one based on amino acid (AA) and the other based on nucleotide (NT) alignments. Concatenation and MPEST on the AA gene trees resolved the clade as (turtles,(birds, crocodiles)) (i.e., birds and crocodiles were considered sister taxa, consistent with the earlier studies) while MPEST on the NT data produced (birds,(turtles,crocodiles)), and so contradicted the previous studies. Because the concatenation tree and the MPEST(AA) tree agreed and were consistent with previous studies, the resolution with turtles as sister to birds and crocodiles was considered more likely to be correct.
Mammalian dataset. Song et al. [12] analyzed a dataset with 447 genes across 37 mammalian species using MPEST and concatenation. In our analysis of this data we detected 21 genes with mislabelled sequences (incorrect taxon names, confirmed by the authors) which we removed from the dataset. We also identified two additional gene trees that were clearly topologically very different from all other gene trees, and removed these as well. We ran MPEST on the 424 gene trees with SSG and DACTALbased boosting using the MPEST starting tree. All analyses we ran produced the same tree (see Fig. S9 in Additional file 1).
Pseudolikelihood scores and quartet support values
Our analyses of the simulated and biological datasets showed that MPEST always found trees with pseudolikelihood scores that were at least as good as those found by any boosted MPEST analysis, over all the iterations. In other words, the best pseudolikelihood score was always found in the MPEST starting tree. On the other hand, the best quartet support score was nearly always found in a subsequent iteration, for both types of boosting techniques. The first of these observations suggests that MPEST is doing a reasonably good job of solving its optimization problem, since boosting is not improving its search. The second of the observations is also very interesting, since the boosting techniques are not explicitly designed to optimize quartet support, and we have no explanation for this trend.
Robustness to the starting trees
In the experiments shown so far, the starting tree was produced using MPEST. We tested robustness to the starting tree by using MRP, a fast supertree technique, to compute a starting tree. However, unlike MPEST, MRP has not been shown to be statistically consistent in the presence of ILS, and so is not likely to be as accurate as MPEST
Statistical consistency
The following theorem is a direct corollary of Theorem 1 in [16].
Theorem 1: Let T be the true species tree, and let S_{ 1 }, S_{ 2 },..., S_{ k } be the subsets created by a DACTAL or SSGdecomposition with T as the starting tree. Let t_{ i } be the true species tree on S_{ i }, i = 1, 2,..., k. Then the Strict Consensus Merger (and by extension also SuperFine+MRL), applied to the set t_{ 1 }, t_{ 2 },..., t_{ k, } will return the species tree T.
Comment: SuperFine+MRL has two steps: first it computes the Strict Consensus Merger (SCM), and then it resolves high degree nodes in the SCM tree using MRL. Therefore, if SCM produces a fully resolved tree, SuperFine+MRL returns the SCM tree.
Therefore, the following corollary can be easily proven:
Corollary 1: If the starting tree is computed using a method that is statistically consistent under the multispecies coalescent model, then the pipeline based on either the DACTAL or SSG decomposition is statistically consistent under the multispecies coalescent model.
Discussion
The results shown in this study suggest that using iteration and divideandconquer (within the DACTALbased and SSGbased decomposition techniques) improved the topological accuracy of MPEST. Furthermore, the specific choice of dataset decomposition technique (DACTALbased or SSGbased) had little impact on accuracy. The improvement obtained by selecting trees based on their quartet support scores instead of their maximum pseudolikelihood scores is very interesting, and suggests the possibility that although both optimality criteria are statistically consistent ways of searching for species trees under the multispecies coalescent, the quartet support score might have better empirical performance than the pseudolikelihood score, at least under some conditions.
While most of the analyses were based on using MPEST to produce the starting tree, we also showed that using MRP (a supertree method) to produce the starting tree resulted in comparable accuracy after five iterations. Since MRP generally produced less accurate starting trees than MPEST, this suggests that the boosting techniques are robust to the starting tree. Furthermore, MRP was very fast on these datasets, completing in just ten seconds. Thus, when used with MRP as a starting tree, the entire pipeline (computing the starting tree, running five iterations of DACTAL boosting, and selecting the final tree) completes in 35 minutes. By comparison, MPEST run without boosting takes nearly 115 minutes (nearly two hours). Thus, boosting improves the speed of MPEST. If we use SuperFine+MRL or SuperFine+MRP to compute the starting tree, then DACTALboosted MPEST should be fast, even for large numbers of species, since computing the starting tree using SuperFine is typically very fast, even on large datasets [22]. Furthermore, although we do not explore datasets with more than 37 species, the running times in Figure 1 suggest that MPEST may be computationally infeasible for datasets with a few hundred species. By contrast, boosted versions of MPEST are likely to scale close to linearly with the number of species, and are embarrassingly parallel. Thus, largescale analyses of even several hundred species should be feasible using boosted MPEST.
While the improvement in speed was expected, the improvement in accuracy was unexpected, and merits discussion. One possibility is that the performance we observed is mainly the result of some specific property of the simulation conditions we explored in this study, and that a larger study might show a difference in relative performance between boosted and unboosted MPEST. However, both boosted versions of MPEST gave more accurate results on the biological amniota dataset, and so that is not likely to be the answer. As noted, MPEST is a heuristic for maximum pseudolikelihood, and so another possible explanation is that MPEST might have difficulty finding good solutions to its optimization problem on large datasets. However, the trees found by MPEST had ML scores that were at least as good (and most often better) than the trees produced in any iteration by the boosted versions of MPEST. Thus, this was clearly not the reason boosting improves MPEST.
Instead, the data suggests that the boosting technique leads to trees with better quartet support scores, and that using quartet support scores to select the best species tree might be helping these boosted versions of MPEST to produce more accurate trees. This hypothesis is supported by the fact that selecting the best tree based on the quartet support produced improved topological accuracy compared to selecting the best tree based on the pseudolikelihood score, and that the quartet support optimization criterion is statistically consistent under the multispecies coalescent model [25].
Conclusions
MPEST is one of the popular methods for estimating species trees from a collection of gene trees, and has statistical guarantees under the multispecies coalescent model. MPEST is fast on small datasets (with not too many species) but its running time grows quickly with the number of species. We presented two iterative divideandconquer techniques (DACTALboosting and SSGboosting) to use with MPEST, with the goal of enabling MPEST to analyze datasets with large numbers of species more efficiently. We tested these techniques on a collection of simulated and biological datasets, and showed that boosted versions of MPEST were fast and highly accurate using these divideandconquer methods. The improvement in accuracy obtained by using these boosting techniques is not explained by any failure in MPEST to optimize maximum likelihood effectively, but rather suggests the possibility that an alternative optimization criterion  quartet support  may be a highly effective approach to estimating species trees under the multispecies coalescent model.
Methods and commands
Gene tree estimation: RAxML version 7.3.5 [30] was used to estimate gene trees under the GTRGAMMA model, using the following command:
raxmlHPCSSE3 m GTRGAMMA s [input alignment] n [output name] N 20 p [random seed number]
The following command was used for bootstrapping:
raxmlHPCSSE3 m GTRGAMMA s [input alignment] n [output name] N 200 p [random seed number] b [random seed number]
Concatenation: For the concatenated analysis, we computed a parsimony starting tree using RAxML version 7.3.5, and then ran RAxMLlight version 1.0.6. We used the following commands:
raxmlHPCSSE3 y s supermatrix.phylip m GTRGAMMA n [output name] p [random seed number]
raxmlLightPTHREADS T 4 s supermatrix.phylip m GTRGAMMA n name t [parsimony tree]
MPEST: We used version 1.3 of MPEST.
MRP: We created MRP matrices using a custom Java program, and solved MRP heuristically using the default approach available in PAUP^{*} (v. 4. 0b10) [31]. PAUP^{*} generates an initial tree through random sequence addition and then performs Tree Bisection and Reconnection (TBR) moves until it reaches a local optimum. This process is repeated 1000 times, and at the end the most parsimonious tree is returned. When multiple trees are found with the same maximum parsimony score, the "extended majority consensus" of those trees is returned. See Additional file 1 for the PAUP^{*} commands used to run MRP.
Decompositions and SuperFine: We used our custom scripts written in various languages (Perl, Python, C++ and Java) for SSG and DACTALbased decomposition and SuperFine.
Declarations
Acknowledgements
The authors thank the reviewers and also R. Davidson, S. Mirarab, and T. Zimmerman for helpful suggestions that improved the manuscript.
Declarations
This work was partially supported by National Science Foundation grants DEB 0733029 and DBI 1062335 to TW.
This article has been published as part of BMC Genomics Volume 15 Supplement 6, 2014: Proceedings of the Twelfth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S6.
Authors’ Affiliations
References
 Kubatko LS, Degnan JH: Inconsistency of phylogenetic estimates from concatenated data under coalescence. Systematic Biology. 2007, 56 (1): 1724. 10.1080/10635150601146041.PubMedView ArticleGoogle Scholar
 Kingman JFC: The coalescent. Stochastic Processes and Their Applications. 1982, 13 (3): 235248. 10.1016/03044149(82)900114.View ArticleGoogle Scholar
 Heled J, Drummond AJ: Bayesian inference of species trees from multilocus data. Molecular Biology and Evolution. 2010, 27 (3): 570580. 10.1093/molbev/msp274.PubMedPubMed CentralView ArticleGoogle Scholar
 Larget B, Kotha SK, Dewey CN, An´e C: BUCKy: Gene tree/species tree reconciliation with the Bayesian concordance analysis. Bioinformatics. 2010, 26 (22): 29102911. 10.1093/bioinformatics/btq539.PubMedView ArticleGoogle Scholar
 Liu L, Yu L, Edwards SV: A maximum pseudolikelihood approach for estimating species trees under the coalescent model. BMC Evolutionary Biology. 2010, 10 (1): 30210.1186/1471214810302.PubMedPubMed CentralView ArticleGoogle Scholar
 Liu L: BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics. 2008, 24 (21): 25422543. 10.1093/bioinformatics/btn484.PubMedView ArticleGoogle Scholar
 Kubatko LS, Carstens BC, Knowles LL: STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinfomatics. 2009, 25 (7): 971973. 10.1093/bioinformatics/btp079.View ArticleGoogle Scholar
 Liu L, Yu L, Pearl DK, Edwards SV: Estimating species phylogenies using coalescence times among sequences. Systematic Biology. 2009, 58 (5): 468477. 10.1093/sysbio/syp031.PubMedView ArticleGoogle Scholar
 Mossel E, Roch S: Incomplete lineage sorting: consistent phylogeny estimation from multiple loci. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2011, 7 (1): 166171.View ArticleGoogle Scholar
 Bayzid MS, Warnow T: Naive binning improves phylogenomic analyses. Bioinformatics. 2013, 29 (18): 22772284. 10.1093/bioinformatics/btt394.PubMedView ArticleGoogle Scholar
 Smith BT, Harvey MG, Faircloth BC, Glenn TC, Brumfield RT: Target capture and massively parallel sequencing of ultraconserved elements for comparative studies at shallow evolutionary time scales. Systematic Biology. 2014, 63 (1): 8395. 10.1093/sysbio/syt061.PubMedView ArticleGoogle Scholar
 Song S, Liu L, Edwards SV, Wu S: Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proceedings of the National Academy of Sciences of the United States of America. 2012, 109 (37): 149427. 10.1073/pnas.1211733109.PubMedPubMed CentralView ArticleGoogle Scholar
 Chiari Y, Cahais V, Galtier N, Delsuc F: Phylogenomic analyses support the position of turtles as the sister group of birds and crocodiles (archosauria). BMC Biology. 2012, 10 (1): 6510.1186/174170071065.PubMedPubMed CentralView ArticleGoogle Scholar
 Zhong B, Liu L, Yan Z, Penny D: Origin of land plants using the multispecies coalescent model. Trends in Plant Science. 2013, 18 (9): 492495. 10.1016/j.tplants.2013.04.009.PubMedView ArticleGoogle Scholar
 Kumar V, Hallstr¨om BM, Janke A: Coalescentbased genome analyses resolve the early branches of the euarchontoglires. PLoS ONE. 2013, 8 (4): 6001910.1371/journal.pone.0060019.View ArticleGoogle Scholar
 Nelesen S, Liu K, Wang LS, Linder CR, Warnow T: DACTAL: divideandconquer trees (almost) without alignments. Bioinformatics. 2012, 28 (12): 274282. 10.1093/bioinformatics/bts218.View ArticleGoogle Scholar
 Roshan U, Moret BME, Williams TL, Warnow T: RecIDCM3: A fast algorithmic technique for reconstructing large phylogenetic trees. In: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference. 2004, 98109.Google Scholar
 Huson D, Nettles S, Warnow T: Diskcovering, a fast converging method for phylogenetic tree reconstruction. Journal of Computational Biology. 1999, 6 (3): 369386.PubMedView ArticleGoogle Scholar
 Huson D, Vawter L, Warnow T: Solving large scale phylogenetic problems using DCM2. In: Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB'99). 1999, 118129.Google Scholar
 Nakhleh L, Roshan U, James KS, Sun J, Warnow T: Designing fast converging phylogenetic methods. Bioinformatics. 2001, 17 (suppl 1): 190198. 10.1093/bioinformatics/17.suppl_1.S190.View ArticleGoogle Scholar
 Ragan MA: Phylogenetic inference based on matrix representation of trees. Molecular Phylogenetics and Evolution. 1992, 1 (1): 5358. 10.1016/10557903(92)90035F.PubMedView ArticleGoogle Scholar
 Nguyen N, Mirarab S, Warnow T: MRL and SuperFine+MRL: new supertree methods. Algorithms for Molecular Biology. 2012, 7 (3):Google Scholar
 Warnow T, Moret BM, St John K: Absolute convergence: true trees from short sequences. Proceedings of the Twelfth Annual ACMSIAM Symposium on Discrete Algorithms (SODA). 2001, 186195.Google Scholar
 Golumbic MC: Algorithmic Graph Theory and Perfect Graphs. 2004, Elsevier B.V., Amsterdam, The NetherlandsGoogle Scholar
 Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson S, Warnow T: ASTRAL: Genomescale coalescentbased species tree estimation. Bioinformatics. 2014, 30 (17): i541i546. 10.1093/bioinformatics/btu462.PubMedPubMed CentralView ArticleGoogle Scholar
 Stamatakis A: RAxMLVIHPC: maximum likelihoodbased phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006, 22 (21): 26882690. 10.1093/bioinformatics/btl446.PubMedView ArticleGoogle Scholar
 Hugall AF, Foster R, Lee MS: Calibration choice, rate smoothing, and the pattern of tetrapod diversification according to the long nuclear gene rag1. Systematic Biology. 2007, 56 (4): 543563. 10.1080/10635150701477825.PubMedView ArticleGoogle Scholar
 Iwabe N, Hara Y, Kumazawa Y, Shibamoto K, Saito Y, Miyata T, Katoh K: Sister group relationship of turtles to the birdcrocodilian clade revealed by nuclear DNAcoded proteins. Molecular Biology and Evolution. 2005, 22 (4): 810813. 10.1093/molbev/msi075.PubMedView ArticleGoogle Scholar
 Zardoya R, Meyer A: Complete mitochondrial genome suggests diapsid affinities of turtles. Proceedings of the National Academy of Sciences. 1998, 95 (24): 1422614231. 10.1073/pnas.95.24.14226.View ArticleGoogle Scholar
 Stamatakis A: RAxMLVIHPC: maximum likelihoodbased phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006, 22 (21): 26882690. 10.1093/bioinformatics/btl446.PubMedView ArticleGoogle Scholar
 Swofford DL: PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). 2003, Version 4. Sinauer Associates, Sunderland, MassachusettsGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.