Volume 15 Supplement 6
Disk covering methods improve phylogenomic analyses
© Bayzid et al.; licensee BioMed Central Ltd. 2014
Published: 17 October 2014
With the rapid growth rate of newly sequenced genomes, species tree inference from multiple genes has become a basic bioinformatics task in comparative and evolutionary biology. However, accurate species tree estimation is difficult in the presence of gene tree discordance, which is often due to incomplete lineage sorting (ILS), modelled by the multi-species coalescent. Several highly accurate coalescent-based species tree estimation methods have been developed over the last decade, including MP-EST. However, the running time for MP-EST increases rapidly as the number of species grows.
We present divide-and-conquer techniques that improve the scalability of MP-EST so that it can run efficiently on large datasets. Surprisingly, this technique also improves the accuracy of species trees estimated by MP-EST, as our study shows on a collection of simulated and biological datasets.
Keywordsmulti-species coalescent process incomplete lineage sorting MP-EST disk covering methods divide-and-conquer
A standard approach to species tree estimation uses multiple loci and then concatenates alignments for each locus into a super-matrix, which is then used to estimate the species tree. When genes all evolve down the same tree topology under the same well-behaved process, then statistical methods of phylogeny estimation (such as maximum likelihood) applied to the concatenated alignment are statistically consistent, and so will return the true tree with high probability given a large enough number of sites or genes. However, when the genes evolve down different tree topologies, which can happen in the presence of gene duplication and loss, horizontal gene transfer, or incomplete lineage sorting, then there are no statistical guarantees for concatenated analyses. Furthermore, simulations have shown that concatenation can return incorrect trees with high confidence in the presence of incomplete lineage sorting , a population-level process modelled by the multi-species coalescent . Because incomplete lineage sorting is expected to occur under many biologically realistic conditions (and especially in the presence of rapid radiations), coalescent-based species tree methods with statistical guarantees of returning the true tree with high probability (as the number of genes increases) have been developed, and are increasingly popular [3–9].
Only some of these coalescent-based methods are fast enough to be used with phylogenomic datasets that contain hundreds or thousands of genes and more than 30 or so species. For example, the fully-parametric coalescent-based methods, such as BEST  and *BEAST  that co-estimate gene trees and species trees, are limited to approximately 20 species and 100 genes (and even datasets of this size can be extremely difficult) [10, 11]. The other type of coalescent-based method are called "summary methods" because they estimate species trees by combining estimated gene trees. These methods tend to be much faster than the fully-parametric methods, and some of these methods (e.g., MP-EST ) are able to be used with hundreds to thousands of genes.
However, even the fast summary methods can be computationally intensive on large datasets. For example, MP-EST, which has been used in many biological dataset analyses [12–15], uses a heuristic search to solve an NP-hard pseudo-maximum likelihood optimization problem (based on the triplet gene tree distribution). Our evaluation of MP-EST (reported in this paper) shows that the number of species greatly impacts the running time; thus, improving MP-EST's scalability (in terms of the number of species) is an important objective.
This paper introduces two general techniques for improving the scalability of coalescent-based species tree estimation methods so that they can analyze datasets with large numbers of species. Each technique uses an initial tree estimated on the set of species to divide the species dataset into small overlapping subsets, applies the species tree estimation method to each subset of species to produce an estimated species tree for that subset, and then combines the estimated species trees (each on a subsets of the species) into a tree on the full set of species. Furthermore, each technique can iterate, and thus return a set of candidate species trees from which the final tree is selected. The only difference between the two techniques is how the dataset is divided into subsets, with one technique using the dataset decomposition technique from DACTAL  and the other using a modification of the dataset decomposition technique from Rec-I-DCM3 .
We evaluate these two techniques on a collection of simulated and biological datasets, and show that both reduce the running time of MP-EST, one of the most popular coalescent-based summary methods. Surprisingly, these two techniques also improve the accuracy of MP-EST. Thus, the two techniques improve the scalability of MP-EST, a popular coalescent-based species tree estimation, so that it can be run on datasets with large numbers of species and provide improved topological accuracy.
Disk-Covering Methods (DCMs) are meta-methods (employing divide-and-conquer and in some cases also iteration) designed to "boost" the performance of the existing phylogenetic reconstruction methods [17–20]. The major steps of DCMs are: (i) decomposing the dataset into overlapping subsets of taxa, (ii) estimating trees on these subsets using a preferred phylogenetic method, and finally (iii) merging the subtrees to get a tree on the full set of taxa. However, DCMs have not yet been used in the context of species tree estimation from multiple gene trees, which is the focus of this study. Although the approach we present can be used with any coalescent-based method (including ones that co-estimate gene trees and species trees, such as BEST and *BEAST), we study the technique specifically for use with MP-EST.
Step 1: Compute a starting tree from the input set of gene trees; this is the initial guide tree (we show results using MP-EST and Matrix Representation with Parsimony (MRP) ).
Step 2: Repeat for a user-specified number of iterations (we show 2 and 5).- Step 2a: Decompose the set of species into small overlapping subsets of taxa, with target subset size specified by the user (we show 15), using the current guide tree.
Step 2b: For each subset, create a set of gene trees by restricting the input gene trees to the species present in the subset (each such gene tree is called a subset gene tree), and then apply MP-EST to the subset gene trees to produce a newly estimated subset species tree.
Step 2c: Combine the subset species trees estimated in Step 2b using a supertree method (we use SuperFine+MRL ), thus producing a tree on the full set of species. This is the new guide tree, and is used in the next iteration. We also add this tree to the set of guide trees produced during the algorithm.
Step 3: Score each of the different guide trees produced during the algorithm with respect to the selected optimization criterion and return the tree with the best score.
We provide details for Step 2a and Step 3.
Step 2a: dataset decomposition techniques
We explored three different techniques for decomposing the set of species into subsets: DCM1 , DACTAL , and a decomposition we call the short subtree graph (SSG) . The DCM1 decomposition improved MP-EST but was less computationally efficient than the SSG-decomposition or the DACTAL-decomposition. Therefore, we focus the remainder of our discussion on the other two techniques.
Definitions Let T be an edge-weighted guide tree on the set S of taxa. Let e be an internal edge in T, and t1, t2, t3, t4 be the four subtrees around the edge e (i.e., removing e and its two endpoints from T breaks T into four subtrees: t1, t2, t3, t4). A short quartet around e contains four leaves, one from each of these four subtrees, where each leaf is selected to be the closest (according to the edge weights) in its subtree to e. Hence, the set of short quartets of a tree are obtained by taking all short quartets around all edges in the tree. We used a "padding" technique where we find a collection of closest leaves (e.g., 2 or 3, rather than just 1) from each of the four subtrees around e, and we call this a padded short quartet.
DACTAL uses a padded-Recursive-DCM3 decomposition (PRD), as follows. The input is a guide tree T (without edge weights) and target subset size ms. The PRD decomposition finds a "centroid" edge (i.e., an edge that splits the guide tree into two subtrees containing roughly equal numbers of leaves). The removal of this edge and its endpoints divides the tree into four subtrees, A, B, C and D. For each of these four trees, the set of at most p/4 (where p is the padding size, and p < ms) closest leaves to the edge e are selected, and put into a set X; four leaves selected from different subtrees around the centroid edge, using this technique, are called "padded short quartets", generalizing the concept of short quartets where only the nearest leaf in each subtree is selected [18, 20]. However, if there are ties (i.e., leaves that are equally close to the branch e), then all leaves at the same (very close) distance are included in the set; thus, |X| > p is possible. Then, the set of leaves present in A ∪ X, B ∪ X, C ∪ X and D ∪ X define four overlapping subsets. If any of these sets is larger than ms, then the decomposition is repeated recursively on that set until all subsets have size at most ms and the padding size requirement is satisfied. However, if the application of the decomposition cannot reduce the subset size, then the subset is returned. Thus, both p and ms are treated as targets rather than hard constraints. For the simulated datasets studied in this paper, we set p = 4, which means that we only used short quartets (one leaf in each subtree around a centroid edge). However, for larger datasets, increasing p might lead to improved analyses.
The SSG-based decomposition technique we present in this paper is similar to the DCM3 decomposition presented in , but modified through the use of the "padding" (described above) so that there is more overlap between subsets.
Given input guide tree T and target maximum subset size (ms), the SSG-based decomposition creates a "padded" short subtree graph G = (V, E) as follows. First, we compute . We then compute the set of at most p closest leaves in each subtree around a given edge in the graph, and make a clique out of this set of (at most) 4p species. The graph containing all these cliques is the padded short subtree graph. Equivalently, the vertex set V contains the leaves in T (i.e., the species) and (s i , s j ) ∈ E if and only if there is some edge e in the guide tree T such that s i and s j are each among the p nearest leaves to e in their respective subtrees. Because a padded short subtree graph is chordal, it contains at most n = |V| maximal cliques, and these can be found in polynomial time . Note that typically the number of vertices in the maximal cliques will be at most ms, but some of them can be slightly bigger than ms. Thus, as with DACTAL, ms is a target maximal value, and not a strict upper bound on the size of any subset we analyze.
Step 3: Selecting the best tree across different iterations
We explored two different optimality criteria - the maximum pseudo-likelihood score computed by MP-EST, which is based on the rooted triplet tree distribution, and a "quartet support score" . The quartet support measures the similarity between a candidate tree T and the input gene trees, and is computed as follows. We decompose each input gene tree into its induced set of quartet trees (i.e., unrooted trees formed by picking four leaves). The quartet support score of a given candidate species tree T is the total, over all the input gene trees, of the number of induced quartet trees that T agrees with. As shown in , the tree that optimizes the maximum pseudo-likelihood score is a statistically consistent estimator of the true species tree under the multi-species coalescent model. Interestingly, the same is true of the quartet support score, as shown in .
We explore the performance of MP-EST  and these boosted versions of MP-EST on a collection of simulated and biological datasets. We compare the estimated species trees to the model species tree (for the simulated datasets) or to the scientific literature (for the biological datasets), to evaluate accuracy. The tree error is measured using the missing branch rate (also called the false negative rate), which is the percentage of the internal edges in the model tree that are missing in the estimated tree. We measure the statistical significance of the results by Wilcoxon signed-rank test with α = 0.05.
Mammalian simulated datasets
We used datasets generated in another study  to explore performance of coalescent-based methods for estimating species trees. These datasets have gene sequence alignments generated under a multi-stage simulation process, which begins with a species tree estimated on a mammalian dataset (studied in ) using MP-EST, simulates gene trees down the species tree under the multi-species coalescent model (so that the gene trees can differ topologically from the species tree), and then simulates gene sequence alignments down the gene trees under the GTRGAMMA model. We direct the reader to  for full details.
The basic model species tree has branch lengths in coalescent units, and we produced other model species trees by rescaling the branch lengths. This rescaling varies the amount of ILS (shorter branches have more ILS), and also impacts the amount of gene tree estimation error and the average bootstrap support (BS) in the estimated gene trees. The model condition with reduced ILS was created by uniformly doubling (2X) the branch lengths, and two model conditions with higher ILS were generated by uniformly dividing the branch lengths by two (0.5X) and five (0.2X). The amount of ILS obtained without adjusting the branch lengths is referred to as "moderate ILS", and was estimated by MP-EST on the biological data. Each model species tree was then used to generate gene trees under the multi-species coalescent model. The branch lengths in the gene trees were then modified to deviate from the strict molecular clock, and sequences were simulated down each gene tree under the GTRGAMMA model.
Maximum likelihood (ML) gene trees were estimated on each sequence alignment using RAxML  under the GTRGAMMA model, with 200 bootstrap replicates to produce bootstrap support on the branches. The average bootstrap support (BS) in the biological data was 71%, and the sequence lengths were set to produce estimated gene trees with average BS bracketing that value - 500 bp alignments produced estimated gene trees with 63% average BS and 1000 bp alignments produced estimated gene trees with 79% average BS.
The number of genes ranged from 50 to 800 to explore both smaller and larger numbers of genes than the full biological dataset (which had roughly 400 genes). For each model condition (specified by the ILS level, the number of genes, and the sequence length), we created 20 replicate datasets.
We analyzed two biological datasets - the mammalian dataset from  containing 37 species and 424 genes, and the amniota dataset from  containing 16 species and 248 genes - using MP-EST and both versions of boosted MP-EST. We set ms = 15 for the mammalian dataset, and ms = 10 for the amniota dataset.
Running time on simulated datasets
In contrast, each iteration of boosted MP-EST requires much less time: 12 minutes per iteration for SSG-boosting and 7 minutes per iteration for DACTAL-boosting, each run sequentially.
Impact of boosting on topological accuracy for simulated datasets
We compared the accuracy and running time for various boosting techniques. We used MP-EST to produce the starting tree, and then ran five different iterations of DACTAL-boosting and SSG-boosting, using different subset sizes (from 15 to 22), and using different criteria (maximum pseudo-likelihood as computed by MP-EST or quartet support) to select the final tree.
Concatenation is expected to be less accurate than coalescent-based methods when there is substantial ILS, and this is what we observed in these experiments. Thus, with the exception of the 2X model condition (which had the least ILS), concatenation was less accurate than both MP-EST and boosted MP-EST. Interestingly, the improvement of concatenation over boosted MP-EST on the 2X model condition was not statistically significant (p = 0.33 and p = 0.4 for SSG- and DACTAL-based boosting, respectively). Also, on the moderate level of ILS (1X), concatenation and MP-EST had very close performance, but boosted MP-EST was more accurate than concatenation. However, the differences between boosted MP-EST and concatenation were not statistically significant (p = 0.08 and p = 0.11 for DACTAL and SSG-based boosting respectively).
The comparison between concatenation and (boosted) MP-EST is also interesting. For the 50-gene case, concatenation was more accurate than unboosted MP-EST, but DACTAL-boosted MP-EST matched the accuracy of concatenation, and SSG-boosted MP-EST was slightly more accurate than concatenation. For other cases (100-800 genes), the differences between concatenation and MP-EST were not statistically significant (p >0.05), but both SSG-boosted and DACTAL-boosted versions of MP-EST were more accurate than concatenation. Furthermore, the improvement of boosted MP-EST over concatenation was statistically significant for 400- and 800-gene cases (p = 0.02 and 0.008 for the 400- and 800-gene cases, respectively, for both SSG and DACTAL-based boosting).
Results on biological datasets
Amniota dataset. We analyzed data for 248 genes on 16 amniota species from Chiari et al. . Previous studies had placed turtles as the sister to birds and crocodiles (Archosaurs) [27–29]. Chiari et al.  used concatenation and MP-EST with multi-locus bootstrapping on two sets of gene trees - one based on amino acid (AA) and the other based on nucleotide (NT) alignments. Concatenation and MP-EST on the AA gene trees resolved the clade as (turtles,(birds, crocodiles)) (i.e., birds and crocodiles were considered sister taxa, consistent with the earlier studies) while MP-EST on the NT data produced (birds,(turtles,crocodiles)), and so contradicted the previous studies. Because the concatenation tree and the MP-EST(AA) tree agreed and were consistent with previous studies, the resolution with turtles as sister to birds and crocodiles was considered more likely to be correct.
Mammalian dataset. Song et al.  analyzed a dataset with 447 genes across 37 mammalian species using MP-EST and concatenation. In our analysis of this data we detected 21 genes with mislabelled sequences (incorrect taxon names, confirmed by the authors) which we removed from the dataset. We also identified two additional gene trees that were clearly topologically very different from all other gene trees, and removed these as well. We ran MP-EST on the 424 gene trees with SSG and DACTAL-based boosting using the MP-EST starting tree. All analyses we ran produced the same tree (see Fig. S9 in Additional file 1).
Pseudo-likelihood scores and quartet support values
Our analyses of the simulated and biological datasets showed that MP-EST always found trees with pseudo-likelihood scores that were at least as good as those found by any boosted MP-EST analysis, over all the iterations. In other words, the best pseudo-likelihood score was always found in the MP-EST starting tree. On the other hand, the best quartet support score was nearly always found in a subsequent iteration, for both types of boosting techniques. The first of these observations suggests that MP-EST is doing a reasonably good job of solving its optimization problem, since boosting is not improving its search. The second of the observations is also very interesting, since the boosting techniques are not explicitly designed to optimize quartet support, and we have no explanation for this trend.
Robustness to the starting trees
In the experiments shown so far, the starting tree was produced using MP-EST. We tested robustness to the starting tree by using MRP, a fast supertree technique, to compute a starting tree. However, unlike MP-EST, MRP has not been shown to be statistically consistent in the presence of ILS, and so is not likely to be as accurate as MP-EST
The following theorem is a direct corollary of Theorem 1 in .
Theorem 1: Let T be the true species tree, and let S 1 , S 2 ,..., S k be the subsets created by a DACTAL- or SSG-decomposition with T as the starting tree. Let t i be the true species tree on S i , i = 1, 2,..., k. Then the Strict Consensus Merger (and by extension also SuperFine+MRL), applied to the set t 1 , t 2 ,..., t k, will return the species tree T.
Comment: SuperFine+MRL has two steps: first it computes the Strict Consensus Merger (SCM), and then it resolves high degree nodes in the SCM tree using MRL. Therefore, if SCM produces a fully resolved tree, SuperFine+MRL returns the SCM tree.
Therefore, the following corollary can be easily proven:
Corollary 1: If the starting tree is computed using a method that is statistically consistent under the multi-species coalescent model, then the pipeline based on either the DACTAL or SSG decomposition is statistically consistent under the multi-species coalescent model.
The results shown in this study suggest that using iteration and divide-and-conquer (within the DACTAL-based and SSG-based decomposition techniques) improved the topological accuracy of MP-EST. Furthermore, the specific choice of dataset decomposition technique (DACTAL-based or SSG-based) had little impact on accuracy. The improvement obtained by selecting trees based on their quartet support scores instead of their maximum pseudo-likelihood scores is very interesting, and suggests the possibility that although both optimality criteria are statistically consistent ways of searching for species trees under the multi-species coalescent, the quartet support score might have better empirical performance than the pseudo-likelihood score, at least under some conditions.
While most of the analyses were based on using MP-EST to produce the starting tree, we also showed that using MRP (a supertree method) to produce the starting tree resulted in comparable accuracy after five iterations. Since MRP generally produced less accurate starting trees than MP-EST, this suggests that the boosting techniques are robust to the starting tree. Furthermore, MRP was very fast on these datasets, completing in just ten seconds. Thus, when used with MRP as a starting tree, the entire pipeline (computing the starting tree, running five iterations of DACTAL boosting, and selecting the final tree) completes in 35 minutes. By comparison, MP-EST run without boosting takes nearly 115 minutes (nearly two hours). Thus, boosting improves the speed of MP-EST. If we use SuperFine+MRL or SuperFine+MRP to compute the starting tree, then DACTAL-boosted MP-EST should be fast, even for large numbers of species, since computing the starting tree using SuperFine is typically very fast, even on large datasets . Furthermore, although we do not explore datasets with more than 37 species, the running times in Figure 1 suggest that MP-EST may be computationally infeasible for datasets with a few hundred species. By contrast, boosted versions of MP-EST are likely to scale close to linearly with the number of species, and are embarrassingly parallel. Thus, large-scale analyses of even several hundred species should be feasible using boosted MP-EST.
While the improvement in speed was expected, the improvement in accuracy was unexpected, and merits discussion. One possibility is that the performance we observed is mainly the result of some specific property of the simulation conditions we explored in this study, and that a larger study might show a difference in relative performance between boosted and unboosted MP-EST. However, both boosted versions of MP-EST gave more accurate results on the biological amniota dataset, and so that is not likely to be the answer. As noted, MP-EST is a heuristic for maximum pseudo-likelihood, and so another possible explanation is that MP-EST might have difficulty finding good solutions to its optimization problem on large datasets. However, the trees found by MP-EST had ML scores that were at least as good (and most often better) than the trees produced in any iteration by the boosted versions of MP-EST. Thus, this was clearly not the reason boosting improves MP-EST.
Instead, the data suggests that the boosting technique leads to trees with better quartet support scores, and that using quartet support scores to select the best species tree might be helping these boosted versions of MP-EST to produce more accurate trees. This hypothesis is supported by the fact that selecting the best tree based on the quartet support produced improved topological accuracy compared to selecting the best tree based on the pseudo-likelihood score, and that the quartet support optimization criterion is statistically consistent under the multi-species coalescent model .
MP-EST is one of the popular methods for estimating species trees from a collection of gene trees, and has statistical guarantees under the multi-species coalescent model. MP-EST is fast on small datasets (with not too many species) but its running time grows quickly with the number of species. We presented two iterative divide-and-conquer techniques (DACTAL-boosting and SSG-boosting) to use with MP-EST, with the goal of enabling MP-EST to analyze datasets with large numbers of species more efficiently. We tested these techniques on a collection of simulated and biological datasets, and showed that boosted versions of MP-EST were fast and highly accurate using these divide-and-conquer methods. The improvement in accuracy obtained by using these boosting techniques is not explained by any failure in MP-EST to optimize maximum likelihood effectively, but rather suggests the possibility that an alternative optimization criterion - quartet support - may be a highly effective approach to estimating species trees under the multi-species coalescent model.
Methods and commands
Gene tree estimation: RAxML version 7.3.5  was used to estimate gene trees under the GTRGAMMA model, using the following command:
raxmlHPC-SSE3 -m GTRGAMMA -s [input alignment] -n [output name] -N 20 -p [random seed number]
The following command was used for bootstrapping:
raxmlHPC-SSE3 -m GTRGAMMA -s [input alignment] -n [output name] -N 200 -p [random seed number] -b [random seed number]
Concatenation: For the concatenated analysis, we computed a parsimony starting tree using RAxML version 7.3.5, and then ran RAxML-light version 1.0.6. We used the following commands:
raxmlHPC-SSE3 -y -s supermatrix.phylip -m GTRGAMMA -n [output name] -p [random seed number]
raxmlLight-PTHREADS -T 4 -s supermatrix.phylip -m GTRGAMMA -n name -t [parsimony tree]
MP-EST: We used version 1.3 of MP-EST.
MRP: We created MRP matrices using a custom Java program, and solved MRP heuristically using the default approach available in PAUP* (v. 4. 0b10) . PAUP* generates an initial tree through random sequence addition and then performs Tree Bisection and Reconnection (TBR) moves until it reaches a local optimum. This process is repeated 1000 times, and at the end the most parsimonious tree is returned. When multiple trees are found with the same maximum parsimony score, the "extended majority consensus" of those trees is returned. See Additional file 1 for the PAUP* commands used to run MRP.
Decompositions and SuperFine: We used our custom scripts written in various languages (Perl, Python, C++ and Java) for SSG and DACTAL-based decomposition and SuperFine.
The authors thank the reviewers and also R. Davidson, S. Mirarab, and T. Zimmerman for helpful suggestions that improved the manuscript.
This work was partially supported by National Science Foundation grants DEB 0733029 and DBI 1062335 to TW.
This article has been published as part of BMC Genomics Volume 15 Supplement 6, 2014: Proceedings of the Twelfth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S6.
- Kubatko LS, Degnan JH: Inconsistency of phylogenetic estimates from concatenated data under coalescence. Systematic Biology. 2007, 56 (1): 17-24. 10.1080/10635150601146041.PubMedView ArticleGoogle Scholar
- Kingman JFC: The coalescent. Stochastic Processes and Their Applications. 1982, 13 (3): 235-248. 10.1016/0304-4149(82)90011-4.View ArticleGoogle Scholar
- Heled J, Drummond AJ: Bayesian inference of species trees from multilocus data. Molecular Biology and Evolution. 2010, 27 (3): 570-580. 10.1093/molbev/msp274.PubMedPubMed CentralView ArticleGoogle Scholar
- Larget B, Kotha SK, Dewey CN, An´e C: BUCKy: Gene tree/species tree reconciliation with the Bayesian concordance analysis. Bioinformatics. 2010, 26 (22): 2910-2911. 10.1093/bioinformatics/btq539.PubMedView ArticleGoogle Scholar
- Liu L, Yu L, Edwards SV: A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evolutionary Biology. 2010, 10 (1): 302-10.1186/1471-2148-10-302.PubMedPubMed CentralView ArticleGoogle Scholar
- Liu L: BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics. 2008, 24 (21): 2542-2543. 10.1093/bioinformatics/btn484.PubMedView ArticleGoogle Scholar
- Kubatko LS, Carstens BC, Knowles LL: STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinfomatics. 2009, 25 (7): 971-973. 10.1093/bioinformatics/btp079.View ArticleGoogle Scholar
- Liu L, Yu L, Pearl DK, Edwards SV: Estimating species phylogenies using coalescence times among sequences. Systematic Biology. 2009, 58 (5): 468-477. 10.1093/sysbio/syp031.PubMedView ArticleGoogle Scholar
- Mossel E, Roch S: Incomplete lineage sorting: consistent phylogeny estimation from multiple loci. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2011, 7 (1): 166-171.View ArticleGoogle Scholar
- Bayzid MS, Warnow T: Naive binning improves phylogenomic analyses. Bioinformatics. 2013, 29 (18): 2277-2284. 10.1093/bioinformatics/btt394.PubMedView ArticleGoogle Scholar
- Smith BT, Harvey MG, Faircloth BC, Glenn TC, Brumfield RT: Target capture and massively parallel sequencing of ultraconserved elements for comparative studies at shallow evolutionary time scales. Systematic Biology. 2014, 63 (1): 83-95. 10.1093/sysbio/syt061.PubMedView ArticleGoogle Scholar
- Song S, Liu L, Edwards SV, Wu S: Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proceedings of the National Academy of Sciences of the United States of America. 2012, 109 (37): 14942-7. 10.1073/pnas.1211733109.PubMedPubMed CentralView ArticleGoogle Scholar
- Chiari Y, Cahais V, Galtier N, Delsuc F: Phylogenomic analyses support the position of turtles as the sister group of birds and crocodiles (archosauria). BMC Biology. 2012, 10 (1): 65-10.1186/1741-7007-10-65.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhong B, Liu L, Yan Z, Penny D: Origin of land plants using the multispecies coalescent model. Trends in Plant Science. 2013, 18 (9): 492-495. 10.1016/j.tplants.2013.04.009.PubMedView ArticleGoogle Scholar
- Kumar V, Hallstr¨om BM, Janke A: Coalescent-based genome analyses resolve the early branches of the euarchontoglires. PLoS ONE. 2013, 8 (4): 60019-10.1371/journal.pone.0060019.View ArticleGoogle Scholar
- Nelesen S, Liu K, Wang LS, Linder CR, Warnow T: DACTAL: divide-and-conquer trees (almost) without alignments. Bioinformatics. 2012, 28 (12): 274-282. 10.1093/bioinformatics/bts218.View ArticleGoogle Scholar
- Roshan U, Moret BME, Williams TL, Warnow T: Rec-I-DCM3: A fast algorithmic technique for reconstructing large phylogenetic trees. In: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference. 2004, 98-109.Google Scholar
- Huson D, Nettles S, Warnow T: Disk-covering, a fast converging method for phylogenetic tree reconstruction. Journal of Computational Biology. 1999, 6 (3): 369-386.PubMedView ArticleGoogle Scholar
- Huson D, Vawter L, Warnow T: Solving large scale phylogenetic problems using DCM2. In: Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB'99). 1999, 118-129.Google Scholar
- Nakhleh L, Roshan U, James KS, Sun J, Warnow T: Designing fast converging phylogenetic methods. Bioinformatics. 2001, 17 (suppl 1): 190-198. 10.1093/bioinformatics/17.suppl_1.S190.View ArticleGoogle Scholar
- Ragan MA: Phylogenetic inference based on matrix representation of trees. Molecular Phylogenetics and Evolution. 1992, 1 (1): 53-58. 10.1016/1055-7903(92)90035-F.PubMedView ArticleGoogle Scholar
- Nguyen N, Mirarab S, Warnow T: MRL and SuperFine+MRL: new supertree methods. Algorithms for Molecular Biology. 2012, 7 (3):Google Scholar
- Warnow T, Moret BM, St John K: Absolute convergence: true trees from short sequences. Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 2001, 186-195.Google Scholar
- Golumbic MC: Algorithmic Graph Theory and Perfect Graphs. 2004, Elsevier B.V., Amsterdam, The NetherlandsGoogle Scholar
- Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson S, Warnow T: ASTRAL: Genome-scale coalescent-based species tree estimation. Bioinformatics. 2014, 30 (17): i541-i546. 10.1093/bioinformatics/btu462.PubMedPubMed CentralView ArticleGoogle Scholar
- Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006, 22 (21): 2688-2690. 10.1093/bioinformatics/btl446.PubMedView ArticleGoogle Scholar
- Hugall AF, Foster R, Lee MS: Calibration choice, rate smoothing, and the pattern of tetrapod diversification according to the long nuclear gene rag-1. Systematic Biology. 2007, 56 (4): 543-563. 10.1080/10635150701477825.PubMedView ArticleGoogle Scholar
- Iwabe N, Hara Y, Kumazawa Y, Shibamoto K, Saito Y, Miyata T, Katoh K: Sister group relationship of turtles to the bird-crocodilian clade revealed by nuclear DNA-coded proteins. Molecular Biology and Evolution. 2005, 22 (4): 810-813. 10.1093/molbev/msi075.PubMedView ArticleGoogle Scholar
- Zardoya R, Meyer A: Complete mitochondrial genome suggests diapsid affinities of turtles. Proceedings of the National Academy of Sciences. 1998, 95 (24): 14226-14231. 10.1073/pnas.95.24.14226.View ArticleGoogle Scholar
- Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006, 22 (21): 2688-2690. 10.1093/bioinformatics/btl446.PubMedView ArticleGoogle Scholar
- Swofford DL: PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). 2003, Version 4. Sinauer Associates, Sunderland, MassachusettsGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.