 Software
 Open Access
 Published:
STELAR: a statistically consistent coalescentbased species tree estimation method by maximizing triplet consistency
BMC Genomics volume 21, Article number: 136 (2020)
Abstract
Background
Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, estimating a species tree from a collection of gene trees can be complicated due to the presence of gene tree incongruence resulting from incomplete lineage sorting (ILS), which is modelled by the multispecies coalescent process. Maximum likelihood and Bayesian MCMC methods can potentially result in accurate trees, but they do not scale well to large datasets.
Results
We present STELAR (Species Tree Estimation by maximizing tripLet AgReement), a new fast and highly accurate statistically consistent coalescentbased method for estimating species trees from a collection of gene trees. We formalized the constrained triplet consensus (CTC) problem and showed that the solution to the CTC problem is a statistically consistent estimate of the species tree under the multispecies coalescent (MSC) model. STELAR is an efficient dynamic programming based solution to the CTC problem which is highly accurate and scalable. We evaluated the accuracy of STELAR in comparison with SuperTriplets, which is an alternate fast and highly accurate tripletbased supertree method, and with MPEST and ASTRAL – two of the most popular and accurate coalescentbased methods. Experimental results suggest that STELAR matches the accuracy of ASTRAL and improves on MPEST and SuperTriplets.
Conclusions
Theoretical and empirical results (on both simulated and real biological datasets) suggest that STELAR is a valuable technique for species tree estimation from gene tree distributions.
Background
Estimated species trees are useful in many biological analyses, but accurate estimation of species trees can be quite complicated. Species tree inference can potentially result in accurate evolutionary history using data from multiple loci. Therefore, due to the advent of modern sequencing technologies, it is increasingly common to infer trees by analyzing sequences from multiple loci. However, combining multilocus data is difficult, especially in the presence of gene tree discordance [1]. A traditional approach to species tree estimation is called concatenation (also known as combined analysis), which concatenates gene sequence alignments into a supergene matrix, and then estimates the species tree using a sequence based tree estimation technique (e.g., maximum parsimony, maximum likelihood etc.). Although it has been used in many biological analyses, concatenation can be problematic as it is agnostic to the topological differences among the gene trees, can be statistically inconsistent [2], and can return incorrect trees with high confidence [3–6].
Recent modeling and computational advances have produced methods that explicitly take the gene tree discordance into account while combining multilocus data to estimate species trees. Incomplete lineage sorting (ILS) (also known as deep coalescence) is one of the most prevalent reasons for gene tree incongruence [1], which is modelled by the MSC [7]. Rapid radiation is known to happen in many organisms (for example, birds) where ILS is likely to occur [8]. Due to this growing awareness that ILS can be present in many phylogenomic datasets, ILSaware species tree estimation techniques have gained substantial attention from systematists. These types of method are usually known as “summary methods” as they summarize a set of gene trees to estimate the species trees [9].
Several speciestree estimation methods have been proven to be statistically consistent under the multispecies coalescent model, meaning that they will return the true species tree with high probability given a sufficiently large number of true gene trees sampled from the distribution defined by the species tree. Methods that are statistically consistent under the MSC include ASTRAL [10], MPEST [11], *BEAST [12], NJst [13], BUCKy [14], GLASS [15], STEM [16], SNAPP [17], SVDquartets [18], STEAC [19] and ASTRID [20]. *BEAST, which is a Bayesian technique, can coestimate both gene trees and species trees, and simulation studies suggest that *BEAST can be considered the best of the coestimation methods with excellent accuracy, but it is computationally very intensive to run on even moderatelysized dataset [9, 21]. SuperTriplets [22] is a tripletbased supertree method which tries to find an asymmetric median supertree according to a triplet dissimilarity. SuperTriplets is very fast and accurate and was used in a number of important phylogenomic studies [23–28]. ASTRAL and MPEST are two of the most accurate and widely used summary methods that are much faster than *BEAST. ASTRAL has been shown to be faster and more accurate than MPEST and can handle large dataset containing hundreds of species which is not possible by MPEST [10, 29].
ASTRAL finds a species tree that maximizes the number of consistent quartets between the species tree and the gene trees. MPEST maximizes a pseudolikelihood estimate utilizing the underlying triplet distribution of the gene trees. Quartet and triplet based methods are robust to the “anomaly zone” [30, 31] (a condition where the most probable gene tree topology may not be identical to the species tree topology) as there are no anomalous rooted threetaxon species trees [30, 32] and no anomalous unrooted fourtaxon species trees [31, 33].
We present STELAR (Species Tree Estimation by maximizing tripLet AgReement), a new coalescentbased method which finds a species tree that agrees with the largest number of triplets induced by the gene trees. STELAR is statistically consistent under the MSC model, fast (having a polynomial running time), and highly accurate – enabling genome wide phylogenomic analyses. We report, on an extensive experimental study, the performance of STELAR in comparison with ASTRAL, MPEST and SuperTriplets. These quartet and tripletbased methods are known to be highly accurate and reliable, and are being widely used. Our experiments showed that STELAR is as good as ASTRAL, and is better than MPEST and SuperTriplets in most of the model conditions. Thus, with the desirable property of being statistically consistent and reliable performance on challenging realistic model conditions, STELAR will be a useful tool for genomescale analyses of species phylogeny.
Implementation
Design of ASTRAL and other quartetbased methods [10, 34] and their proofs of being statistically consistent are based upon the fact that unrooted 4taxon species trees do not contain anomaly zone [31, 33]. We use similar design in STELAR, utilizing the fact that rooted 3taxon species trees do not contain anomaly zone [30, 32].
STELAR is provably statistically consistent under the MSC model but its statistical guarantee can fail under model conditions in which gene tree discordance is due to gene duplication and loss or horizontal gene transfer. Furthermore, MSC model requires representing each gene by a single tree. Therefore, MSC model does not allow recombination events inside a gene [32] and consequently the statistical guarantee can fail in the presence of recombination.
Definitions and notation
Let T be a binary rooted tree, leaflabelled by species set \(\mathcal {X}\) with n taxa. Each internal node u in a tree T divides the set of taxa present in the subtree rooted at u into two distinct sets. We call this a subtreebipartition, denoted by \(\mathcal {SBP}_{T}(u)\), which was originally defined in [35]. We denote by T_{u} the subtree under node u of tree T. We denote the leaves in T by L(T), an arbitrary set of three species \(\{a,b,c\} \subset \mathcal {X}\) by r, and a rooted topology on r by t_{r}. We use abc to show that b and c are siblings. Possible three topologies on {a,b,c} are: abc,bca,cab. The probability, under the MSC model, of observing an induced gene tree triplet that matches the species tree topology is higher, and the other two triplet topologies are less probable and equal to each other [30]. The triplet tree topology that appears more frequently than the two alternative topologies is called the dominant topology. \(\mathcal {TR}(T)\) denotes the set of \(\binom {n}{3}\) triplet topologies induced by the tree T. We denote by Tr the triplet tree topology obtained by restricting T to the three species in r.
Problem definition
A natural optimization problem for estimating a species tree from a collection of gene trees under triplet consistency would be first decomposing the input gene trees into their induced triplets, and then estimating a species tree by amalgamating the dominant triplets [36]. Building such a consensus tree from a set of triplets has been shown to be an NPhard problem, and exact and heuristic algorithms have been proposed [37, 38]. This approach considers only the dominant triplets which could be problematic when the frequencies of the alternate triplet topologies are very close to that of the dominant one, and when all the dominant triplets are not compatible with each other meaning that all of them do not agree with a single species tree. Moreover, this approach has to explicitly enumerate and consider \(3*\binom {n}{3}\) possible triplets, making it computationally expensive for larger values of n. An alternate approach would be to consider the relative frequencies of all the triplets, and infer a species tree that agrees with the largest number of triplets induced by the gene trees. Methods that do not explicitly decompose the gene trees into induced triplets and quartets are preferred over the ones that take a set of quartets or triplets as input, since the latter types of methods demand high computational time and memory, and thus are not scalable to large dataset [10, 34]. This is possibly one of the reasons why quartet amalgamation method like QFM [34], despite having excellent accuracy, is not as popular as ASTRAL. Taking these issues into consideration, we introduce a constrained version of the triplet consistency problem, which we call the Constrained Triplet Consensus (CTC) problem, and provide efficient exact and heuristic solutions to the CTC without having to decompose the gene trees into their induced triplets. We formalize the CTC problem as follows.
Input: a set \(\mathcal {G}\) of rooted gene trees on \(\mathcal {X}\) and a set \(\mathcal {SBP}\) of subtreebipartitions.
Output: a species tree ST on \(\mathcal {X}\) that maximizes the triplet agreement with \(\mathcal {G}\) and all its subtreebipartitions are in \(\mathcal {SBP}\).
Note that when \(\mathcal {SBP}\) is set to all possible subtreebipartitions on \(\mathcal {X}\), the triplet consistency problem will find the globally optimal solution (exact solution). Otherwise, when \(\mathcal {SBP}\) contains only the subtreebipartitions in the input gene trees, the search space will include only those species trees where every subtreebipartition is present in at least one of the gene trees in \(\mathcal {G}\), and we call this the constrained version or heuristic version.
Theorem 1
Given a set \(\mathcal {G}\) of true gene trees, solution to the CTC problem (both exact and constrained version) is a statistically consistent estimator of the species tree topology under the MSC model.
Proof
Let \(\mathcal {G}\) be a set of sufficiently large number of true gene trees that has evolved within the model species tree T under the multispecies coalescent model. We know that a rooted 3taxon species tree does not have any anomaly zone [32]. So, as we increase the number of gene trees, each triplet topology induced by the species tree will have more frequency in \(\mathcal {G}\) than its alternatives, with probability that approaches 1. Let \(w_{\mathcal {G}}(t_{r})\) be the number of trees in \(\mathcal {G}\) that agree with t_{r}. Hence, for every triplet r and every possible tree \(T\prime, w_{\mathcal {G}}(Tr) \geq w_{\mathcal {G}}(T\prime r)\) with probability that approaches 1 for a large enough number of gene trees. Therefore, if \(\mathcal {TR}\) is the set of all possible triplets on \(\mathcal {X}\), then
By definition, the exact solution to the CTC problem will find a tree that will maximize the triplet agreement with \(\mathcal {G}\). Therefore, for a sufficiently large number of true gene trees, the optimal tree under CTC will be identical to the true species tree T with probability approaching one.
We now show that the solution to the constrained version is also statistically consistent. If we increase the number of gene trees, with probability approaching 1, at least one of the gene tree topologies in \(\mathcal {G}\) will be identical to the true species tree topology. Therefore, with probability that tends to 1, the set \(\mathcal {SBP}\) will contain all the subtree bipartitions from the true species tree, and therefore the solution to the CTC problem will be identical to the true species tree T. □
Algorithmic design of STELAR
We propose a dynamic programming (DP) solution to the CTC problem. The algorithmic design (and corresponding theoretical results) in STELAR is structurally similar to ASTRAL. This sort of DPbased approach, which implicitly finds a maximum or minimum clique in a graph modelled from the input gene trees, was first used by [39] and later was used in Phylonet [40, 41], DynaDup [35, 42] and ASTRAL. The key idea in STELAR is to find an equation for computing the number of triplets in the gene trees that agree with a given subtree in the species tree, which ultimately enables us to design a dynamic programming algorithm for estimating a species tree by maximizing the triplet agreement.
We call a triplet t_{r}=abc to be mapped to a subtreebipartition at an internal node x in T (\(\mathcal {SBP}(T_{x})=X_{1}X_{2}\)) when {a,b,c}⊆L(T_{x}), and the topology of t_{r} is compatible (consistent) with T_{x} (see Fig. 1). Each triplet mapped to a subtreebipartition (X_{1}X_{2}) will have either two leaves from X_{1} and one leaf from X_{2} or two leaves from X_{2} and one leaf from X_{1}. Therefore, we have the following results.
Lemma 1
Total number of triplets mapped to a subtreebipartition x=(X_{1}X_{2}) of a rooted binary tree T, where X_{1}=n_{1} and X_{2}=n_{2} is:
Lemma 2
For two subtreebipartitions x=(X_{1}X_{2}) and y=(Y_{1}Y_{2}) where \((X_{1} \cup X_{2}) \subseteq \mathcal {X}\) and \((Y_{1} \cup Y_{2}) \subseteq \mathcal {X}\), the number of triplets mapped to both x and y is given by:
Proof
We can obtain two possible subtree bipartitions by taking pairwise intersection: z_{1}=(X_{1}∩Y_{1}X_{2}∩Y_{2}) and z_{2}=(X_{1}∩Y_{2}X_{2}∩Y_{1}). Note that each leaf in z_{i} (i∈1,2) is also a leaf in both x and y, and so each triplet mapped to z_{i} is also mapped to x and y and the number of mapped triplets can be computed by Eq. 1.
We now prove that Eq. 2 counts a mapped triplet only once. Without loss of generality, let t_{r}=abc be mapped to both x and y where a,b∈X_{1},c∈X_{2},a,b∈Y_{1} and c∈Y_{2}. Equation 2 will count t_{r} for z_{1}=(X_{1}∩Y_{1}X_{2}∩Y_{2}) only as z_{2} will have no element from {a,b,c}. Thus the lemma follows. □
We now see how to compute the triplet consistency (TC) score of a species tree ST with respect to a set \(\mathcal {G}\) of gene trees, which we denote by \(TC_{\mathcal {G}}(ST)\). If \(\mathcal {SBP}(T)\) is the set of subtreebipartitions in T, for a subtreebipartition x and a set of rooted binary trees \(\mathcal {G}\), the total number of triplets in \(\mathcal {G}\) that are mapped to x is given by:
Thus, the TC score of an estimated species tree ST is obtained by,
Therefore, we can design a dynamic programming approach which starts with the whole species set \(\mathcal {X}\) and recursively divides it into smaller subsets and picks the partition that maximizes the triplet consistency. At each level, we compute V(A) which gives the score of an optimal subtree on leaf set \(A \subseteq \mathcal {X}\). At the end level, we have computed \(V(\mathcal {X})\) and sufficient information so that backtracking yields the corresponding subtreebipartitions in the optimal species tree.
Dynamic programming
Base case: A=1;V(A)=0Recursive relation: \(V(A) = \max _{(A^{\prime }\vert A  A^{\prime }) \in \mathcal {X}}\{V(A^{\prime }) + V(A  A^{\prime }) + TC_{\mathcal {G}}(A^{\prime }\vert A  A^{\prime })\}\), where \( A^{\prime }A A^{\prime } \in \mathcal {SBP}\).
Running time analysis
Lemma 3
Given a set \(\mathcal {G}\) of k gene trees on n taxa and a subtreebipartition x, computing the triplet consistency score \(TC_{\mathcal {G}}(x)\) takes O(n^{2}k) time.
Proof
A gene tree with n taxa contains O(n) subtreebipartitions. Thus, for a set of k input gene trees, there will be O(nk) subtreebipartitions. To find the triplet consistency score \(TC_{\mathcal {G}}(x)\) of a subtreebipartition x using Eq. 3, we need to score \(\mathcal {M}(x,y)\) for all subtreebipartions y present in the input gene trees. Since we represent each subtreebipartion by a bitset, calculating \(\mathcal {M}(x,y)\) (using Eq. 2) takes O(n) time. Since there are O(nk) subtreebipartions in \(\mathcal {G}\), calculating \(TC_{\mathcal {G}}(x)\) takes O(n^{2}k) time. □
Theorem 2
For a given set \(\mathcal {G}\) of k gene trees on n taxa and a set \(\mathcal {SBP}\) of subtreebipartitions, STELAR runs in \(O\left (n^{2}k\vert \mathcal {SBP}\vert ^{2}\right)\) time.
Proof
For a specific cluster A, V(A) can be computed in \(O\left (\mathcal {SBP}*n^{2}k\right)\) time since at worst we need to look at every subtreebipartition in \(\mathcal {SBP}\), and computing \(TC_{\mathcal {G}}(x)\) takes O(n^{2}k) time. Note that this is a conservative analysis since the number of subtreebipartitions that the DP algorithm needs to consider for computing V(A) for a specific cluster A is, in reality, much less than \(\mathcal {SBP}\) (especially for A where \(A < \mathcal {X}\)). There will be \(O(\mathcal {SBP})\) possible clusters for which we need to compute V(A). Thus, the running time of STELAR is \(O\left (n^{2}k\mathcal {SBP}^{2}\right)\). □
Software implementation
STELAR is a java based crossplatform application. STELAR code uses code from the PhyloNet package by Luay Nakhleh (bioinfo.cs.rice.edu/phylonet). The Phylonet code base was previously used by DynaDup [35, 42] and ASTRAL (with permission from the authors), and STELAR is mostly based on the DynaDup code base.
Experimental studies
We compared STELAR with SuperTriplets, which is a very fast and highly accurate triplet based supertree technique and has been used in several important phylogenomic studies [22–27]. We also evaluated STELAR in comparison with ASTRALIII [43] and MPEST, two of the most widely used and accurate coalescentbased methods. ASTRAL has already been compared and shown to outperform many of the existing species tree estimation methods (e.g., BUCKypop, MRP, greedy consensus, CAML, etc.), especially in the presence of moderate to high levels of ILS. So we did not include those methods in this study. Most of the datasets analyzed in this study are prohibitively large for *BEAST to analyze. We ran MPEST with 10 random starting points and selected the species tree with the highest pseudolikelihood value.
We used previously studied simulated and biological datasets to evaluate the performance of these methods. We compared the estimated species trees to the model species tree (for the simulated datasets) or to the scientific literature (for the biological datasets), to evaluate accuracy. We used normalized RobinsonFoulds distance (RF rate) [44] to measure the tree error. The RF distance between two trees is the sum of the bipartitions (splits) induced by one tree but not by the other, and vice versa. All the trees estimated by ASTRALIII, STELAR and MPEST in this study are binary and so RF rate, False Positive (FP) rate and False Negative (FN) rate are identical. However, the trees estimated by SuperTriplets may produce multifurcations (unresolved trees). We performed Wilcoxon signedrank test (with α=0.05) to measure the statistical significance of the differences between two methods.
Datasets
We studied four collections of simulated datasets: one based on a biological dataset (37taxon mammalian dataset) that was generated and subsequently analyzed in prior studies [10, 29, 45], and three other simulated datasets from [45], [46], and [47]. Table 1 presents a summary of the datasets analyzed in this paper, showing various model conditions with varying numbers of taxa (11 to 37), ILS levels (reflected in the average topological distance between true gene trees and true species tree) and gene sequence lengths. All datasets consist of gene sequence alignments generated under a multistage simulation process that begins with a species tree, simulates gene trees down the species tree under the multispecies coalescent model (and so can differ topologically from the species tree), and then simulates gene sequence alignments down the gene trees.
Mammalian dataset was simulated by taking the species tree estimated by MPEST on the biological dataset studied in [48]. This species tree had branch lengths in coalescent units, which we used to produce a set of gene trees under the coalescent model. Thus, the mammalian simulation model tree has an ILS level based on a coalescent analysis of the biological mammalian dataset, and other properties of the simulation that are set to reflect the biological sequences [48] studied. We explored the impact of varying amounts of ILS, varying numbers of genes (25 ∼800), the impact of phylogenetic signal by varying the sequence length (250, 500, and 1000 base pair (bp)) for the markers. Three levels of ILS were simulated. The basic model species tree has branch lengths in coalescent units, and we produced other model species trees by multiplying or dividing all internal branch lengths in the model species tree by two. This rescaling varies the amount of ILS (shorter branches have more ILS), and also impacts the amount of gene tree estimation error in the estimated gene trees. The basic model condition with moderate amount of ILS is referred to as 1X and the model conditions with higher and lower amounts of ILS are denoted by 0.5X and 2X, respectively.
We used three other simulated datasets: 11taxon dataset (generated by [46] and subsequently studied by [9, 49]), 15taxon dataset (generated and studied by [45]), and 500taxon dataset (generated and studied by ASTRALII [47]). 11taxon dataset was simulated under a complex process to ensure substantial heterogeneity between genes and to deviate from the molecular clock [46]. We analyzed the model condition with high amount of ILS as the model condition with lower amount of ILS is very easy to analyze and even methods without any statistical guarantee produced highly accurate tree with just 25 ∼ 50 gene trees [9]. 15taxon datasets vary in sequence lengths and has high amount of ILS. The 500taxon dataset was generated under the MSC model by introducing rate heterogeneity between genes and deviations from molecular clock (see [47] for more details).
We used two biological datasets: the 37taxon mammalian datasets studied by [48] with 424 genes, and the Amniota dataset from [50] containing 16 species and 248 genes.
Results
Results on mammalian simulated dataset
We analyzed the performance of SuperTriplets, ASTRALIII, MPEST and STELAR on various model conditions with varying amounts of ILS, numbers of genes and lengths of the sequences. Figure 2a shows the average RF rates of three methods on three model conditions with varying amounts of ILS. For all methods, RF rates increase as we increase the amount of ILS. SuperTriplets produced significantly less accurate trees than the other methods, and the trees are not fully resolved. For the dataset with varying amounts of ILS, SuperTriplets produced trees with RF rates 10% ∼18%, whereas the error rates of STELAR, MPEST and ASTRALIII range from 4% ∼ 6%. STELAR and ASTRALIII are better than MPEST on all amounts of ILS and, in some cases, the differences are statistically significant (P<0.05). STELAR and ASTRAL are comparable in terms of tree accuracy and no statistically significant difference was noticed.
Figure 2c shows the impact of the numbers of gene trees on the accuracy of the estimated species trees. STELAR, MPEST and ASTRALIII improved with increasing numbers of genes (this is expected as these methods are statistically consistent and thus increasing the numbers of genes has resulted into improved topological accuracy). However, SuperTriplets failed to leverage the increasing amount of data since it did not improve as we increased the numbers of genes. Moreover, even with 800 genes, SuperTriplets produced unresolved trees. STELAR and ASTRALIII achieved similar accuracy and performed significantly better than MPEST and SuperTriplets across all model conditions.
Results on 11taxon dataset
We analyzed both the estimated and true gene trees. In both cases, we varied the number of genes from (5 ∼ 100). Figure 3 shows the results on 11taxon dataset for both estimated and true gene trees. Similar to the mammalian dataset, SuperTriplets produced the least accurate trees. As expected, all methods improved with increased numbers of genes (except for a few cases for SuperTriplets), and had their best accuracies on true gene trees. The accuracies of ASTRALIII, MPEST and STELAR are almost similar with no statistically significant difference. On true gene trees, ASTRAL, MPEST and STELAR recovered the true species trees across all the replicates with 50 or higher numbers of genes. However, the accuracy of SuperTriplets did not achieve any notable improvement as we increased the number of genes beyond 25. Unlike mammalian dataset, MPEST was slightly better than ASTRALIII and STELAR on a couple of model conditions (albeit the differences are not statistically significant with Pvalues greater than 0.05).
Results on 15taxon dataset
15taxon dataset evolved under a caterpillar model species tree with very high level of ILS [45]. We explored the performance on varying sequence lengths (100 bp and 1000 bp), and numbers of genes (100 and 1000). Unlike other two dataset, SuperTriplets returned well resolved trees without multifurcations and matched the accuracy of other methods on this particular dataset. All the methods improved as the the number of genes and sequence length increased, and had their best accuracy (RF rate =0) on true gene trees (Fig. 4). No statistically significant difference was observed between STELAR and ASTRALIII. MPEST consistently had the highest error rates across all the model conditions on estimated gene trees. On true gene trees, all the methods returned the true species tree on almost all the replicates.
Results on 500taxon dataset
In order to assess the performance of STELAR on relatively larger datasets, we analyzed the 500taxon dataset [47] with 1000 genes. We analyzed 10 replicates of data. Some of the replicates had a few nonbinary gene trees which we discarded in order to run STELAR. After removing the nonbinary gene trees, the numbers of genes in the replicates range from 993∼1000. MPEST is computationally too demanding to run on this large dataset. Figure 5 shows the error rates of STELAR, ASTRALIII and SuperTriplets. Both STELAR and ASTRAL produced competitive results with no statistically significant difference. However, similar to the other dataset analyzed in this study, SuperTriplets was much worse than ASTRALIII and STELAR.
Results on biological datasets
Amniota dataset: We analyzed the Amniota dataset from [50] containing 248 genes across 16 amniota taxa in order to resolve the position of turtles within the amniotes, especially relative to birds and crocodiles. We reanalyzed both the amino acid (AA) and DNA gene trees, obtained from the authors, using ASTRAL, MPEST, STELAR and SuperTriplets. The sister relationship of crocodiles and birds (forming Archosaurs), and placement of the turtles as the sister clade to Archosaurs are considered reliable and were supported by previous studies [51–53].
For amino acid dataset, all the methods placed birds and crocodiles as sister groups (forming Archosaurs) and turtles as sister to Archosaurs clade (see Fig. 6). The unrooted version of STELAR estimated tree is identical to ASTRAL. MPEST also returned a highly similar tree except for different resolutions within the turtles. SuperTriplets produced an unresolved tree with several multifurcations, leaving bird/turtle/crocodile relationship uncertain. It also did not resolve the Squamates clade.
STELAR and MPEST on the DNA gene trees produced the same tree which resolved bird/turtle/crocodile relationship as (birds,(turtles,crocodiles)) and thus did not form Archosaurs. However, ASTRAL on the DNA gene trees resolved the relationships as (turtles,(birds,crocodiles)). Therefore, maximizing triplet agreement and quartet agreement both recovered the (birds,(turtles,crocodiles)) relationships on AA gene trees, but differed on the DNA gene trees. Similar to the AA dataset, SuperTriplets did not resolve the bird/turtle/crocodile relationship. We estimated the quartet scores and the triplet scores of these different tree topologies with respect to the corresponding gene trees (see Tables 2 and 3). As ASTRAL estimated trees need to be interpreted as unrooted trees, we root the ASTRAL trees using the correct outgroup (Protopterus annectens) in order to compute the triplet scores. As expected, ASTRAL achieved the highest quartet agreement scores and STELAR achieved the highest triplet agreement scores.
Mammalian dataset: We reanalyzed the mammalian dataset from [48] containing 447 genes across 37 mammals after removing 21 mislabeled genes (confirmed by the authors), and two other outlier genes. The placement of tree shrews (Tupaia belangeri) is of great importance here as previous studies found that the phylogenetic placement of tree shrews is unstable as they can be classified as a sister group to Primates or to Glires [48, 54–56]. Our analyses using ASTRAL, MPEST and STELAR recovered the same topology which placed tree shrews as sister to Glires (see Fig. 7). This is consistent to the tree estimated by CAML (reported in [48]). However, previous studies with MPEST, using multilocus bootstrapping [57], on the same dataset recovered a tree which placed tree shrews as the sister to Primates [10, 56]. SuperTriplets produced a highly similar tree except that it resulted into an unresolved tree with several multifurcations. It recovered the major clades such as Primates, Glires, Cetartiodactyla, Eulipotyphla, Chiroptera. However, SuperTriplets left the relationships within Primates/Glires/Scandentia unresolved. This is in line with the tree estimated by SuperTriplets on the OrthoMaM database [58], which was reported in [22]. We investigated the quartet and triplet scores of the two alternative hypotheses (regarding the placement of tree shrew) and found that the tree that suggests (tree shrew, Glires) clade has higher quartet and triplet scores (25526915 and 3042547, respectively) compared to the one that suggests (tree shrew, Primates) relationship which yields 25511503 and 3041747 quartet and triplet scores respectively. Therefore, our analyses strongly support the placement of tree shrews as sister to Glires.
Running time
We performed the experiments on small to moderatescale dataset (11 ∼ 37 taxa) on a Linux machine with 7.5 GiB RAM and an i5 1.6 GHz processor. For the 500taxon dataset, we used a Linux machine with 15.6 GiB RAM and an i7 3.60 GHz processor. Our experiments show that SuperTriplets is usually significantly faster than the other methods, and MPEST is computationally more demanding than the others. For 11, 15, and 37taxon datasets, SuperTriplets took a fraction of a second to run on a single replicate. STELAR and ASTRALIII are substantially faster than MPEST. The running time of MPEST is more sensitive to the number of species than the number of genes, and grows rapidly as we increase the number of taxa [29]. On 37taxon simulated dataset with 200 genes (moderate ILS), SuperTriplets took around half a second, both ASTRALIII and STELAR took around 4 s whereas MPEST (with 10 random starting points) took 3300 s (55 min). Running time of all these methods (except for SuperTriplets) increased with increasing amounts of ILS. On 37taxon, 0.5X dataset (high ILS) with 200 genes, ASTRALIII took around 10 s, STELAR took 15 s and MPEST took 4800 s (80 min). This is because the numbers of distinct quartets and triplets in the gene trees increase as we increase the amount of discordance among the gene trees. When we increased the numbers of genes from 200 to 800 in 1X model condition, the running times of ASTRALIII and STELAR increased from 4 s to 10 and 12 s respectively, whereas SuperTriplets took only 0.8 s. MPEST took about 55 min and thus was not affected by the increased number of genes. On 11taxon dataset, we ran the exact versions of ASTRALIII and STELAR, and yet the running time is much less than MPEST. Exact versions of ASTRALIII and STELAR finished in less than a second on 5∼50 genes and less than two seconds on 50∼100 genes, whereas the running times of MPEST range from 13 to 20 s. SuperTriplets took only 0.1 s irrespective of the numbers of genes. For 15taxon dataset, MPEST took 90∼100 s. ASTRALIII and STELAR finished in less than 2 s on 100 gene dataset, and for 1000 genes they took about 50 and 80 s, respectively. The running time of SuperTriplets did not vary much with the change in the number of genes and finished in only around 0.4 s.
There is no significant difference in running time between ASTRALIII and STELAR on relatively smaller datasets. However, on the larger dataset with 500 taxa, ASTRALIII is much faster than STELAR. ASTRALIII took around 18 min whereas STELAR took about 2 h. Interestingly, SuperTriplets – which is much faster than ASTRALIII and STELAR on relatively smaller datasets – is not the fastest on 500taxon dataset, taking around 26 min. ASTRAL has evolved over the years (ASTRAL [10], ASTRALII [47], ASTRALIII [43], etc.) with efficient heuristics to explore the tree space and highly optimized implementation. Although STELAR is not as fast as ASTRAL at its current form, it is sufficiently fast to analyze large dataset with hundreds of taxa. We will continue to improve STELAR to make it fast enough to run on ultra large dataset with thousands of taxa.
Discussion
Extensive experimental studies on simulated dataset with varying model conditions suggest that STELAR, which is an efficient DPbased algorithm for solving the triplet consistency problem, can produce substantially better trees than SuperTriplets (which is a tripletbased supertree method and tries to find a median supertree with respect to a triplet dissimilarity measure), and MPEST (which tries to maximize a pseudolikelihood measure based on the triplet distribution of the gene trees). Thus, within the scope of the experiments conducted in this study, maximum triplet agreement has been shown to be a better optimization criterion than the triplet frequency based maximum pseudolikelihood used in MPEST, and the triplet dissimilarity based maximum parsimony criterion used in SuperTriplets, and as good as the maximum quartet agreement criterion used in ASTRAL. Experiments on Amniota and mammalian biological dataset suggest that STELAR can produce reliable trees on real biological dataset.
Most of the existing methods have assumed that the gene tree discordance is due to only a particular biological reason, for example, ILS. Therefore, both orthologs and paralogs are not being included in many studies such as 1KP project [59], where many genes were discarded that evolved with duplications and losses. Furthermore, existing methods try to optimize a single optimization criterion (e.g., quartet consistency [10], and pseudolikelihood [11]). Although these criteria are statistically consistent, the performance of various methods may vary on practical model conditions with limited numbers of genes and in the presence of gene tree estimation error. These factors underscore the need for modeling species tree estimation as a multiobjective optimization problem, where multiple explanations for observed data can be conjectured, and multiple optimization criteria are simultaneously optimized. Therefore, developing new scalable and statistically consistent estimators of species trees is of great importance. We believe this study advances the stateoftheart in summary methods, and STELAR would be considered as a useful tool in phylogenomics.
STELAR can consistently estimate the species tree topology and is fast enough to handle large numbers of genes and species, making it suitable for large scale phylogenomic analyses. However, STELAR can be expanded in several directions. Future work will need to investigate how the estimation of branch lengths (in coalescent unit) can be incorporated in STELAR. STELAR, at this stage, cannot handle unrooted and nonbinary gene trees. The CTC problem for unrooted gene trees can be formulated as follows. The input is a set \(\mathcal {G} = \{gt_{1}, gt_{2}, gt_{3}, \ldots, gt_{k}\}\) of k unrooted gene trees, and the output is a species tree ST that maximizes the triplet consistency score with respect to a set \(\mathcal {G'} = \{gt'_{1}, gt'_{2}, gt'_{3}, \ldots, gt'_{k}\}\) where \(gt^{\prime }_{i}\) is a rooted version of gt_{i}. That means, the idea is to find the optimal rooted versions of the unrooted gene trees so that the triplet agreement is maximized. Similar approach was previously applied [41] for estimating species trees from a set of unrooted gene trees by minimizing “extra lineages” resulting from deep coalescence. Similarly, the problem can be extended for nonbinary gene trees as well. Thus, future works on formalizing the CTC problem for unrooted and nonbinary gene trees and developing appropriate algorithms will be of great importance. The current version of STELAR is reasonably fast and can handle hundreds of taxa. However, more efficient algorithms and implementation are required to make STELAR scalable to much larger dataset with thousands of taxa. Furthermore, while we have studied various practical and challenging model conditions with varying numbers of genes, and varying amounts of ILS, and gene tree estimation error, future work should investigate the relative performance of STELAR and the widely used existing methods on various other challenging model conditions, for example, model conditions with missing taxa (i.e., incomplete gene trees). This study investigated relatively long sequences (250 ∼2000 bp); subsequent studies should investigate the relative performance of STELAR and other methods on very short sequences, since recombinationfree loci can be very short [60].
Conclusions
We developed STELAR – a method for estimating species trees from a set of estimated gene trees which is statistically consistent under the multispecies coalescent model. We formalized the constrained triplet consensus (CTC) problem and showed that the solution to the CTC problem is a statistically consistent estimator of the species tree under the MSC model. STELAR is a dynamic programming solution to the CTC problem. STELAR has an exact version (which provides the exact solution and takes exponential amount of time) as well as an efficient polynomial time heuristic version. Considering the high accuracy and scalability, we believe STELAR will be considered as a potential tool for constructing species phylogenies from genes sampled throughout the whole genome.
Availability and requirements
Project name: STELARProject home page: https://islamazhar.github.io/STELAR/Operating system(s): Platform independentProgramming language: JavaOther requirements: Java 9.0 or higherLicense: Apache License 2.0Any restrictions to use by nonacademics: Not applicable.
Availability of data and materials
STELAR is freely available as an open source software at https://islamazhar.github.io/STELAR/. All the dataset analyzed in this paper are from previously published studies and are publicly available.
Abbreviations
 CTC:

Constrained triplet consensus
 DP:

Dynamic programming
 ILS:

Incomplete lineage sorting
 MSC:

Multispecies coalescent
 SBP:

Subtreebipartition
References
 1
Maddison WP. Gene trees in species trees. Syst Biol. 1997; 46:523–36.
 2
Roch S, Steel M. Likelihoodbased tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor Popul Biol. 2015; 100:56–62.
 3
Kubatko LS, Degnan JH. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst Biol. 2007; 56:17.
 4
Edwards SV, Liu L, Pearl DK. Highresolution species trees without concatenation. Proc Nat Acad Sci. 2007; 104(14):5936–41.
 5
Leaché AD, Rannala B. The accuracy of species tree estimation under simulation: a comparisonof methods. Syst Biol. 2011; 60(2):126–37.
 6
DeGiorgio M, Degnan JH. Fast and consistent estimation of species trees using supermatrix rooted triples. Mol Biol Evol. 2009; 27(3):552–69.
 7
Kingman JFC. The coalescent. Stoch Proc Appl. 1982; 13:235–48.
 8
Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, Ho SY, Faircloth BC, Nabholz B, Howard JT, et al.Wholegenome analyses resolve early branches in the tree of life of modern birds. Science. 2014; 346(6215):1320–31.
 9
Bayzid MS, Warnow T. Naive binning improves phylogenomic analyses. Bioinformatics. 2013; 29(18):2277–84.
 10
Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. ASTRAL: genomescale coalescentbased species tree estimation. Bioinformatics. 2014; 30(17):541–8.
 11
Liu L, Yu L, Edwards SV. A maximum pseudolikelihood approach for estimating species trees under the coalescent model. BMC Evol Biol. 2010; 10(1):302. https://doi.org/10.1186/1471214810302.
 12
Heled J., Drummond A. J.Bayesian inference of species trees from multilocus data. Mol Biol Evol. 2010; 27:570–80.
 13
Liu L, Yu L. Estimating species trees from unrooted gene trees. Syst Biol. 2011; 60(5):661–7. https://doi.org/10.1093/sysbio/syr027.
 14
Larget B, Kotha SK, Dewey CN, Ané C. BUCKy: Gene tree/species tree reconciliation with the Bayesian concordance analysis. Bioinf. 2010; 26(22):2910–1.
 15
Mossel E, Roch S. Incomplete lineage sorting: consistent phylogeny estimation from multiple loci. IEEE Comp Biol Bioinform. 2011; 7(1):166–71.
 16
Kubatko LS, Carstens BC, Knowles LL. Stem: Species tree estimation using maximum likelihood for gene trees under coalescence. Bioinf. 2009; 25:971–3.
 17
Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, RoyChoudhury A. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Mol Biol Evol. 2012; 29(8):1917–32.
 18
Chifman J, Kubatko L. Quartet inference from snp data under the coalescent model. Bioinformatics. 2014; 30(23):3317–24.
 19
Liu L, Yu L, Pearl DK, Edwards SV. Estimating species phylogenies using coalescence times among sequences. Syst Biol. 2009; 58(5):468–77.
 20
Vachaspati P, Warnow T. Astrid: accurate species trees from internode distances. BMC Genomics. 2015; 16(10):3.
 21
Smith BT, Harvey MG, Faircloth BC, Glenn TC, Brumfield RT. Target capture and massively parallel sequencing of ultraconserved elements for comparative studies at shallow evolutionary time scales. Syst Biol. 2013; 63(1):83–95.
 22
Ranwez V, Criscuolo A, Douzery EJ. Supertriplets: a tripletbased supertree approach to phylogenomics. Bioinformatics. 2010; 26(12):115–23.
 23
Romiguier J, Ranwez V, Delsuc F, Galtier N, Douzery EJ. Less is more in mammalian phylogenomics: Atrich genes minimize tree conflicts and unravel the root of placental mammals. Mol Biol Evol. 2013; 30(9):2134–44.
 24
Cornillot E, HadjKaddour K, Dassouli A, Noel B, Ranwez V, Vacherie B, Augagneur Y, Bres V, Duclos A, Randazzo S, et al.Sequencing of the smallest apicomplexan genome from the human pathogen babesia microti. Nucleic Acids Res. 2012; 40(18):9102–14.
 25
Romiguier J, Cameron SA, Woodard SH, Fischman BJ, Keller L, Praz CJ. Phylogenomics controlling for base compositional bias reveals a single origin of eusociality in corbiculate bees. Mol Biol Evol. 2015; 33(3):670–8.
 26
Scornavacca C, Galtier N. Incomplete lineage sorting in mammalian phylogenomics. Syst Biol. 2017; 66(1):112–20.
 27
Mavrodiev EV, MartínezAzorín M, Dranishnikov P, Crespo MB. At least 23 genera instead of one: The case of iris l. sl (iridaceae). PLoS One. 2014; 9(8):106459.
 28
Douzery EJ, Scornavacca C, Romiguier J, Belkhir K, Galtier N, Delsuc F, Ranwez V. Orthomam v8: a database of orthologous exons and coding sequences for comparative genomics in mammals. Mol Biol Evol. 2014; 31(7):1923–8.
 29
Bayzid MS, Hunt T, Warnow T. Disk covering methods improve phylogenomic analyses. BMC genomics. 2014; 15(6):7.
 30
Degnan JH, Rosenberg NA. Discordance of species trees with their most likely gene trees. PLoS Genet. 2006; 2:762–8.
 31
Degnan JH. Anomalous unrooted gene trees. Syst Biol. 2013; 62(4):574–90.
 32
Degnan JH, Rosenberg NA. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol Evol. 2009; 24(6):332–340. https://doi.org/10.1016/j.tree.2009.01.009.
 33
Allman ES, Degnan JH, Rhodes JA. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. J Math Biol. 2011; 62(6):833–62.
 34
Reaz R, Bayzid MS, Rahman MS. Accurate phylogenetic tree reconstruction from quartets: A heuristic approach. PLoS One. 2014; 9(8):104008.
 35
Bayzid MS, Mirarab S, Warnow T. Inferring optimal species trees under gene duplication and loss. In: Proc. of Pacific Symposium on Biocomputing (PSB). Singapore: World Scientific: 2013. p. 250–61.
 36
Ewing G. B., Ebersberger I., Schmidt H. A., Von Haeseler A.Rooted triple consensus and anomalous gene trees. BMC Evol Biol. 2008; 8(1):118.
 37
Wu BY. Constructing the maximum consensus tree from rooted triples. J Comb Optim. 2004; 8(1):29–39.
 38
Byrka J, Guillemot S, Jansson J. New results on optimizing rooted triplets consistency. Discret Appl Math. 2010; 158(11):1136–47.
 39
Hallett MT, Lagergren J. New algorithms for the duplicationloss model. In: Proc ACM Symp Comput Biol RECOMB2000. New York: ACM Press: 2000. p. 138–146.
 40
Than CV, Nakhleh L. Species tree inference by minimizing deep coalescences. PLoS Comp Biol. 2009; 5(9):e1000501. https://doi.org/10.1371/journal.pcbi.1000501.
 41
Yu Y., Warnow T., Nakhleh L.Algorithms for mdcbased multilocus phylogeny inference: beyond rooted binary gene trees on single alleles. Journal of Computational Biology. 2011; 18(11):1543–1559.
 42
Bayzid MS, Warnow T. Gene tree parsimony for incomplete gene trees: addressing true biological loss. Algoritm Mol Biol. 2018; 13:1.
 43
Zhang C, Rabiee M, Sayyari E, Mirarab S. Astraliii: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinforma. 2018; 19(6):153.
 44
Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981; 53(12):131–47.
 45
Mirarab S, Bayzid MS, Boussau B, Warnow T. Statistical binning enables an accurate coalescentbased estimation of the avian tree. Science. 2014; 346(6215):1250463.
 46
Chung Y, Ané C. Comparing two Bayesian methods for gene tree/species tree reconstruction: A simulation with incomplete lineage sorting and horizontal gene transfer. Syst Biol. 2011; 60(3):261–75.
 47
Mirarab S, Warnow T. Astralii: coalescentbased species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics. 2015; 31(12):44–52.
 48
Song S, Liu L, Edwards SV, Wu S. Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc Nat Acad Sci. 2012; 109(37):14942–7.
 49
Bayzid MS, Warnow T. Estimating optimal species trees from incomplete gene trees under deep coalescence. J Comput Biol. 2012; 19(6):591–605.
 50
Chiari Y, Cahais V, Galtier N, Delsuc F. Phylogenomic analyses support the position of turtles as the sister group of birds and crocodiles (archosauria). Bmc Biol. 2012; 10(1):65.
 51
Zardoya R, Meyer A. Complete mitochondrial genome suggests diapsid affinities of turtles. Proc Nat Acad Sci. 1998; 95(24):14226–31.
 52
Hugall AF, Foster R, Lee MS. Calibration choice, rate smoothing, and the pattern of tetrapod diversification according to the long nuclear gene rag1. Syst Biol. 2007; 56(4):543–63.
 53
Iwabe N, Hara Y, Kumazawa Y, Shibamoto K, Saito Y, Miyata T, Katoh K. Sister group relationship of turtles to the birdcrocodilian clade revealed by nuclear dna–coded proteins. Mol Biol Evol. 2004; 22(4):810–3.
 54
Boussau B, Szöllősi GJ, Duret L, Gouy M, Tannier E, Daubin V. Genomescale coestimation of species and gene trees. Genome Res. 2013; 23(2):323–30.
 55
Kumar V, Hallström BM, Janke A. Coalescentbased genome analyses resolve the early branches of the euarchontoglires. PLoS One. 2013; 8(4):60019.
 56
Mirarab S, Bayzid MS, Warnow T. Evaluating summary methods for multilocus species tree estimation in the presence of incomplete lineage sorting. Syst Biol. 2014; 65(3):366–80.
 57
Seo TK. Calculating bootstrap probabilities of phylogeny using multilocus sequence data. Mol Biol Evol. 2008; 25(5):960–71.
 58
Ranwez V, Delsuc F, Ranwez S, Belkhir K, Tilak MK, Douzery EJ. Orthomam: a database of orthologous genomic markers for placental mammal phylogenetics. BMC Evol Biol. 2007; 7(1):241.
 59
Wickett NJ, Mirarab S, Nguyen N, Warnow T, Carpenter E, Matasci N, Ayyampalayam S, Barker MS, Burleigh JG, Gitzendanner MA, et al. Phylotranscriptomic analysis of the origin and early diversification of land plants. Proc Nat Acad Sci. 2014; 111(45):4859–68.
 60
Gatesy J, Springer MS. Phylogenetic analysis at deep timescales: unreliable gene trees, bypassed hidden support, and the coalescence/concatalescence conundrum. Mol Phylogenet Evol. 2014; 80:231–66.
Acknowledgements
Not applicable.
Funding
The authors received no financial support for this research.
Author information
Affiliations
Contributions
MSB and RR conceived and designed the study; MI, KS and TD implemented STELAR; KS, TD, MI, MSB and RR proved the theoretical results; MI, KS and TD performed the experiments; MSB, KS, TD and MI wrote the first draft and all the authors took part in finalizing the manuscript. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Md. Shamsuzzoha Bayzid.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Islam, M., Sarker, K., Das, T. et al. STELAR: a statistically consistent coalescentbased species tree estimation method by maximizing triplet consistency. BMC Genomics 21, 136 (2020). https://doi.org/10.1186/s128640206519y
Received:
Accepted:
Published:
Keywords
 Phylogenomics
 Multispecies coalescent process
 Gene tree incongruence
 Incomplete lineage sorting